How MeridianOS Grades Its Own Work
Most geopolitical intelligence products sell analyst expertise and brand reputation. Neither is measurable. Neither is falsifiable. When a vendor's annual outlook misses the Arab Spring or the rise of ISIS, or spends years predicting an EU fragmentation that never materializes, there is no published scorecard to consult. The vendor's next annual outlook ships on schedule.
This brief explains why that matters, what a measurable alternative looks like, and how MeridianOS implements it.
The Problem with Ungraded Intelligence
In 2018, economist Dragoș Negrea published the only peer-reviewed independent audit of a major geopolitical intelligence firm's predictive accuracy. The subject was Stratfor; the window was 1995 to 2025; the findings were unflattering. Stratfor systematically missed the Arab Spring, underestimated the durability of the Iranian clerical state, failed to anticipate the formation of ISIS, and spent years predicting EU fragmentation that did not occur on its projected timeline. The audit used Stratfor's own published forecasts as the test set.
No Stratfor competitor has published a response with their own calibration numbers. That silence is not oversight. Publishing a graded track record creates liability: buyers can verify the claims, and a visible miss record is a harder sales conversation than an unblemished brand reputation. The entire category has implicitly agreed not to measure itself.
The buyer bears the cost of this arrangement. When a security team renews a six-figure vendor contract, they are doing so on the basis of confidence in the vendor's brand, relationships with the analyst team, and the subjective sense that the product is useful. None of this is wrong — experienced analysts produce valuable work. But "we think this is good" is not the same as "here is the evidence that the predictive content is accurate at a documented rate."
Calibrated forecasting is the discipline that closes that gap.
What Calibration Actually Means
Calibration has a precise definition. A forecaster is well-calibrated if, among all the predictions they made at 70% probability, approximately 70% of the corresponding events occurred. Among their 30% predictions, approximately 30% occurred. Among their 90% predictions, approximately 90% occurred.
This sounds like a low bar. It is not. Most people, including professional analysts, are systematically overconfident: they overestimate how often their judgments are correct. When asked to provide 90% confidence intervals for factual questions, the typical respondent's stated intervals contain the true answer about 50% of the time, not 90%. The gap between stated confidence and empirical accuracy is the calibration error, and closing it requires deliberate training and feedback.
The standard tool for visualizing calibration is the reliability diagram. The forecaster's predictions are grouped into probability bins (e.g., 0–10%, 10–20%, …, 90–100%). For each bin, the empirical hit rate — the fraction of events in that bin that actually occurred — is plotted against the forecast probability. A perfectly calibrated forecaster produces a diagonal line. Points above the diagonal mean the forecaster is underconfident (events occur more often than predicted); points below mean the forecaster is overconfident.
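The binning step behind a fixed-bin reliability diagram can be sketched in a few lines of Python. This is a minimal illustration, not MeridianOS's implementation; the function name reliability_table and the sample data are assumptions made for the example.

```python
from collections import defaultdict


def reliability_table(forecasts, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins.

    Returns one row per non-empty bin:
    (bin_lower, bin_upper, mean_forecast, hit_rate, count).
    """
    bins = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        # A forecast of exactly 1.0 is placed in the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))

    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_forecast = sum(p for p, _ in pairs) / len(pairs)
        hit_rate = sum(o for _, o in pairs) / len(pairs)
        rows.append((idx / n_bins, (idx + 1) / n_bins,
                     mean_forecast, hit_rate, len(pairs)))
    return rows


# Illustrative data: four forecasts at 25% and four at 75%.
forecasts = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]
outcomes = [1, 0, 0, 0, 1, 1, 1, 0]
for lower, upper, mean_p, hit, n in reliability_table(forecasts, outcomes):
    # hit > mean_p means underconfident; hit < mean_p means overconfident.
    print(f"{lower:.1f}-{upper:.1f}: forecast {mean_p:.2f}, observed {hit:.2f}, n={n}")
```

Plotting hit_rate against mean_forecast for each row gives the diagram; rows where the two agree sit on the diagonal.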
MeridianOS generates its reliability diagrams using the CORP method (Consistent, Optimally binned, Reproducible, and Pool-adjacent-violators based) from Dimitriadis, Gneiting, and Jordan (2021). CORP uses the pool-adjacent-violators algorithm to fit an isotonic (monotone non-decreasing) regression of observed outcomes on forecast probabilities. The bins are the constant segments of that fit, so they are determined by the data rather than fixed in advance; fixed bins can obscure structure. The result is a diagram that is reproducible across implementations and visually honest about where calibration errors concentrate.
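The pool-adjacent-violators step can be illustrated with a short pure-Python sketch. It is a simplified stand-in for the published CORP procedure, not the platform's code: the function name pav_recalibrate is hypothetical, and the uncertainty bands and score decomposition that the CORP paper also describes are omitted.

```python
from itertools import groupby


def pav_recalibrate(forecasts, outcomes):
    """Isotonic regression of outcomes on forecasts via pool-adjacent-violators.

    Returns blocks (min_forecast, max_forecast, hit_rate, count), ordered by
    forecast value, with non-decreasing hit rates.
    """
    pairs = sorted(zip(forecasts, outcomes))
    blocks = []  # each block is [lowest forecast, highest forecast, hits, count]
    # Identical forecast values are pooled first, then processed in order.
    for p, group in groupby(pairs, key=lambda pair: pair[0]):
        outs = [o for _, o in group]
        blocks.append([p, p, float(sum(outs)), len(outs)])
        # Merge backwards while a block's hit rate exceeds its successor's.
        while len(blocks) > 1 and (
            blocks[-2][2] / blocks[-2][3] > blocks[-1][2] / blocks[-1][3]
        ):
            last = blocks.pop()
            blocks[-1][1] = last[1]
            blocks[-1][2] += last[2]
            blocks[-1][3] += last[3]
    return [(lo, hi, hits / n, n) for lo, hi, hits, n in blocks]


# Illustrative data: the 20% call occurred but the 30% and 40% calls did not,
# so those three forecasts are pooled into a single data-driven bin.
for block in pav_recalibrate([0.1, 0.2, 0.3, 0.4, 0.8, 0.9], [0, 1, 0, 0, 1, 1]):
    print(block)
```

Each returned block is one bin of the diagram. Because the bins fall out of the fit, two implementations given the same data should produce the same bins.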
The Brier Score and Why a Single Number Is Not Enough
The standard accuracy metric for probabilistic forecasts is the Brier score, introduced by Glenn Brier in 1950 for weather forecasting:
BS = (1/N) Σ_i (p_i − o_i)²
Here p_i is the forecast probability for prediction i, o_i is 1 if the event occurred and 0 if it did not, and N is the number of graded predictions. The Brier score ranges from 0 (perfect) to 1 (maximally wrong). Lower is better.
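As a minimal sketch, the formula translates directly into code. The function name brier_score and the three sample forecasts are illustrative assumptions.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Three illustrative predictions: a 70% call that occurred, a 30% call that
# occurred, and a 90% call that did not occur.
print(round(brier_score([0.7, 0.3, 0.9], [1, 1, 0]), 4))
```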
The Brier score is strictly proper: a forecaster cannot improve their expected score by reporting anything other than their true belief. Saying 70% when you believe 60% will on average hurt your score, not help it. This is the minimal honesty requirement for a scoring rule to be meaningful. Any scoring rule that is not strictly proper can be gamed.
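The propriety claim can be checked numerically. The sketch below assumes a forecaster whose true belief is 60% and searches a grid of possible reported probabilities for the one with the lowest expected Brier score; the names expected_brier and best_report are illustrative.

```python
def expected_brier(belief, report):
    """Expected Brier contribution when the event occurs with probability
    `belief` but the forecaster states `report`."""
    return belief * (report - 1.0) ** 2 + (1.0 - belief) * report ** 2


belief = 0.60
grid = [i / 100 for i in range(101)]
best_report = min(grid, key=lambda r: expected_brier(belief, r))

print(best_report)                            # the report with the lowest expected score
print(round(expected_brier(belief, 0.60), 4))  # honest report
print(round(expected_brier(belief, 0.70), 4))  # shaded report
```

The minimum falls at the true belief, and the shaded report of 70% has a higher (worse) expected score than the honest 60%.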
A single Brier number, however, conflates two distinct dimensions of forecast quality. Allan H. Murphy (1973) showed that the Brier score decomposes as:
BS = Reliability − Resolution + Uncertainty
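Reliability is the calibration term: the weighted average squared gap between each forecast probability and the observed hit rate at that probability. Lower is better, and zero means perfect calibration. A forecaster who always states the base rate of the question set achieves a perfect reliability term, because the one probability they state matches the observed frequency, but they provide no information about which events will occur. This is the failure mode that a single Brier number can hide.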
Resolution measures discrimination: how well the forecaster separates events that occur from events that do not. High resolution means a forecaster assigns substantially higher probabilities to events that happen than to events that do not. A forecaster who always states the same probability, whether 50% or anything else, has zero resolution. Good forecasting requires both good calibration (a low reliability term) and high resolution.
Uncertainty is a property of the question set, not the forecaster. It depends only on the base rate: it equals b(1 − b), where b is the fraction of questions on which the event occurred, and it peaks at 0.25 when half the events occur. Uncertainty is the same for every forecaster evaluated on the same questions. It is the Brier score earned by always forecasting the base rate, which makes it the no-skill reference rather than a floor: a perfect forecaster still scores 0. A forecaster working on easy questions (elections in stable democracies, commodity price directions) will produce lower Brier scores than one working on hard questions (coup timing, attack attribution, treaty survival) even with identical calibration and discrimination skill.
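The three terms can be computed with a short sketch. It assumes forecasts are grouped by their exact stated probability, which is the case in which the identity BS = Reliability − Resolution + Uncertainty holds exactly; with wider bins the identity is only approximate. The function name murphy_decomposition and the data are illustrative.

```python
from collections import defaultdict


def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by forecast value."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    groups = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        groups[p].append(o)

    reliability = 0.0
    resolution = 0.0
    for p, outs in groups.items():
        hit_rate = sum(outs) / len(outs)
        reliability += len(outs) * (p - hit_rate) ** 2          # calibration gap
        resolution += len(outs) * (hit_rate - base_rate) ** 2   # discrimination
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Illustrative data: 20% calls that occur 1 time in 5, 80% calls 4 times in 5.
forecasts = [0.2] * 5 + [0.8] * 5
outcomes = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
rel, res, unc = murphy_decomposition(forecasts, outcomes)
print(round(rel, 4), round(res, 4), round(unc, 4), round(rel - res + unc, 4))
```

Running the same function on a constant forecast shows the two failure modes separately: resolution is always zero, while the reliability term is zero only when the constant equals the base rate.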
MeridianOS publishes all three components separately, alongside log-loss as a secondary metric for tail-bet sensitivity. Reporting only a single Brier score — without decomposition — is how a product can claim good accuracy while hiding a calibration problem or a discrimination problem behind one another.
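A sketch of log-loss shows why it is more sensitive than the Brier score to confident misses. The clipping constant eps is an assumption of this example, added so that a forecast of exactly 0 or 1 does not produce an infinite penalty; the brief does not say how the platform handles that case.

```python
import math


def log_loss(forecasts, outcomes, eps=1e-15):
    """Mean negative log-likelihood of the outcomes under the forecasts."""
    total = 0.0
    for p, o in zip(forecasts, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite at 0 and 1
        total -= o * math.log(p) + (1 - o) * math.log(1.0 - p)
    return total / len(forecasts)


# A missed 99% call versus a missed 90% call.
for p in (0.90, 0.99):
    brier = (p - 0) ** 2
    print(f"forecast {p:.2f}, event did not occur: "
          f"Brier {brier:.4f}, log-loss {log_loss([p], [0]):.4f}")
```

Moving from a missed 90% call to a missed 99% call raises the Brier contribution only modestly, because it is capped at 1, while the log-loss penalty roughly doubles and keeps growing as the stated probability approaches certainty.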
The Wrong-Side-of-Maybe Fallacy
The single most common buyer error in evaluating probabilistic forecasts is binary scoring: if the event happened, the forecast was "right"; if it did not happen, the forecast was "wrong." This is incorrect, and the error is consequential enough to have a name.
Consider a 30% probability forecast for an event that occurs. Under binary scoring, this is a miss. Under Brier scoring, the contribution to the score is (0.30 − 1.0)² = 0.49. For comparison, a 70% forecast for an event that occurs contributes (0.70 − 1.0)² = 0.09. The 30% forecast costs more on that single question, but it was not wrong in the probabilistic sense: it assigned meaningful probability to the event, and events forecast at 30% should occur about three times in ten. Calibration is judged across the whole set of 30% calls, not on one outcome. A forecaster whose 30% calls come true about 30% of the time has done their job.
The flip side: a 90% forecast that resolves incorrectly contributes (0.90 − 0.0)² = 0.81 to the Brier score. That is a costly error, and it should be. The forecaster assigned near-certainty to something that did not happen.
The wrong-side-of-maybe fallacy is this: judging a 49% forecast as "wrong" because the event happened (or "right" because it did not) treats a probabilistic estimate as if it were a binary call. A probability below 50% does not mean "we don't think this will happen." It means "this is somewhat more likely not to happen than to happen, given current evidence." A single event forecast at 45% occurring is not a failure; events forecast at 45% occurring 70% of the time is a calibration failure.
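The contrast between binary scoring and calibration can be made concrete with a small sketch. The helper names and the ten-forecast sample are illustrative assumptions, not platform data.

```python
def brier_contribution(p, outcome):
    """Squared error of one forecast against its 0/1 outcome."""
    return (p - outcome) ** 2


def binary_verdict(p, outcome):
    """The fallacy: read every forecast as a yes/no call split at 50%."""
    return "right" if (p >= 0.5) == bool(outcome) else "wrong"


# The worked figures from the text, rounded to two decimals.
print(round(brier_contribution(0.30, 1), 2))
print(round(brier_contribution(0.70, 1), 2))
print(round(brier_contribution(0.90, 0), 2))

# Ten forecasts at 30%, of which three events occurred.
forecasts = [0.30] * 10
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
binary_wrong = sum(binary_verdict(p, o) == "wrong"
                   for p, o in zip(forecasts, outcomes))
hit_rate = sum(outcomes) / len(outcomes)
print(binary_wrong, hit_rate)
```

Binary scoring counts the three events that occurred as three wrong calls. The calibration view sees a 30% hit rate on 30% forecasts, which is what a well-calibrated forecaster should produce.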
MeridianOS's public track record displays paired examples for every graded prediction: the original probability, the outcome, the Brier contribution, and the calibration interpretation. The methodology brief and each case study walk through the arithmetic explicitly so that buyers can evaluate the track record correctly rather than through binary intuition.
The Multi-Organization Evidence Requirement
Every prediction on the MeridianOS platform resolves against a documented evidentiary record. Before the system allows the outcome to be recorded, two conditions must be met:
Condition 1: At least two sources from distinct organizations must be inserted as grading evidence for the prediction, each with a publication date, URL, quoted passage, and relevance note.
Condition 2: A structured lessons row must be completed, capturing what the predictive reasoning got right, what it got wrong, the base rate error in percentage points, which indicator signals were well-read and which were missed, a methodology adjustment for future predictions on similar questions, and two to four canonical topic tags.
These are not policy commitments. They are enforced by a PostgreSQL trigger. An UPDATE statement setting a prediction's outcome to Correct, Incorrect, Partially Correct, or Overtaken by Events is rejected by the database if either condition is unmet. The error message tells the analyst exactly what is missing. There is no override.
The multi-organization requirement exists to prevent two specific failure modes. First, circular evidence — a single analyst citing their own prior assessment as evidence for an outcome. Second, source capture — consistent reliance on a single outlet that shares the analyst's priors, producing the appearance of corroboration without independent verification. Requiring at least two distinct organizations with publication dates that postdate the original prediction forces the resolution to rest on observable, independently reported facts.
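The gating logic can be sketched in Python for readers who want to see the shape of the check. This is not the PostgreSQL trigger itself, whose source is not reproduced in this brief. Every class, field, and function name below is hypothetical, the lessons row is simplified, and the date check assumes that evidence must be published on or after the prediction date.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Evidence:  # hypothetical shape of one grading-evidence row
    organization: str
    publication_date: date
    url: str
    quoted_passage: str
    relevance_note: str


@dataclass
class Lessons:  # hypothetical, simplified lessons row
    got_right: str
    got_wrong: str
    base_rate_error_pp: float
    methodology_adjustment: str
    topic_tags: list = field(default_factory=list)


def grading_blockers(date_made, evidence, lessons):
    """Return reasons the outcome cannot be recorded yet (empty list = allowed)."""
    problems = []
    usable = [
        e for e in evidence
        if e.organization.strip() and e.url.strip()
        and e.quoted_passage.strip() and e.relevance_note.strip()
        and e.publication_date >= date_made
    ]
    organizations = {e.organization.strip().lower() for e in usable}
    if len(organizations) < 2:
        problems.append(
            f"need 2 distinct organizations, found {len(organizations)}")
    if lessons is None:
        problems.append("lessons row is missing")
    else:
        if not (lessons.got_right.strip() and lessons.got_wrong.strip()
                and lessons.methodology_adjustment.strip()):
            problems.append("lessons row has empty text fields")
        if not 2 <= len(lessons.topic_tags) <= 4:
            problems.append("lessons row needs two to four topic tags")
    return problems
```

Returning a list of named problems mirrors the behavior described above, where the error message tells the analyst exactly what is missing. Comparing organization names case-insensitively is a choice made for this sketch; deciding whether two named organizations are genuinely independent is a harder question than string matching.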
The lessons table exists to close the feedback loop that most voluntary "lessons learned" sections in customer reports do not. A quarterly PDF note that says "we were early on X" is not machine-readable, not queryable, and not structurally wired to improve future predictions on similar topics. MeridianOS's lessons rows feed directly into the premortem protocol used when new predictions are drafted — the system surfaces relevant past lessons before a new forecast is logged, not after it resolves.
How This Compares to What Incumbents Offer
The geopolitical intelligence subscription market is structured around three product shapes. Narrative intelligence (Stratfor/RANE, Control Risks, Dragonfly, Sibylline, S-RM) delivers analyst-written assessments. Indexed country risk (Verisk Maplecroft, Fitch BMI, EIU) delivers numerical scores. Custom forecasting services (Good Judgment Inc) deliver probability estimates from trained human panels.
None of these vendors — with the single exception of Good Judgment Inc — publishes calibrated predictive accuracy in the sense described above. Good Judgment Inc is the commercial heir to Philip Tetlock's IARPA ACE program (2011–2015), in which trained civilian forecasters with no access to classified information produced Brier scores approximately 30% better than U.S. intelligence community analysts with classified access, on identical questions. GJI publishes a graded track record and uses strictly proper scoring. Their superforecasters are generalists by design.
MeridianOS differs in three respects:
Regional depth over breadth. Tetlock's finding is that "foxes" (forecasters who draw on many frames and update across them) beat "hedgehogs" (forecasters who interpret events through one big organizing idea) on general questions. The flip side, documented in expert-performance research by Camerer and Johnson (1991) and subsequent replications, is that domain specialists with extensive case knowledge outperform generalists on questions within their specialty. IARPA's Hybrid Forecasting Competition (2017–2019) tested generalist human-machine teams on generalist questions. The question of whether a 10-year South Asia / Gulf regional operator beats a polymath generalist specifically on Balochistan security questions or Strait of Hormuz closure risk has no public benchmark, which makes it a defensible niche. Design-partner pilots are partly an exercise in establishing that benchmark empirically.
Enforcement over policy. The multi-organization evidence requirement and structured lessons table are enforced at the database layer. Voluntary discipline degrades over time, under deadline pressure, when the analyst is confident. The database does not have deadline pressure. A voluntary lessons-learned section in a customer report and a database trigger that blocks the outcome write until the lessons row is complete are not equivalent commitments.
Decomposed scoring over single-metric claims. Publishing Reliability, Resolution, and Uncertainty separately — alongside log-loss and with explicit confidence intervals on sample sizes below 30 — is a different level of disclosure than saying "our analysts have a strong track record." It is verifiable, auditable, and adversarial in the useful sense: a buyer's analyst can recompute the numbers from the publicly available prediction data and confirm or challenge them.
The IARPA Evidence Base
The empirical foundation for calibrated forecasting as a discipline rests on a convergent body of research:
Tetlock & Gardner, Superforecasting (2015) — The book-length treatment of the IARPA ACE findings. Trained superforecasters achieved Brier scores of approximately 0.25 in Year 1, improving to approximately 0.20 by Year 4 of the tournament. Key finding: superforecasters updated frequently in small increments (5% rather than 25%), aggregated across diverse evidence, and actively sought disconfirming information.
Mellers et al. (2014) — The primary IARPA ACE results paper in Psychological Science (25(5):1106–1115). Documents the comparison between superforecaster teams and individual experts across geopolitical question categories. Superforecasters outperformed professional analysts with classified access by approximately 30%.
IARPA Hybrid Forecasting Competition (2017–2019) — Tested whether AI augmentation of superforecaster teams could improve on unassisted superforecasters. AI-augmented teams improved on unassisted humans; superforecasters still outperformed the closest AI-only system by approximately 20% on the standard test set.
Schoenegger et al. (2024), arXiv:2402.07862 — The most recent systematic evaluation of LLM-augmented human forecasters. Across 31 forecasters on 592 questions, LLM-assisted forecasters improved accuracy by 23–43% relative to unassisted humans, depending on the assistance format. Crucially, the improvement was largest for forecasters who used LLMs for information retrieval and scenario generation rather than direct probability elicitation — consistent with using LLMs as a collection and synthesis layer rather than as the forecasting agent.
MeridianOS sits in the last category: LLM-augmented collection (structured OSINT across 28 regions using purpose-built collection skills) feeding a human analyst making the probability judgments and resolving them against graded evidence. The LLM infrastructure is not the edge; the graded track record and structured feedback loop are.
ForecastBench (Karger et al., 2024, arXiv:2409.19839) provides a current benchmark. Superforecaster average Brier score on the ForecastBench question set is 0.081. Note: ForecastBench questions are drawn from Metaculus and similar public platforms and skew toward resolvable near-term questions. MeridianOS's question set covers harder, longer-horizon regional security questions; direct Brier comparison across question sets is misleading. What matters for calibration purposes is the reliability diagram, not cross-platform Brier comparison.
What the Public Track Record Contains
The MeridianOS public track record — available at the platform's calibration page — contains the following for every graded prediction:
The original prediction text, logged at the time of forecast, unedited. Retroactive wording changes are not possible; the write timestamp is part of the database record.
The probability at forecast time and every subsequent update, with timestamps. Frequent small updates suggest active engagement with new information; a single probability held unchanged for six months may indicate anchoring.
The evidence record: each grading evidence row, with source name, organization, publication date, URL, quoted passage, and relevance note. The multi-organization requirement is visible here — a buyer can count the distinct organizations and verify they are independent.
The lessons row: what the reasoning got right, what it got wrong, base-rate error, indicator signals reviewed, methodology adjustment. The lessons are the intellectual accountability artifact that voluntary track records rarely include for incorrect calls.
The Brier contribution and its interpretation in the context of the prediction's probability band.
The aggregate track record displays the rolling Brier score with 95% confidence interval, the Murphy decomposition (Reliability, Resolution, Uncertainty), and the CORP reliability diagram, filterable by region, domain, and time window. All confidence intervals are shown explicitly; bins with fewer than 10 observations are flagged as statistically thin.
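One simple way to attach a 95% confidence interval to a Brier score is a normal approximation on the per-prediction contributions, sketched below. The brief does not state which interval method the platform uses, so this is an assumption for illustration only; a bootstrap is a common alternative, particularly for small samples. The function name brier_with_ci is hypothetical.

```python
import math
import statistics


def brier_with_ci(forecasts, outcomes, z=1.96):
    """Mean Brier score with a normal-approximation confidence interval.

    Returns (mean, lower, upper); the bounds are None with fewer than two
    predictions, where no spread can be estimated.
    """
    contributions = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    n = len(contributions)
    mean = statistics.fmean(contributions)
    if n < 2:
        return mean, None, None
    half_width = z * statistics.stdev(contributions) / math.sqrt(n)
    # The Brier score is bounded, so the interval is clipped to [0, 1].
    return mean, max(0.0, mean - half_width), min(1.0, mean + half_width)


mean, lower, upper = brier_with_ci([0.2, 0.8, 0.6, 0.4], [0, 1, 1, 0])
print(f"Brier {mean:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

With only four predictions the interval is wide relative to the score, which is why thin samples are flagged rather than presented as settled.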
Verification and Audit
The track record's integrity rests on three properties:
Procedural enforcement, not policy commitment. The database trigger is the mechanism, not the analyst's discipline. Source code for the trigger is available on request.
Immutable timestamps. Predictions, updates, and outcomes are written with PostgreSQL NOW() server timestamps. The prediction creation timestamp predates any evidence rows by construction (the evidence publication-date check enforces this: publication_date >= date_made). Back-dating a forecast is structurally impossible within the system.
Multi-organization corroboration. Each graded prediction's evidence record is externally verifiable: every source URL is a public document, and the organization field is a named entity that a reviewer can independently confirm is distinct from other cited organizations.
An outside reviewer with a statistics background can independently recompute the Brier score and Murphy decomposition from the publicly displayed prediction data. A technical reviewer can request the trigger definitions and the schema to verify that enforcement is genuine. MeridianOS invites both.
What This Document Does Not Claim
This brief does not claim the platform is more accurate than any specific competitor. The comparison against incumbents is structural — about whether calibration is measured and disclosed, not about relative accuracy among products that do not publish their numbers.
It does not claim a Brier score that should be treated as the final word on quality. Sample size matters: calibration statistics on fewer than 30 graded predictions are directional at best. The current count is growing, and every graded prediction is added to the public record regardless of outcome.
It does not claim that regional depth produces a documented forecasting edge over GJI superforecasters on similar questions. That comparison has not been empirically established. The design-partner pilot phase is partly an exercise in generating the data to test it.
What it does claim: the scoring rules are strictly proper, the enforcement is structural, the evidence requirements are documented and publicly visible, and the methodology is built on a peer-reviewed empirical foundation that predates this platform by a decade.