Proper scoring rules

A scoring rule assigns a number to a probabilistic forecast once the outcome is observed. The right rule turns truthfulness into self-interest: a forecaster who is paid (or penalised) by a proper rule cannot do better than to report exactly what they believe. This single property, that truth-telling is optimal, is what makes scoring rules the elicitation engine beneath market scoring rules, the LMSR, forecast aggregation, and peer prediction. Throughout this page and the reference code, scores are written in loss form: lower is better, and the minimum expected loss is achieved by the true belief.

What "strictly proper" means

Write a categorical forecast as $p$ over $n$ outcomes, and let the forecaster's true belief be $q$. A scoring rule $S(p, y)$ produces a loss after outcome $y$ is seen. The quantity that matters for incentives is the expected loss of reporting $p$ when the world is really distributed as $q$:

$S(p, q) = \sum_{i} q_i \, S(p, i).$

The rule is proper if reporting the truth is never worse than lying, $S(q, q) \le S(p, q)$ for every $p$ and $q$, and strictly proper if the truth is the unique minimiser, so that any deviation strictly hurts in expectation. This is the elicitation property: an expert with private information does best by revealing it, whatever the other forecasters report and without needing to know the true outcome distribution. Geometrically (Savage 1971; Gneiting & Raftery 2007), every strictly proper rule corresponds to a strictly convex function: in the loss convention used here, the minimum expected loss $S(q, q)$ is strictly concave in $q$ (a generalised entropy), so $G(q) = -S(q, q)$ is strictly convex, and the rule is recovered from the subgradients of $G$. This is the same convex-duality skeleton that reappears in cost-function market makers.

In code: expected_score(score_fn, belief, p) in mechanisms/scoring_rules.py computes $S(p, q)$ directly, so you can watch the minimum land at $p = q$ for a proper rule and drift away for an improper one.
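
The definition can be checked numerically with a minimal sketch. It is not the module's implementation: log_loss, linear_loss, expected_loss and best_report are hypothetical stand-ins written for this illustration, and the belief of 0.7 is an arbitrary choice. The sketch searches a grid of reports for the one with the lowest expected loss, once under the logarithmic score and once under the improper linear score $1 - p_y$.

```python
import numpy as np

def log_loss(p, y):
    """Logarithmic score in loss form: -log p_y (strictly proper)."""
    return -np.log(p[y])

def linear_loss(p, y):
    """Linear score in loss form: 1 - p_y (improper, used here as a contrast)."""
    return 1.0 - p[y]

def expected_loss(loss_fn, belief, report):
    """S(report, belief) = sum_i belief_i * loss_fn(report, i)."""
    return sum(belief[i] * loss_fn(report, i) for i in range(len(belief)))

belief = np.array([0.7, 0.3])        # illustrative true belief q
grid = np.linspace(0.01, 0.99, 99)   # candidate reports for p_A, step 0.01

def best_report(loss_fn):
    """Grid value of p_A with the smallest expected loss under `belief`."""
    losses = [expected_loss(loss_fn, belief, np.array([a, 1.0 - a])) for a in grid]
    return grid[int(np.argmin(losses))]

best_log = best_report(log_loss)        # lands on the belief, 0.7
best_linear = best_report(linear_loss)  # runs to the edge of the grid
print(best_log, best_linear)
```

Under the linear score the expected loss is $0.7 - 0.4\,p_A$, so the forecaster is pushed to exaggerate toward certainty; under the log score the minimum sits at the belief itself.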

The three classics

For a forecast $p$ over a finite outcome set with realised class $y$, the three canonical strictly proper rules, in loss form (minimum $0$, lower is better), are:

$S_{\log}(p, y) = -\log p_y, \qquad S_{\text{Brier}}(p, y) = \sum_i \big(p_i - \mathbf 1\{i=y\}\big)^2, \qquad S_{\text{sph}}(p, y) = 1 - \dfrac{p_y}{\lVert p\rVert_2}.$

The logarithmic score (Good, 1952) is local: it depends on the forecast only through the probability assigned to what actually happened. For three or more outcomes it is, up to an affine rescaling, the unique smooth strictly proper rule with this property. It is exactly cross-entropy / negative log-likelihood, and it is the score that, when run as a market, becomes Hanson's LMSR (a sequentialised log score). It is unbounded: assigning probability near zero to an outcome that then occurs incurs an arbitrarily large loss. The Brier (quadratic) score is bounded in $[0, 2]$ and is the workhorse of calibration studies; the spherical score normalises by the Euclidean length of $p$. The reward form of the spherical rule is $p_y / \lVert p \rVert_2$; the module returns $1 - \text{reward}$ so that, like the others, lower is better.
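
As a worked check of strict properness, the expected-loss gap between reporting $p$ and reporting the belief $q$ has a closed form for two of these rules. This is a standard derivation, written in the expected-loss notation $S(p, q)$ defined earlier:

$S_{\log}(p, q) - S_{\log}(q, q) = \sum_i q_i \log \dfrac{q_i}{p_i} = \mathrm{KL}(q \,\Vert\, p) \ge 0,$

$S_{\text{Brier}}(p, q) - S_{\text{Brier}}(q, q) = \sum_i (p_i - q_i)^2 = \lVert p - q \rVert_2^2 \ge 0.$

Both gaps vanish only at $p = q$, so the truthful report is the unique minimiser in each case.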

In code: log_score(p, y), brier_score(p, y), spherical_score(p, y), each taking a 1-D probability vector and an integer class index.
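
The three formulas translate almost line for line into NumPy. The sketch below is an illustration under assumed names (log_loss, brier_loss, spherical_loss), not the module's own code; it takes a 1-D probability vector and an integer class index, matching the calling convention described above.

```python
import numpy as np

def log_loss(p, y):
    """Logarithmic score in loss form: -log p_y."""
    p = np.asarray(p, dtype=float)
    return -np.log(p[y])

def brier_loss(p, y):
    """Multi-class Brier score: squared distance from the one-hot outcome."""
    p = np.asarray(p, dtype=float)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return np.sum((p - onehot) ** 2)

def spherical_loss(p, y):
    """Spherical score in loss form: 1 - p_y / ||p||_2."""
    p = np.asarray(p, dtype=float)
    return 1.0 - p[y] / np.linalg.norm(p)

forecast = [0.7, 0.2, 0.1]   # illustrative forecast over three classes
for name, fn in [("log", log_loss), ("brier", brier_loss), ("spherical", spherical_loss)]:
    # Loss if the likely class 0 occurs versus the unlikely class 2.
    print(name, fn(forecast, 0), fn(forecast, 2))
```

For this forecast the Brier loss when class 0 occurs is $0.3^2 + 0.2^2 + 0.1^2 = 0.14$, and all three losses are zero for a forecast that puts probability one on the realised class.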

Point and interval forecasts

Not every forecast is a full distribution. To elicit a single quantile rather than a probability, the right loss is the pinball (tick) loss, whose expected value is minimised by the true $\tau$-quantile of the predictive distribution:

$L_\tau(z, y) = (y - z)\big(\tau - \mathbf 1\{y \lt z\}\big), \qquad \tau \in (0, 1).$

It weights under-prediction ($y \gt z$) by $\tau$ and over-prediction ($y \lt z$) by $1 - \tau$; at $\tau = \tfrac12$ it reduces to (half) the absolute error and elicits the median. For a whole prediction interval, the interval score for a central $(1-\alpha)$ interval $[\ell, u]$ rewards narrowness while charging a penalty, scaled by $2/\alpha$, whenever the outcome escapes the interval:

$S^{\text{int}}_\alpha(\ell, u; y) = (u - \ell) + \tfrac{2}{\alpha}(\ell - y)\,\mathbf 1\{y \lt \ell\} + \tfrac{2}{\alpha}(y - u)\,\mathbf 1\{y \gt u\}.$

This is a proper rule for the pair of predictive quantiles at levels $\alpha/2$ and $1 - \alpha/2$, so a forecaster cannot game it by quoting an absurdly wide or narrow band; calibration and sharpness are traded off automatically.

In code: pinball_loss(z, y, tau) and interval_score(lower, upper, y, alpha).
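
Both losses are short enough to sketch directly from the formulas. The functions pinball and interval_loss below are hypothetical illustrations rather than the module's code, and the outcomes $1, \dots, 100$ are made-up data chosen so that the 0.9-quantile is easy to read off.

```python
import numpy as np

def pinball(z, y, tau):
    """Pinball loss (y - z) * (tau - 1{y < z}); y may be a scalar or an array."""
    y = np.asarray(y, dtype=float)
    indicator = np.where(y < z, 1.0, 0.0)
    return (y - z) * (tau - indicator)

def interval_loss(lower, upper, y, alpha):
    """Interval score: width plus a 2/alpha penalty per unit of miss."""
    width = upper - lower
    below = (2.0 / alpha) * (lower - y) if y < lower else 0.0
    above = (2.0 / alpha) * (y - upper) if y > upper else 0.0
    return width + below + above

# Average pinball loss over made-up outcomes 1..100, for each candidate quantile z.
outcomes = np.arange(1, 101)
candidates = np.arange(1, 101)
mean_losses = [pinball(z, outcomes, 0.9).mean() for z in candidates]
best_z = int(candidates[int(np.argmin(mean_losses))])   # 90 or 91 (an exact tie)

inside = interval_loss(1.0, 3.0, 2.0, alpha=0.1)    # outcome covered: width only
outside = interval_loss(1.0, 3.0, 4.0, alpha=0.1)   # missed by 1: width + 2/alpha
print(best_z, inside, outside)
```

With $\alpha = 0.1$ a miss of one unit costs as much as twenty units of extra width, which is what discourages an overly narrow band.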

Distributional and sample forecasts

For a real-valued forecast given as a predictive distribution and a realised value $y$, the Continuous Ranked Probability Score (CRPS) is a strictly proper rule reported in the units of the observation; it reduces to absolute error for a point forecast. Given an ensemble of samples $x_1, \dots, x_m$ it has the clean energy-form estimator

$\mathrm{CRPS} = \dfrac{1}{m}\sum_i |x_i - y| - \dfrac{1}{2m^2}\sum_{i,j} |x_i - x_j|.$

The first term penalises ensemble members that lie far from the outcome; the second, which is subtracted, credits ensemble spread and so stops the forecaster from collapsing the ensemble onto a single point. Together they make truthfulness about the whole distribution optimal. Its multivariate generalisation is the energy score: for an ensemble of points $x_i \in \mathbb R^d$ and an observation $y$,

$\mathrm{ES} = \dfrac{1}{m}\sum_i \lVert x_i - y\rVert^{\beta} - \dfrac{1}{2m^2}\sum_{i,j} \lVert x_i - x_j\rVert^{\beta}, \qquad \beta \in (0, 2).$

It is strictly proper for $\beta \in (0, 2)$ (Gneiting & Raftery 2007); at $\beta = 1$ and $d = 1$ it reduces exactly to the CRPS above, which is why it is often described as a multivariate CRPS. Because it scores a raw cloud of Monte-Carlo points rather than a parametric density, it is the natural rule for sample-based distributional-forecasting contests such as monteprediction.com, where participants submit an ensemble and are graded on how well that cloud surrounds the realised multivariate outcome.

In code: crps_ensemble(samples, y) and energy_score(samples, y, beta=1.0).
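
The two sample-based estimators can be sketched with pairwise distances in NumPy. The names crps_from_samples and energy_from_samples are assumptions made for this illustration and are not the module's functions; the tiny ensembles are made-up data.

```python
import numpy as np

def crps_from_samples(samples, y):
    """Energy-form CRPS estimator: mean |x_i - y| - (1 / 2m^2) sum_ij |x_i - x_j|."""
    x = np.asarray(samples, dtype=float)                  # shape (m,)
    to_obs = np.mean(np.abs(x - y))
    pairwise = np.mean(np.abs(x[:, None] - x[None, :]))   # mean over all m^2 pairs
    return to_obs - 0.5 * pairwise

def energy_from_samples(samples, y, beta=1.0):
    """Energy score for an (m, d) ensemble and a length-d observation."""
    x = np.asarray(samples, dtype=float)                  # shape (m, d)
    y = np.asarray(y, dtype=float)                        # shape (d,)
    to_obs = np.linalg.norm(x - y, axis=1) ** beta
    pairwise = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2) ** beta
    return to_obs.mean() - 0.5 * pairwise.mean()

point = crps_from_samples([3.0], 5.0)                     # one sample: absolute error
spread = crps_from_samples([1.0, 2.0, 3.0], 2.0)          # 2/3 - 4/9 = 2/9
same = energy_from_samples([[1.0], [2.0], [3.0]], [2.0])  # d = 1, beta = 1: the CRPS
print(point, spread, same)
```

Both functions build the full $m \times m$ distance matrix, which is fine for small ensembles but uses memory quadratic in $m$.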

Why properness makes truthfulness dominant

The thread running through all of these rules is the same. Because the expected loss is uniquely minimised at the true belief, a forecaster maximises their expected payoff by reporting it: no second-guessing, no strategic shading, no dependence on what anyone else does. That is what "incentive-compatible elicitation" means in one sentence. Strict properness is also what lets these rules be composed into mechanisms: differencing a proper score across successive reports yields a market scoring rule in which every trader is paid the improvement they make to a public forecast, so each is incentivised to push the price toward their own belief while the operator's worst-case loss stays bounded. Run on the logarithmic score, that construction is the LMSR; averaged across many forecasters it underwrites aggregation; and stripped of a ground-truth outcome it motivates the correlated-report scores of peer prediction.
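
The differencing construction can be sketched in a few lines. This is an illustration under stated assumptions, not the project's market code: msr_payments is a hypothetical helper, the sequence of reports is made up, the opening forecast is uniform, and the liquidity scale is fixed at 1. Each trader replaces the public forecast and is paid the reduction in log loss relative to the forecast they replaced.

```python
import numpy as np

def log_loss(p, y):
    """Logarithmic score in loss form: -log p_y."""
    return -np.log(p[y])

def msr_payments(reports, y, loss_fn=log_loss):
    """Pay each trader the loss reduction relative to the previous public forecast."""
    return [loss_fn(prev, y) - loss_fn(new, y)
            for prev, new in zip(reports[:-1], reports[1:])]

n = 3
reports = [
    np.full(n, 1.0 / n),          # operator's opening forecast (uniform)
    np.array([0.5, 0.3, 0.2]),    # trader 1 moves toward outcome 0
    np.array([0.8, 0.1, 0.1]),    # trader 2 moves further
    np.array([0.6, 0.3, 0.1]),    # trader 3 moves back and loses money if 0 occurs
]
y = 0                             # realised outcome

payments = msr_payments(reports, y)
operator_cost = sum(payments)     # telescopes to log(0.6) - log(1/3) = log(1.8)
print(payments, operator_cost, np.log(n))
```

Because the payments telescope, the operator's total outlay depends only on the opening and closing forecasts, and with a uniform opening forecast it can never exceed $\log n$.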

Try it

A two-outcome forecast $p=[p_A,\,1-p_A]$ is scored (loss form, lower is better) against the outcome that actually occurs. All three rules are strictly proper: each is minimised in expectation by reporting your true belief.

Interactive controls: forecast $p(A)$; outcome (A occurs or B occurs).

Code: mechanisms/scoring_rules.py · Demo: examples/sim_scoring_rules.py · Related: LMSR, aggregation · Research: parimutuel-and-scoring-rules.md