A short post on scoring rules and their connection to statistical divergences.

A scoring rule is a metric that characterises the quality of a probabilistic forecast. If we are interested in forecasting rainfall, then we let the random variable $Y$ denote a future event and $\mathcal{Y}$ the set of all possible values that $Y$ could take. In this example, we would define $\mathcal{Y} = \mathbb{R}_{\geq 0}$ as it would not make sense to have negative rainfall. Our model is probabilistic and therefore outputs a probability distribution. We use $\mathcal{P}$ to denote the set of all valid probability distributions with support on $\mathcal{Y}$. When computing a scoring rule, we seek to compare our model’s forecasted distribution $p_t \in \mathcal{P}$ at time $t$ against a true observation $y_t$. In the context of our precipitation example, $p_t$ could be an exponential distribution with a rate parameter of 2, and $y_t$ would be a real value, such as 2.2mm, corresponding to the true rainfall amount at time $t$.
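To make the example concrete, here is a minimal sketch in Python (assuming SciPy is available) that evaluates the forecast against the observation above. The negative log density is used as the score purely for illustration; it is one common choice of scoring rule and is discussed later in the post.

```python
# A minimal sketch of the precipitation example: the forecast p_t is an
# exponential distribution with rate 2 and the observation y_t is 2.2 mm.
# The negative log density serves as the scoring rule here; lower is better.
from scipy.stats import expon

rate = 2.0
p_t = expon(scale=1.0 / rate)  # SciPy parameterises the exponential by scale = 1/rate
y_t = 2.2                      # observed rainfall in mm

score = -p_t.logpdf(y_t)       # S(p_t, y_t)
print(f"Score of the forecast at the observation: {score:.3f}")
```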

A scoring rule is then a function $S: \mathcal{P} \times \mathcal{Y} \to \mathbb{R}$. Lower scores are indicative of a higher quality forecast. If we let $q$ be the true probability distribution of rainfall, then for any $p\in\mathcal{P}$ we define the expected score $$S(p, q) = \mathbb{E}_{Y \sim q}[S(p, Y)] = \int S(p, y)\,\mathrm{d}q(y)\ . \tag{1}$$ A scoring rule is a proper scoring rule if, for all $p$, $$S(q, q) \leq S(p, q) \ . \tag{2}$$ A proper scoring rule is strictly proper if the inequality in (2) is strict whenever $p \neq q$; in other words, equality in (2) is achieved only if $p=q$.
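The sketch below is a rough Monte Carlo illustration of (2), not something from the post itself: it assumes SciPy and uses the logarithmic score introduced in the next paragraph. With the true distribution $q$ taken to be exponential with rate 2, the estimated expected score $S(p, q)$ is smallest for the forecast whose rate matches the truth.

```python
# Monte Carlo check of propriety (2) for the logarithmic score S(p, y) = -log p(y).
# The expected score under q is estimated from samples and is minimised at p = q.
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(0)
true_rate = 2.0
q = expon(scale=1.0 / true_rate)
samples = q.rvs(size=100_000, random_state=rng)   # draws from the true distribution q

for rate in [0.5, 1.0, 2.0, 4.0]:                 # candidate forecasts p
    p = expon(scale=1.0 / rate)
    expected_score = np.mean(-p.logpdf(samples))  # estimate of S(p, q)
    print(f"rate {rate:3.1f}: estimated S(p, q) = {expected_score:.3f}")
```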

Scoring rules are a broad family of functions and there are many connections to statistical divergences. One example is the equivalence of the logarithmic scoring rule $S(p, y) = -\log p(y)$ and the Kullback-Leibler divergence (KLD) $\operatorname{KL}(q, p)$. Note that the order of arguments in the KL divergence matters, as the divergence is asymmetric. To see this connection, we can write