Bayesian Synthetic Difference-in-Differences via Cut Posteriors

2026-03-31

A presentation of a Bayesian estimator of Synthetic Difference-in-Differences with accompanying Numpyro code.

Introduction

Synthetic Difference-in-Differences 101

Synthetic Difference-in-Differences (SDiD, Arkhangelsky et al. (2021)) is an approach for quasi-causal inference that, as the name would imply, blends Difference-in-Differences (DiD) with the Synthetic Control (SC) Method to estimate causal effects in panel data. Assuming a set of units of which a subset have received a treatment, the SDiD estimator works in three stages. First, as in SC, the weights that balance control units against the treated unit are learned. In the second set a vector of time weights that balance pre- against post-treatment periods are learned, before finally a treatment effect is estimated from a weighted two-way fixed effects regression.

To ensure solid foundations, let’s connect the three-stage structure of SDiD to the identification and estimation distinction (Petersen & van der Laan, 2014; Hernán & Robins, 2020). The estimand under consideration here is the Average Treatment Effect on the Treated (ATT): $\tau = \mathbb{E}[Y(1) - Y(0) \mid D = 1]$. We observe $Y(1)$ for treated units post-treatment, but the counterfactual $Y(0)$, a quantification of what would have happened absent treatment, is missing. Under the regular SC assumptions the unit weights $\omega$ and time weights $\lambda$ provide identification by connecting the observed control data to the missing counterfactual. Once the weights are fixed, estimating $\tau$ is reduced down to a stastical task of determining whether or not we can identify $\tau$ with the desired precision.

Figure 1: Data partitioning in SDiD. Each stage uses a specific slice of the panel, preserving the separation of concerns.

The Bayesian Jump

I’ve wanted to make the SDiD estimator Bayesian for a while, but the difficulty is that the three-stage structure is not just a computational convenience. In stark contrast to regular Bayesian inference where we’d have a single likelihood specified over all the data, the separation in SDiD presents a challenge as each of the stages are purposefully computed using separate slices of the data. Unit weights should reflect only pre-treatment data, time weights only control-unit data, and neither should be distorted by the outcome likelihood that identifies the treatment effect. Shoehorning all parameters into a single Bayesian model would break this separation as the outcome likelihood would pull the weights away from their balancing objective, allowing the estimand to influence its own identification.

It’s natural at this point to question why it is worth pursuing a Bayesian analogue to SDiD, given the challenges I have just mentioned and the fact that ordinary SDiD works very well already. Well, allow me to step onto my soapbox for just a minute and I’ll justify my motivations. First, the Bayesian framework turns the regularisation parameters from opaque tuning knobs into interpretable prior choices, and the model itself becomes a composable object. Under the model I propose here, one could swap likelihood specifications, add hierarchical structure for multiple treated units, or incorporate informative priors from related studies, all without rederiving the estimator from scratch. In the world of quasi-causal inference, models are seldom built absent of stakeholder/expert input and, therefore, having a mechanism within the model to impart such guidance is hugely helpful in constructing richer estimators, and also establishing stakeholder trust in the model by enabling them to guide its construction.

A second reason for pursuing a Bayesian SDiD estimator is to have access to the ATT estimand’s posterior distribution. Through the posterior distribution, we gain the ability to make direct probability statements about the treatment effect and its associated (un)certainty. This is often hailed as the biggest selling point of a Bayesian workflow, and I believe this benefit to be particularly pronounced here. A 94% posterior interval for $\tau$ means that given the data and the model, there is a 94% probability that $\tau$ lies in this interval. Compare this to the frequentist confidence interval, which asserts only that if the experiment were repeated many times, 95% of intervals constructed this way would contain the true $\tau$. For a one-off policy evaluation like Proposition 99, “the treatment effect reduced per-capita sales by 10–20 packs with 94% probability” is a more direct answer to the question a policymaker actually asks than “we cannot reject the null at the 5% level, and the 95% CI is …”. Again, due to the working nature of these models, communicating results derived from posterior intervals to stakeholders is almost always simpler and more transparent mode of reporting than a frequentist confidence interval argument.

Finally, uncertainty propagates end-to-end whereby posterior draws of the weights flow through the double-difference formula to produce a full posterior over $\tau$, rather than requiring a separate variance estimation procedure such as the Jackknife or placebo resampling. From an estimation persepctive, I find this to be a much cleaner approach to reasoning about uncertainty and resolves the challenge of selecting a variance estimator in favour of directly lifting the uncertainty estimate from the inferred posterior.

A Note on Cut Posteriors

The approach outlined in this post was sparked whilst reading on cut posteriors (Plummer, 2015)for an independent piece of work. The overarching principle of a cut posterior is that we can take an ordinary Bayesian model and partition it into a set of constituent sub-models, or modules. Each model is given its own likelihood and the feedback from downstream modules to upstream parameters is blocked, or cut, to prevent cross-contamination of modules. The cut posterior respects the identification-estimation boundary by allowing the weight modules to handle identification, the treatment-effect module to handle estimation, and, consequently, information flows in only one direction.

My original attempt at implementing a Bayesian analogue of SDiD did indeed use a full cut-posterior; however, performing inference in such a model was very challenging. It is of note here that in ChiRho package there exists a fully cut posterior in which inference is done through variational inference. This lead me to reassess the model and notice a simplification was possible that allows me to restrict information flow directly via implementation. In practice, I estimate the weight modules via MCMC and then compute the treatment effect analytically from posterior draws. This preserves the three-stage logic whilst propagating weight uncertainty into the treatment effect posterior. Therefore, whilst the approach here is not a true cut-posterior and is probably more cut-inspired, I shall hereon refer to it as the cut model.

Disclaimer

The ideas and thought process reflected in this process is my own - the problem of forming a Bayesian-variant of SDiD is something I have been mulling over for some time. However, I want to be clear that AI, namely Opus 4.6, was used to generate a good portion of the Python and Mermaid code used here, provide “adversarial” reviews, and helped me to refine my thought process. I find this to be a particularly useful process of AI-assisted science, and I plan to share some of the Skills and loops I’ve found particularly useful in the future once I’ve had time to better stress-test them.

The SDiD estimator

Given a balanced panel dataset $Y \in \mathbb{R}^{N \times T}$ with $N_{\text{co}}$ control units and $N_{\text{tr}}$ treated units observed over $T_{\text{pre}}$ pre-treatment and $T_{\text{post}}$ post-treatment periods, the frequentist SDiD proceeds in three stages.

Stage 1: Unit weights. Find simplex weights $\hat{\omega} \in \Delta^{N_{\text{co}}}$ that balance control units against treated units in the pre-treatment period:

$$\hat{\omega} = \arg\min_{\omega \in \Delta^{N_{\text{co}}}} \sum_{t=1}^{T_{\text{pre}}} \left(\omega_0 + \sum_{j \in \text{co}} \omega_j Y_{jt} - \bar{Y}_{\text{tr},t}\right)^2 + \zeta^2 T_{\text{pre}} \|\omega\|_2^2.$$

Stage 2: Time weights. Find simplex weights $\hat{\lambda} \in \Delta^{T_{\text{pre}}}$ that balance pre-treatment periods against post-treatment periods for control units:

$$\hat{\lambda} = \arg\min_{\lambda \in \Delta^{T_{\text{pre}}}} \sum_{i \in \text{co}} \left(\lambda_0 + \sum_{s=1}^{T_{\text{pre}}} \lambda_s Y_{is} - \bar{Y}_{i,\text{post}}\right)^2 + \zeta^2 N_{\text{co}} \|\lambda\|_2^2.$$

The regularisation terms play asymmetric roles. For unit weights, $\zeta$ controls the bias-variance trade-off between DiD and SC. For time weights, Arkhangelsky et al. (2021) use near-zero regularisation ($\zeta \approx 10^{-6}\hat{\sigma}$), letting the time weights concentrate on whichever pre-treatment periods are most informative. The panel’s correlation structure makes the time-weight problem well-conditioned without explicit shrinkage (Appendix 7.2 of the original paper shows the time weights concentrating exclusively on 1986–1988 for Proposition 99).

Stage 3: Weighted regression. Estimate $\tau$ from a weighted two-way fixed effects regression with observation weights $w_{it}$:

$$\hat{\tau}^{\text{sdid}} = \arg\min_{\tau, \alpha, \beta} \sum_{i,t} w_{it} \left(Y_{it} - \alpha_i - \beta_t - \tau D_{it}\right)^2,$$

where the weight matrix has the structure:

	Pre-treatment	Post-treatment
Control	$\hat{\omega}_i \hat{\lambda}_t$	$\hat{\omega}_i / T_{\text{post}}$
Treated	$\hat{\lambda}_t / N_{\text{tr}}$	$1 / (N_{\text{tr}} T_{\text{post}})$

I want to draw the reader’s attention explicitly here to separation of concerns in that unit weights see only pre-treatment data, time weights see only control-unit data, and the treatment effect uses the full panel with the weights fixed.

A modular Bayesian formulation

Why joint estimation fails

The obvious first attempt would be to place priors on $\omega$, $\lambda$, and $\tau$, write a likelihood that factorises over all the data, and sample. However, unfortunately, this fails as the outcome likelihood in Stage 3 would update the weight parameters, pulling them away from their balancing objective. The unit weights would no longer reflect pre-treatment balance and they would instead be distorted by exactly the post-treatment data they’re supposed to be decoupled from. I demonstrate this bias empirically in the Results section below and to distinguish from the cut-model, I shall refer to it as the coupled model.

A natural follow-up question would be “if the weight modules must be insulated from the treatment-effect module, then why not simply run them as entirely separate Bayesian models?”. I think you can certainly do this, as Modules 1 and 2 are conditionally independent given the data and, consequently, share no parameters. For me, this is simply a matter of taste, and running them in a single NumPyro model is purely a matter of convenience whereby one MCMC call handles all inference. The important point is that the cut is not about separating the weight modules from each other but instead about separating both weight modules from the treatment-effect computation. Whether Modules 1 and 2 live in one model or two is immaterial to the cut.

The cut posterior

Under the cut-posterior framework (Plummer, 2015), we would first partition the model into a set of constituent modules, each with its own likelihood, and cut the feedback from downstream modules to upstream parameters. The posterior then factorises as:

$$p(\tau, \omega, \lambda \mid Y) \propto p(\tau \mid \omega, \lambda, Y) \; p(\omega \mid Y_{\text{pre}}) \; p(\lambda \mid Y_{\text{co}}).$$

Each factor is a separate module:

Module 1: $p(\omega \mid Y_{\text{pre}})$: the unit-weight posterior, identified by pre-treatment matching.
Module 2: $p(\lambda \mid Y_{\text{co}})$: the time-weight posterior, identified by control-unit matching.
Module 3: $p(\tau \mid \omega, \lambda, Y)$: the treatment effect, determined by the SDiD double-difference formula given the weights.

The cut blocks the treatment-effect computation from feeding information back into the weight parameters. Uncertainty in the weights still flows forward, since each draw of $(\omega, \lambda)$ produces a different estimate of $\tau$.

Figure 2: Information flow in the cut posterior. Solid arrows show forward propagation of weight draws into the treatment-effect computation. Dashed arrows with ✗ show the cut — Module 3 cannot update the weight posteriors.

Prior specification

The unit weights should belong to the simplex. In my Bayesian SC I used the Dirichlet prior to achieve this; however, here I shall constrain the unit weights to the simplex via the softmax: $\omega = \text{softmax}(\tilde{\omega})$ with $\tilde{\omega}_1 = 0$ (reference level) and $\tilde{\omega}_j \sim \mathcal{N}(0, 1/\zeta_\omega)$ for $j = 2, \ldots, N_{\text{co}}$. Pinning one logit removes softmax’s shift invariance ($\text{softmax}(x+c) = \text{softmax}(x)$), which would otherwise create a flat direction in the posterior. The prior scale $1/\zeta_\omega$ plays the role of $\ell_2$ regularisation in the frequentist formulation whereby a large $\zeta_\omega$ will pull the values toward uniform weights, whilst small $\zeta_\omega$ lets the data concentrate weight on a few well-matching control units.

It is worth pausing here to observe that the spectrum of weight values given by $\zeta_\omega$ allows us to connect the estimator to SC and DiD. In the limit of completely-uniform weights, we have a DiD-like estimator, whilst the very sparse weights given by a small $\zeta_\omega$ value are more akin to those estimated in SC.

The matching for the unit weights is:

$$\bar{Y}_{\text{tr},t} \sim \mathcal{N}\!\left(\omega_0 + \boldsymbol{\omega}^\top \mathbf{Y}_{\text{co},t},\; \sigma_\omega^2\right), \quad t = 1, \ldots, T_{\text{pre}}.$$

For the time weights, the same softmax parameterisation is used whereby $\lambda = \text{softmax}(\tilde{\lambda})$ with $\tilde{\lambda}_1 = 0$ and $\tilde{\lambda}_s \sim \mathcal{N}(0, 1/\zeta_\lambda)$ for $s = 2, \ldots, T_{\text{pre}}$. Following the original paper, I set $\zeta_\lambda = 0.01$, giving a prior SD of 100 on the logits. This is essentially a diffuse flat prior. The time weights can concentrate on whichever pre-treatment periods are most informative without being shrunk toward uniformity. The time weights’ associated likelihood is:

$$\bar{Y}_{i,\text{post}} \sim \mathcal{N}\!\left(\lambda_0 + \boldsymbol{\lambda}^\top \mathbf{Y}_{i,\text{pre}},\; \sigma_\lambda^2\right), \quad i \in \text{control}.$$

The convenient detail that means I do not need a full cut-posterior is that the treatment effect $\tau$ does not need to be sampled, and can instead computed analytically for each posterior draw of $(\omega, \lambda)$ via the SDiD double-difference formula:

$$\hat{\tau}(\omega, \lambda) = \left(\bar{g}_{\text{post}} - \boldsymbol{\lambda}^\top \mathbf{g}_{\text{pre}}\right), \qquad g_t = \bar{Y}_{\text{tr},t} - \boldsymbol{\omega}^\top \mathbf{Y}_{\text{co},t}.$$

This is equivalent to the weighted two-way fixed-effects regression used in Arkhangelsky et al. (2021) but sidesteps the $N + T - 1$ fixed-effect parameters entirely. The weight-only posterior of $\tau$ is the pushforward of the weight posteriors through this formula; I augment it with a posterior predictive that accounts for idiosyncratic outcome noise, using the empirical within-unit covariance and the weight-specific structure of the double-difference.

Data

The Proposition 99 dataset is a common benchmark for SC methods. In November 1988, California imposed a 25-cent-per-pack excise tax on cigarettes along with restrictions on smoking in public spaces. The dataset contains annual per-capita cigarette sales (in packs) for 39 US states from 1970 to 2000.

Show imports

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import arviz as az
import jax
import jax.lax
import jax.random as jr
import jax.numpy as jnp
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import numpyro
import pandas as pd
import seaborn as sns

from numpyro import distributions as dist
from numpyro.infer import MCMC, NUTS

jax.config.update("jax_platform_name", "cpu")
numpyro.set_host_device_count(4)
key = jr.PRNGKey(123)
sns.set_theme(
    context="notebook",
    font="serif",
    style="whitegrid",
    palette="deep",
    rc={"figure.figsize": (6, 3), "axes.spines.top": False, "axes.spines.right": False},
)
cols = mpl.rcParams["axes.prop_cycle"].by_key()["color"]

Show data preprocessing

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
data = pd.read_csv("datasets/california_smoking.csv", sep=";")

treatment_year = 1989
treated_state = "California"

controls = data.loc[data.state != treated_state].pivot(
    index="year", columns="state", values="cigsale"
)
treated = data.loc[data.state == treated_state, ["year", "cigsale"]].set_index("year")

control_states = list(controls.columns)
years = sorted(data["year"].unique().astype(int))
N_co = len(control_states)
T_pre = len([y for y in years if y < treatment_year])
T_post = len([y for y in years if y >= treatment_year])
N = N_co + 1
T = T_pre + T_post

N_tr = N - N_co

Y_co = controls.values.T
Y_tr = treated.values.T
Y = jnp.array(np.vstack([Y_co, Y_tr]))

Y_mean = float(Y.mean())
Y_std = float(Y.std())
Y_scaled = (Y - Y_mean) / Y_std

Show panel plot

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
def clean_legend(ax):
    handles, labels = ax.get_legend_handles_labels()
    by_label = dict(zip(labels, handles, strict=False))
    ax.legend(by_label.values(), by_label.keys(), loc="best")
    return ax


fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(
    years,
    controls.values,
    color="grey",
    alpha=0.3,
    linewidth=0.8,
    label="Control states",
)
ax.plot(years, treated.values, color=cols[0], linewidth=2, label="California")
ax.axvline(treatment_year, color=cols[2], linestyle="--", label="Proposition 99")
clean_legend(ax)
ax.set(xlabel="Year", ylabel="Per-capita cigarette sales (packs)")
fig.tight_layout()

Show plotting helpers

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
def plot_weight_posteriors(
    samples: dict,
    state_names: list[str],
    pre_years: list[int],
) -> plt.Figure:
    """Bar charts of unit and time weight posteriors with credible intervals."""
    N_co = len(state_names)
    T_pre = len(pre_years)
    omega = np.array(samples["omega"]).reshape(-1, N_co)
    lam = np.array(samples["lam"]).reshape(-1, T_pre)

    # Sort states by posterior median weight (descending)
    med_omega = np.median(omega, axis=0)
    sort_idx = np.argsort(med_omega)[::-1]

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    # Unit weights (sorted)
    ax = axes[0]
    sorted_med = med_omega[sort_idx]
    sorted_lo, sorted_hi = np.percentile(omega[:, sort_idx], [3, 97], axis=0)
    sorted_names = [state_names[i] for i in sort_idx]
    x = np.arange(N_co)
    ax.bar(x, sorted_med, color=cols[1], alpha=0.6, edgecolor="white")
    ax.vlines(x, sorted_lo, sorted_hi, color=cols[1], linewidth=1.2)
    ax.axhline(
        1.0 / N_co,
        color="grey",
        linestyle="--",
        linewidth=1,
        label=f"Uniform (1/{N_co})",
    )
    ax.set_xticks(x[::3])
    ax.set_xticklabels(
        [sorted_names[i] for i in range(0, N_co, 3)],
        rotation=70,
        fontsize=7,
        ha="right",
    )
    ax.set(ylabel="Weight", title="Unit Weights ($\\omega$)")
    ax.legend(fontsize=8)

    # Time weights
    ax = axes[1]
    med_lam = np.median(lam, axis=0)
    lo_lam, hi_lam = np.percentile(lam, [3, 97], axis=0)
    ax.bar(pre_years, med_lam, color=cols[3], alpha=0.6, edgecolor="white")
    ax.vlines(pre_years, lo_lam, hi_lam, color=cols[3], linewidth=1.5)
    ax.axhline(
        1.0 / T_pre,
        color="grey",
        linestyle="--",
        linewidth=1,
        label=f"Uniform (1/{T_pre})",
    )
    ax.set(xlabel="Year", ylabel="Weight", title="Time Weights ($\\lambda$)")
    ax.set_xticks(pre_years[::2])
    ax.set_xticklabels([str(y) for y in pre_years[::2]], rotation=45, fontsize=7, ha="right")
    ax.legend(fontsize=8)

    fig.tight_layout()
    return fig


def plot_counterfactual(
    samples: dict,
    Y: np.ndarray,
    N_co: int,
    T_pre: int,
    years: list[int],
    treatment_year: int,
    Y_mean: float = 0.0,
    Y_std: float = 1.0,
) -> plt.Figure:
    """Observed vs synthetic control trajectory and period-by-period effects."""
    omega = np.array(samples["omega"]).reshape(-1, N_co)
    omega0 = np.array(samples["omega0"]).reshape(-1)

    N, T = Y.shape

    # Treated trajectory (original scale)
    Y_co = Y[:N_co]
    Y_tr = Y[N_co:].mean(axis=0)

    # Synthetic control for each posterior draw
    # omega was fit on scaled data, so compute on scaled data then back-transform
    Y_co_scaled = (Y_co - Y_mean) / Y_std
    sc_draws_scaled = omega0[:, None] + omega @ Y_co_scaled
    sc_draws = sc_draws_scaled * Y_std + Y_mean
    sc_median = np.median(sc_draws, axis=0)
    sc_lo, sc_hi = np.percentile(sc_draws, [3, 97], axis=0)

    fig, axes = plt.subplots(
        1, 2, figsize=(12, 4), gridspec_kw={"width_ratios": [2, 1]}
    )

    # Left: trajectories
    ax = axes[0]
    ax.plot(
        years,
        Y_tr,
        "o-",
        color="black",
        markersize=3,
        linewidth=1.5,
        label="California (observed)",
    )
    ax.plot(
        years,
        sc_median,
        "s--",
        color=cols[1],
        markersize=3,
        linewidth=1.5,
        label="Synthetic California",
    )
    ax.fill_between(years, sc_lo, sc_hi, alpha=0.2, color=cols[1], label="94% CI")
    ax.axvline(
        treatment_year - 0.5,
        color=cols[2],
        linestyle="--",
        linewidth=2.5,
        label="Proposition 99",
    )
    ax.set(
        xlabel="Year",
        ylabel="Per-capita cigarette sales (packs)",
        title="California vs. Synthetic Control",
    )
    ax.legend(fontsize=8)

    # Right: period-by-period effect
    ax2 = axes[1]
    post_years = [y for y in years if y >= treatment_year]
    te_draws = Y_tr[T_pre:][np.newaxis, :] - sc_draws[:, T_pre:]
    te_median = np.median(te_draws, axis=0)
    te_lo, te_hi = np.percentile(te_draws, [3, 97], axis=0)

    ax2.bar(post_years, te_median, color=cols[1], alpha=0.6, edgecolor="white")
    ax2.vlines(post_years, te_lo, te_hi, color=cols[1], linewidth=2)
    ax2.axhline(0, color="grey", linestyle="--", linewidth=0.5)
    ax2.set(
        xlabel="Year",
        ylabel="$\\hat{\\tau}_t$ (packs)",
        title="Period-by-Period Effect",
    )

    fig.tight_layout()
    return fig

Model

The NumPyro model below encodes Modules 1 and 2. Module 3 (the treatment effect) is computed deterministically from the weight posterior draws.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
def bayesian_sdid(
    Y: jnp.ndarray,
    N_co: int,
    T_pre: int,
    zeta_omega: float = 1.0,
    zeta_lambda: float = 0.01,
):
    N, T = Y.shape
    Y_co_pre = Y[:N_co, :T_pre]
    y_tr_pre_mean = Y[N_co:, :T_pre].mean(axis=0)
    y_co_post_mean = Y[:N_co, T_pre:].mean(axis=1)

    # Module 1: Unit weights
    omega_raw = numpyro.sample(
        "omega_raw", dist.Normal(0, 1.0 / zeta_omega).expand([N_co - 1])
    )
    omega_tilde = jnp.concatenate([jnp.zeros(1), omega_raw])
    omega = numpyro.deterministic("omega", jax.nn.softmax(omega_tilde))
    omega0 = numpyro.sample("omega0", dist.Normal(0, 5.0))
    sigma_omega = numpyro.sample("sigma_omega", dist.HalfNormal(1.0))
    # Corresponding unit weights' liklelihood
    numpyro.sample(
        "omega_match",
        dist.Normal(omega0 + Y_co_pre.T @ omega, sigma_omega),
        obs=y_tr_pre_mean,
    )

    # Module 2: Time weights
    lambda_raw = numpyro.sample(
        "lambda_raw", dist.Normal(0, 1.0 / zeta_lambda).expand([T_pre - 1])
    )
    lambda_tilde = jnp.concatenate([jnp.zeros(1), lambda_raw])
    lam = numpyro.deterministic("lam", jax.nn.softmax(lambda_tilde))
    lambda0 = numpyro.sample("lambda0", dist.Normal(0, 5.0))
    sigma_lambda = numpyro.sample("sigma_lambda", dist.HalfNormal(1.0))
    # Corresponding time weights' likelihood
    numpyro.sample(
        "lambda_match",
        dist.Normal(lambda0 + Y_co_pre @ lam, sigma_lambda),
        obs=y_co_post_mean,
    )

Each module defines its own likelihood (omega_match and lambda_match) over its data slice. The weights are sampled without any treatment-effect likelihood.

Figure 3: The Bayesian SDiD model. Modules 1 and 2 are sampled jointly via MCMC. The treatment effect τ is computed analytically from weight posterior draws, enforcing the cut by construction.

Implementing the “cut” in NumPyro

As previously mentioned, Modules 1 and 2 are already parameter-independent as no parameter from the unit-weight module appears in the time-weight likelihood, and vice versa. The joint posterior factorises naturally as $p(\omega, \lambda \mid Y) = p(\omega \mid Y)\, p(\lambda \mid Y)$, so there is nothing to “cut” between these two modules. A formal cut posterior would only be required if the modules shared parameters and we wanted to block feedback. An example of this would be if a third outcome-likelihood module depended on both $\omega$ and $\lambda$ and we applied jax.lax.stop_gradient to prevent the outcome data from updating the weight posteriors.

The regularisation parameters mirror the asymmetry in the original SDiD. zeta_omega = 1.0 gives a prior standard deviation of 1 on the softmax logits. This is weak enough for the matching likelihood to push weights away from uniform toward sparse, SC-like solutions, but strong enough to prevent degenerate concentration on a single state. zeta_lambda = 0.01 gives a prior standard deviation of 100, essentially flat, so the time weights can concentrate on whichever pre-treatment periods are most informative.

Coupled joint model

For comparison, I fit a coupled joint model that adds a post-treatment gap likelihood inside the sampler. Because the post-treatment likelihood depends on both the unit weights and the time weights, it feeds outcome information back into the balancing modules. This is precisely the feedback channel the cut model avoids. Given I spent time explaining conceptually why this is problematic, its worthwhile empirically (in)validating this point.

Show coupled joint model

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
def coupled_joint_bayesian_sdid(
    Y: jnp.ndarray,
    N_co: int,
    T_pre: int,
    zeta_omega: float = 1.0,
    zeta_lambda: float = 0.01,
):
    N, T = Y.shape
    Y_co = Y[:N_co]
    y_tr = Y[N_co:, :].mean(axis=0)
    Y_co_pre = Y_co[:, :T_pre]
    y_tr_pre_mean = y_tr[:T_pre]
    y_co_post_mean = Y_co[:, T_pre:].mean(axis=1)

    omega_raw = numpyro.sample(
        "omega_raw", dist.Normal(0, 1.0 / zeta_omega).expand([N_co - 1])
    )
    omega_tilde = jnp.concatenate([jnp.zeros(1), omega_raw])
    omega = numpyro.deterministic("omega", jax.nn.softmax(omega_tilde))
    omega0 = numpyro.sample("omega0", dist.Normal(0, 5.0))
    sigma_omega = numpyro.sample("sigma_omega", dist.HalfNormal(1.0))
    numpyro.sample(
        "omega_match",
        dist.Normal(omega0 + Y_co_pre.T @ omega, sigma_omega),
        obs=y_tr_pre_mean,
    )

    lambda_raw = numpyro.sample(
        "lambda_raw", dist.Normal(0, 1.0 / zeta_lambda).expand([T_pre - 1])
    )
    lambda_tilde = jnp.concatenate([jnp.zeros(1), lambda_raw])
    lam = numpyro.deterministic("lam", jax.nn.softmax(lambda_tilde))
    lambda0 = numpyro.sample("lambda0", dist.Normal(0, 5.0))
    sigma_lambda = numpyro.sample("sigma_lambda", dist.HalfNormal(1.0))
    numpyro.sample(
        "lambda_match",
        dist.Normal(lambda0 + Y_co_pre @ lam, sigma_lambda),
        obs=y_co_post_mean,
    )

    # Post-treatment gap likelihood — creates the feedback channel
    gaps = numpyro.deterministic("gaps", y_tr - omega @ Y_co)
    baseline_gap = numpyro.deterministic("baseline_gap", jnp.dot(lam, gaps[:T_pre]))
    tau = numpyro.sample("tau", dist.Normal(0, 10.0))
    sigma_gap = numpyro.sample("sigma_gap", dist.HalfNormal(1.0))
    numpyro.sample(
        "post_gap_obs",
        dist.Normal(baseline_gap + tau, sigma_gap).expand([T - T_pre]),
        obs=gaps[T_pre:],
    )

Posterior sampling

I sample the weight parameters via NUTS. In total, the model has 58 sampled parameters - $(N_{\text{co}} - 1) + (T_{\text{pre}} - 1) = 54$ weight logit parameters plus 4 matching parameters ($\omega_0$, $\lambda_0$, $\sigma_\omega$, $\sigma_\lambda$). After sampling, I compute $\tau$ for each posterior draw via the double-difference formula.

Show posterior predictive noise helper

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
def posterior_predictive_noise(
    Y_co: np.ndarray,
    omega: np.ndarray,
    lam: np.ndarray,
    T_pre: int,
    Y_std: float,
) -> np.ndarray:
    """Compute per-draw posterior predictive noise SD for the double-difference.

    Uses the empirical within-unit covariance Σ̂ from the de-meaned control
    panel.  Under independence across units the noise variance is:

        V(ω, λ) = (1 + ‖ω‖²) · c' Σ̂ c

    where c_t = −λ_t (pre) or 1/T_post (post).

    Returns σ_τ in original (packs) scale for each flattened draw.
    """
    N_co, T = Y_co.shape
    T_post = T - T_pre

    # TWFE residuals
    resid = (
        Y_co
        - Y_co.mean(axis=1, keepdims=True)
        - Y_co.mean(axis=0, keepdims=True)
        + Y_co.mean()
    )

    # Empirical within-unit T×T covariance, pooled across control units
    Sigma_hat = np.einsum("nt,ns->ts", resid, resid) / N_co

    # Flatten chains
    omega_flat = omega.reshape(-1, N_co)
    lam_flat = lam.reshape(-1, T_pre)
    n_draws = omega_flat.shape[0]

    # Time-weight contrast vector
    c = np.zeros((n_draws, T))
    c[:, :T_pre] = -lam_flat
    c[:, T_pre:] = 1.0 / T_post

    # Quadratic form c' Σ̂ c and unit factor
    cSc = np.einsum("dt,ts,ds->d", c, Sigma_hat, c)
    unit_factor = 1.0 + (omega_flat**2).sum(axis=1)
    V_draw = unit_factor * cSc

    return np.sqrt(V_draw) * Y_std

Show sampling routine

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
kernel = NUTS(bayesian_sdid, target_accept_prob=0.95)
mcmc = MCMC(
    kernel,
    num_warmup=4000,
    num_samples=3000,
    num_chains=4,
    chain_method="parallel",
    progress_bar=False,
)
key, subkey = jr.split(key)
mcmc.run(
    subkey,
    Y=Y_scaled,
    N_co=N_co,
    T_pre=T_pre,
    zeta_omega=1.0,
    zeta_lambda=0.01,
)
samples = mcmc.get_samples(group_by_chain=True)

coupled_kernel = NUTS(coupled_joint_bayesian_sdid, target_accept_prob=0.95)
coupled_mcmc = MCMC(
    coupled_kernel,
    num_warmup=4000,
    num_samples=3000,
    num_chains=4,
    chain_method="parallel",
    progress_bar=False,
)
key, subkey = jr.split(key)
coupled_mcmc.run(
    subkey,
    Y=Y_scaled,
    N_co=N_co,
    T_pre=T_pre,
    zeta_omega=1.0,
    zeta_lambda=0.01,
)
coupled_samples = coupled_mcmc.get_samples(group_by_chain=True)
coupled_tau = np.array(coupled_samples["tau"]) * Y_std

omega_all = np.array(samples["omega"])
lam_all = np.array(samples["lam"])
Y_co_np = np.array(Y_scaled[:N_co])
y_tr_np = np.array(Y_scaled[N_co:].mean(axis=0))

gaps = y_tr_np - omega_all @ Y_co_np
tau_scaled = gaps[..., T_pre:].mean(axis=-1) - (gaps[..., :T_pre] * lam_all).sum(
    axis=-1
)
tau_packs = tau_scaled * Y_std

# ── Posterior predictive: add residual outcome noise ──────────────
sigma_tau = posterior_predictive_noise(
    Y_co=np.array(Y_scaled[:N_co]),
    omega=omega_all,
    lam=lam_all,
    T_pre=T_pre,
    Y_std=Y_std,
)

rng_tau = np.random.default_rng(0)
tau_pp = tau_packs.flatten() + rng_tau.normal(0, sigma_tau)
tau_pp = tau_pp.reshape(tau_packs.shape)

idata = az.from_dict(
    posterior={
        "tau": tau_packs,
        "tau_pp": tau_pp,
        "omega": omega_all,
        "lam": lam_all,
        "sigma_omega": np.array(samples["sigma_omega"]),
        "sigma_lambda": np.array(samples["sigma_lambda"]),
    }
)

az.summary(
    idata, var_names=["tau", "tau_pp", "sigma_omega", "sigma_lambda"], hdi_prob=0.94
)

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
tau	-14.562	2.417	-18.393	-9.682	0.099	0.026	629.0	3007.0	1.01
tau_pp	-14.508	10.062	-33.688	4.258	0.125	0.066	6502.0	10062.0	1.00
sigma_omega	0.091	0.025	0.048	0.135	0.000	0.000	4722.0	6708.0	1.00
sigma_lambda	0.295	0.036	0.231	0.362	0.000	0.000	11136.0	8104.0	1.00

Results

Treatment effect

The posterior of $\tau$ gives the ATT of Proposition 99 on per-capita cigarette sales: $\tau = \mathbb{E}[Y(1) - Y(0) \mid \text{treated}]$, where $Y(0)$ is the counterfactual outcome absent the policy. Negative values mean the policy reduced consumption. In the below plot I first show the weight-only posterior which is the pushforward of weight uncertainty through the double-difference formula. In the right-panel I also show the full posterior predictive which additionally integrates over idiosyncratic outcome noise, using the empirical within-unit covariance and the weight-specific structure of the double-difference.

Show posterior plotting code

1
2
3
4
5
6
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
az.plot_posterior(idata, var_names=["tau"], hdi_prob=0.94, ax=axes[0])
axes[0].set_title("Weight-only posterior")
az.plot_posterior(idata, var_names=["tau_pp"], hdi_prob=0.94, ax=axes[1])
axes[1].set_title("Posterior predictive (+ outcome noise)")
fig.tight_layout()

Coupled joint estimation comparison

To see the information restriction in action, we can also compare the posterior distributions of $\tau$ as estimated by the cut model and the coupled joint model. As a point of reference, we also overlay the original frequentist SDiD’s point estimate (-15.6) of the ATT.

Show cut vs coupled comparison

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
fig, ax = plt.subplots(figsize=(7, 3.5))

posterior_draws = tau_packs.flatten()
coupled_draws = coupled_tau.flatten()
ax.hist(
    posterior_draws,
    bins=50,
    density=True,
    alpha=0.6,
    color=cols[1],
    edgecolor="white",
    label=f"Cut model (mean={posterior_draws.mean():.1f})",
)
ax.axvline(x=posterior_draws.mean(), color=cols[1], linestyle="--")
ax.hist(
    coupled_draws,
    bins=50,
    density=True,
    alpha=0.6,
    color=cols[2],
    edgecolor="white",
    label=f"Coupled joint (mean={coupled_draws.mean():.1f})",
)
ax.axvline(x=coupled_draws.mean(), color=cols[2], linestyle="--")
ax.axvline(x=-15.6, color="black", label="SDiD Reference", linestyle="--")
ax.set(
    xlabel="Treatment effect (packs)",
    ylabel="Density",
    title="$\\tau$: Cut vs. Coupled Joint Estimation",
)
ax.legend(fontsize=8)

fig.tight_layout()

Weight posteriors

The weight posteriors show which control states and which pre-treatment years the model leans on. Sparse unit weights mean a few states dominate synthetic California (SC-like); near-uniform weights mean all controls contribute roughly equally (DiD-like). Time weights that pile up on later pre-treatment years suggest the period just before Proposition 99 is most informative for the counterfactual.

Show weight posterior plotting code

1
2
pre_years = [int(y) for y in years if y < treatment_year]
fig = plot_weight_posteriors(mcmc.get_samples(), control_states, pre_years)

Posterior predictive checks

Both matching likelihoods make testable predictions. Module 1 predicts the treated unit’s pre-treatment period means as a weighted average of the control panel, and Module 2 predicts each control unit’s post-treatment mean from a weighted average of its pre-treatment trajectory. Posterior predictive checks allow one to verify that $\sigma_\omega$ and $\sigma_\lambda$ aren’t absorbing excessive slack whereby wide intervals would signal that the noise parameters are compensating for poor structural fit.

Show MCMC sample postprocessing

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
ppc_samples = mcmc.get_samples()
omega_ppc = np.array(ppc_samples["omega"])
omega0_ppc = np.array(ppc_samples["omega0"])
sigma_omega_ppc = np.array(ppc_samples["sigma_omega"])
lam_ppc = np.array(ppc_samples["lam"])
lambda0_ppc = np.array(ppc_samples["lambda0"])
sigma_lambda_ppc = np.array(ppc_samples["sigma_lambda"])

Y_co_pre = np.array(Y_scaled[:N_co, :T_pre])
y_tr_pre_obs = np.array(Y_scaled[N_co:, :T_pre].mean(axis=0))
y_co_post_obs = np.array(Y_scaled[:N_co, T_pre:].mean(axis=1))

rng = np.random.default_rng(42)

mu_omega = omega0_ppc[:, None] + omega_ppc @ Y_co_pre
ppc_omega = rng.normal(mu_omega, sigma_omega_ppc[:, None])

mu_lambda = lambda0_ppc[:, None] + lam_ppc @ Y_co_pre.T
ppc_lambda = rng.normal(mu_lambda, sigma_lambda_ppc[:, None])

Show PPC plots

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: Module 1 — time series PPC
ax = axes[0]
ppc_mean = ppc_omega.mean(axis=0)
ppc_lo, ppc_hi = np.percentile(ppc_omega, [3, 97], axis=0)
ax.fill_between(pre_years, ppc_lo, ppc_hi, alpha=0.3, color=cols[1],
                label="94% PPC interval")
ax.plot(pre_years, ppc_mean, color=cols[1], linewidth=1.5, label="PPC mean")
ax.plot(pre_years, y_tr_pre_obs, "ko", markersize=5,
        label="Observed $\\bar{Y}_{\\mathrm{tr},t}$")
ax.set(xlabel="Year", ylabel="Standardised sales",
       title="Module 1: Treated pre-treatment means")
ax.legend(fontsize=8)

# Right: Module 2 — calibration scatter
ax = axes[1]
mu_lam_median = np.median(mu_lambda, axis=0)
ppc_lo_lam, ppc_hi_lam = np.percentile(ppc_lambda, [3, 97], axis=0)
ax.errorbar(y_co_post_obs, mu_lam_median,
            yerr=[mu_lam_median - ppc_lo_lam, ppc_hi_lam - mu_lam_median],
            fmt="o", color=cols[1], markersize=4, linewidth=1, alpha=0.7,
            label="Control units")
lims = [min(y_co_post_obs.min(), mu_lam_median.min()) - 0.2,
        max(y_co_post_obs.max(), mu_lam_median.max()) + 0.2]
ax.plot(lims, lims, "k--", linewidth=1, alpha=0.5, label="Identity line")
ax.set(xlabel="Observed $\\bar{Y}_{i,\\mathrm{post}}$",
       ylabel="Predicted $\\bar{Y}_{i,\\mathrm{post}}$",
       title="Module 2: Control post-treatment means")
ax.legend(fontsize=8)

fig.tight_layout()

The left panel shows Module 1 fits the pre-treatment data well where the mean of the observed treated-unit sits inside the 94% PPC intervals. The right panel shows Module 2’s predictions clustering along the identity line. This is indicative of a well calibrated model. Tight intervals reflect the small posterior values of $\sigma_\omega$ and $\sigma_\lambda$, confirming that the matching variances aren’t papering over poor structural fit.

A note on workflow: numpyro.infer.Predictive (which I use in the companion Bayesian SC post) is the more standard way to generate posterior predictive samples in NumPyro. However, I compute the checks manually here to aid understanding. For Module 3 ($\tau$), Predictive cannot help precisely because $\tau$ is outside the model.

Counterfactual trajectories

The synthetic California trajectory $\hat{Y}_t^{\text{sc}} = \omega_0 + \sum_j \omega_j Y_{jt}$, computed for each posterior draw, gives a full posterior distribution over the counterfactual. The pre-treatment portion of this trajectory serves as a visual validation of the matching: if synthetic California tracks observed California well before 1989, the post-treatment divergence is a credible causal estimate. The gap in the post-treatment period between California and synthetic California is the implied treatment effect $\tau$: the estimated reduction in annual per-capita cigarette sales (packs) caused by Proposition 99. Negative values indicate the policy reduced smoking.

Show counterfactual plotting

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
fig = plot_counterfactual(
    mcmc.get_samples(),
    np.array(Y),
    N_co,
    T_pre,
    [int(y) for y in years],
    treatment_year,
    Y_mean=Y_mean,
    Y_std=Y_std,
)

Comparison to frequentist estimators

Table 3 of Arkhangelsky et al. (2021) reports point estimates and Jackknife standard errors for five estimators applied to Proposition 99. I overlay the SDiD, SC, and DiD estimates on both the weight-only posterior and the posterior predictive (which includes outcome noise) to position the cut-posterior estimate’s within the scope of the original work.

Show comparison plot

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
# Frequentist estimates from Arkhangelsky et al. (2021), Table 3
freq_estimates = {
    "SDiD": (-15.6, 8.4),
    "SC": (-19.6, 9.9),
    "DiD": (-27.3, 17.7),
}

tau_draws = tau_packs.flatten()
tau_pp_draws = tau_pp.flatten()
freq_cols = [cols[2], cols[3], cols[4]]

fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharey="row", sharex=True)

for j, (name, (est, se)), fc in zip(range(3), freq_estimates.items(), freq_cols):
    for i, (draws, row_label) in enumerate([
        (tau_draws, "Weight-only"),
        (tau_pp_draws, "Posterior predictive"),
    ]):
        ax = axes[i, j]
        ax.axvline(est, color=fc, linewidth=2, linestyle="--",
                   label=f"{name} = {est:.1f}")
        ax.axvspan(est - se, est + se, alpha=0.075, color=fc,
                   label=f"{name} $\\pm$ 1 SE")
        ax.hist(draws, bins=50, density=True, alpha=0.75, color=cols[1],
                edgecolor="white", label=row_label)
        ax.axvline(draws.mean(), color="black", linewidth=1.5, linestyle="--",
                   label=f"Bayesian mean = {draws.mean():.1f}")
        if j == 0:
            ax.set_ylabel("Density")
        if i == 0:
            ax.set_title(f"vs. {name}")
        if i == 1:
            ax.set_xlabel("Treatment effect (packs)")

# Row labels
for i, label in enumerate(["Weight-only posterior", "Posterior predictive"]):
    axes[i, 0].annotate(label, xy=(-0.35, 0.5), xycoords="axes fraction",
                        fontsize=9, ha="center", va="center", rotation=90)

handles, labels = [], []
for ax in axes.flat:
    for h, l in zip(*ax.get_legend_handles_labels()):
        if l not in labels:
            handles.append(h)
            labels.append(l)

fig.legend(handles, labels, loc="lower center", ncol=4, fontsize=8,
           bbox_to_anchor=(0.5, -0.06))
fig.tight_layout()
fig.subplots_adjust(bottom=0.16)

The Bayesian posterior mean sits close to the frequentist SDiD point estimate, confirming that the cut-posterior formulation recovers a comparable signal. The top row shows the weight-only posterior, which is substantially tighter than the frequentist confidence intervals. The bottom row adds outcome noise: the posterior predictive is wider and comparable in magnitude to the Jackknife SEs, though somewhat more diffuse. The Discussion section below unpacks why the two sources of uncertainty differ.

Discussion

The cut-posterior formulation keeps the three-stage structure of Arkhangelsky et al. (2021) intact while adding uncertainty quantification through Bayesian inference. Weight posteriors are the main interpretive gain over the frequentist point estimates as they show not just which states and years receive weight, but how confident the model is in those allocations. I anticipate that this can be useful for both model introspection and also experimental design / unit selection.

Sources of uncertainty: weight posterior vs. outcome noise

The weight-only posterior for $\tau$ is tighter than the frequentist standard error. For Proposition 99, the weight-only posterior standard deviation is $\approx 2.4$; Arkhangelsky et al. report $\hat{\tau} = -15.6$ with a standard error $= 8.4$. That’s a roughly 3.5$\times$ difference in uncertainty, and it reflects a genuine difference in what the two intervals measure.

The frequentist standard accounts for idiosyncratic outcome noise $\varepsilon_{it}$ across units. It answers: “how variable would $\hat{\tau}$ be across repeated samples from the same data-generating process?” This includes weight estimation variability and residual outcome noise.

Conversely, the weight-only posterior standard deviation reflects only uncertainty of the cut-model’s weights. Different posterior draws of $(\omega, \lambda)$ yield different estimates of $\tau$, but there is no residual noise model for the outcomes as $\tau$ is a deterministic function of the weights and the observed data. Because $\sigma_\omega$ and $\sigma_\lambda$ are small (the data strongly constrain the weights), the weight-only posterior over $\tau$ concentrates accordingly.

Finally,the posterior predictive yields wider intervals by adding a residual noise term. The key subtlety is that $\hat{\tau}$ is a weighted average, not a single observation. The double-difference applies a time-weight vector $\mathbf{c}$ to the gap series $g_t$, where $c_t = -\lambda_t$ for pre-treatment and $c_t = 1/T_{\text{post}}$ for post-treatment periods. Under independence across units, the noise variance for each posterior draw $(\omega, \lambda)$ is:

$$V(\omega, \lambda) = \left(1 + \|\omega\|_2^2\right) \cdot \mathbf{c}^\top \hat{\Sigma} \, \mathbf{c},$$

where $\hat{\Sigma}$ is the $T \times T$ within-unit covariance matrix estimated from the de-meaned control panel. The quadratic form $\mathbf{c}^\top \hat{\Sigma} \, \mathbf{c}$ captures both the variance reduction from averaging over post-treatment periods, and the covariance between pre- and post-treatment residuals. The latter is crucial, as the time weights concentrate on the years just before treatment because those periods are most correlated with post-treatment outcomes, and this correlation reduces the variance of the double-difference. An iid approximation would miss this covariance term entirely, producing intervals that are too wide. The unit factor $(1 + \|\omega\|_2^2)$ reflects the additional variance from the treated unit’s residuals and the synthetic control’s aggregation noise. For each draw I perturb $\tau$ by $\varepsilon \sim \mathcal{N}(0, \sqrt{V(\omega, \lambda)})$.

An alternative approach would be to add a full outcome likelihood as a third module inside the MCMC sampler, writing $Y_{it} = \alpha_i + \beta_t + \tau D_{it} + \varepsilon_{it}$ with jax.lax.stop_gradient on $\omega$ and $\lambda$ to block feedback to the weights. This would constitute a formal cut posterior and would be more “fully Bayesian” in that $\sigma_\varepsilon$ would be sampled jointly with $\tau$. However, it reintroduces the $N + T - 1$ fixed-effect parameters that the double-difference formula was designed to concentrate out, and the weighted heteroscedastic likelihood (with effective variance $\sigma_y / \sqrt{w_{it}}$, where $w_{it}$ can span several orders of magnitude) creates a challenging posterior geometry for NUTS. In preliminary experiments, this parameterisation produced R-hat values $> 2$ and multimodal posteriors, suggesting that the identifiability issues in the TWFE parameterisation interact poorly with the stopped-gradient landscape. For this dataset, the analytical plug-in approach avoids these difficulties entirely while producing comparable point estimates.

The placebo procedure used in Arkhangelsky et al. (2021) draws $N_{\text{tr}}$ units from the control pool without replacement and re-runs the full SDiD procedure — including re-estimating weights — for each placebo assignment. Its variance estimate therefore integrates over which control units happen to play the role of “treated,” implicitly averaging across weight configurations that are each tailored to their placebo assignment. In contrast, the posterior predictive conditions on the observed panel and integrates over weight uncertainty and outcome noise for the actual treated unit. These are subtly different questions: the placebo asks “how variable is $\hat{\tau}$ across reassignments of the treatment label?”, while the posterior predictive asks “how uncertain are we about $\tau$ given this particular treatment assignment and the noise structure of the panel?” The fact that the two estimates are within 20% of each other is reassuring, but a full calibration study — comparing coverage across simulation designs like those in Section 3 of the original paper — would be needed to determine which better tracks the true sampling variability. This post is already quite long though, so I may leave that for another post…

Extension to multiple treated units

The Proposition 99 application has a single treated unit ($N_{\text{tr}} = 1$), but the formulation extends to $N_{\text{tr}} > 1$ with minimal changes. The frequentist SDiD of Arkhangelsky et al. handles multiple treated units by collapsing them to their cross-sectional mean before weight optimisation, assigning each treated unit uniform weight $1/N_{\text{tr}}$ in the regression. The code above already does this: the matching target is Y[N_co:, :T_pre].mean(axis=0), which averages across treated units.

The time-weight module uses only control data and is unchanged. The double-difference formula is structurally identical and only the definition of the treated outcome changes through posterior predictive variance. The unit factor $(1 + \|\omega\|_2^2)$ becomes $(1/N_{\text{tr}} + \|\omega\|_2^2)$, because the treated mean has variance $1/N_{\text{tr}}$ per period rather than $1$. More treated units directly shrink the posterior predictive intervals.

A natural Bayesian extension would go further: rather than collapsing to the treated mean, one could estimate separate weight vectors per treated unit within a hierarchical model, yielding a posterior over unit-level effects $\tau_i$ rather than just the average. This connects to the partially pooled synthetic control of Ben-Michael, Feller & Rothstein (2022), where the degree of pooling across treated units is itself estimated from the data.

Limitations

The matching likelihoods’ variances $\sigma_\omega^2$ and $\sigma_\lambda^2$ have no direct frequentist counterpart. They control how tightly the weights must satisfy the balancing condition: too small and the weight posterior approaches the frequentist point estimate, whilst too large and the weights become diffuse and uninformative. I give them half-normal priors and let the data inform their scale, but sensitivity to this choice deserves attention in applied settings.

The softmax parameterisation forces all weights to be strictly positive. The frequentist solution can produce exact zeros (sparse weights). Where sparsity matters, a Dirichlet prior with concentration $< 1$ or a spike-and-slab prior on the logits would be more appropriate. I think for anyone interested in such an approach, my post on Bayesian SC post should contain the necessary information.

The matching modules use simple Normal likelihoods and do not explicitly model temporal dependence within the panel. That is consistent with the original SDiD estimator, which also does not directly specify a time-series model for the panel, but it remains a modelling limitation of the current Bayesian formulation. A natural extension would be to replace the working likelihoods with a structured temporal model, or otherwise introduce serial dependence directly into the matching modules, but that is beyond the scope of this post.

Acknowledgements

Thanks to Philipp Baumann, Juan Orduz, and Theo Rashid for reading an earlier version of this post. Additionally, as mentioned in the introduction, this post was written with the support of Opus 4.6. I think it’s worth acknowledging that by asking Opus 4.6 to do tasks such as “identify relevant papers in the literature” or “implement code to create a plot of counterfactual”, I was able to write this post. Of course it would have been possible without AI; however, the time required to do this outside of my job would have been significantly more burdensome.