Causal inference on MIMIC-II#
In the ICU, sicker patients receive more invasive monitoring. An indwelling arterial line (an ‘A-line’) gives clinicians continuous, accurate blood-pressure readings — useful when a patient is unstable — but the line itself carries infection and bleeding risk, and inserting it takes time and physician attention that might be spent elsewhere. A naïve mortality comparison between patients who did and did not receive an A-line shows the A-line group does worse on average. That should not be taken at face value: those patients were sicker to begin with, and any observational comparison conflates the treatment effect with the indication for treatment.
Causal inference is the toolkit for asking would mortality have been different if this patient had (not) received the line? — the kind of counterfactual question randomised trials answer by construction, and that observational analyses can only approach with explicit assumptions and adjustments. This tutorial walks through that adjustment on the MIMIC-II IAC cohort using four complementary estimators that ehrapy ships out of the box. By the end you should have a feel for what each estimator does, when to reach for which, and how to read their joint output.
Setup#
ehrapy plots are HoloViews objects rendered via Bokeh.
Calling hv.extension("bokeh") once in the notebook registers the inline display hooks so that returning an hv.Overlay from a cell renders as an interactive plot.
The cohort#
ed.dt.mimic_2_preprocessed() is the canonical 1,776-patient MIMIC-II ICU cohort assembled for the textbook Secondary Analysis of Electronic Health Records (Hsu, Mark et al., 2016).
It already includes a binary A-line flag, a 28-day mortality flag, and a panel of admission severity scores, so we can dive straight into the analysis.
edata = ed.dt.mimic_2_preprocessed()
edata
! This operation does not affect numeric layer X.
! This operation does not affect numeric layer original.
! This operation does not affect numeric layer raw_norm.
EHRData object with n_obs × n_vars × n_t = 1776 × 46 × 1
obs: 'service_unit', 'day_icu_intime', 'missing_values_abs', 'missing_values_pct'
var: 'missing_values_abs', 'missing_values_pct', 'mean', 'median', 'standard_deviation', 'min', 'max'
uns: 'day_icu_intime_colors', 'encoding_to_var', 'neighbors', 'non_numerical_columns', 'normalization', 'numerical_columns', 'original_values_categoricals', 'pca', 'service_unit_colors', 'umap', 'var_to_encoding'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'original', 'raw_norm'
obsp: 'connectivities', 'distances'
shape of .X: (1776, 46)
shape of .original: (1776, 46)
shape of .raw_norm: (1776, 46)
The question#
We want the causal effect of A-line placement (aline_flg) on 28-day all-cause mortality (day_28_flg).
Both are binary variables encoded in edata.X.
The choice of adjustment set is the most consequential modelling decision in any causal analysis: every confounder we omit weakens the ‘no unmeasured confounding’ assumption, and every collider or post-treatment variable we include can re-introduce bias. We adjust for the severity-at-admission variables that physicians demonstrably condition on when deciding whether to insert an A-line, plus basic demographics.
age,gender_num,weight_first,bmi— baseline demographics.sapsi_first,sofa_first— SAPS I and SOFA, the two routine first-day severity scores. Sicker patients (higher SOFA) are far more likely to receive an A-line and far more likely to die, so omitting these would leave the strongest confounding pathway open.icu_los_day— ICU length of stay.sepsis_flg— whether the patient was septic, a strong indication for invasive monitoring.
treatment = "aline_flg"
outcome = "day_28_flg"
covariates = ["age", "gender_num", "sapsi_first", "sofa_first", "weight_first", "bmi", "icu_los_day", "sepsis_flg"]
Before fitting anything: check the assumptions#
Every causal estimator below relies on three identifiability assumptions. Two of them — positivity and no unmeasured confounding — can be partially probed with the data; one — consistency — cannot. Looking at these before fitting any estimator is good hygiene: if positivity fails, the point estimate you get out the other side is not interpretable no matter how well the rest of the pipeline behaves.
The three identifiability assumptions
Consistency. The observed outcome under treatment T = t equals the potential outcome Y(t).
This rules out interference between patients and version-of-treatment ambiguity.
Not testable from the data.
No unmeasured confounding. Conditional on the adjustment set X, treatment is independent of the potential outcomes.
This is the assumption the adjustment set is supposed to enforce; we can only justify it from domain knowledge.
Positivity (or overlap, or common support).
For every value of X that occurs in the cohort, the propensity P(T = 1 | X) lies strictly between 0 and 1.
A patient who must receive the treatment (or must not) contributes no counterfactual information; estimators that pretend otherwise inflate variance and bias.
This can be checked from the data and is what we do next.
Positivity#
positivity_check fits a propensity model e(X) = P(T = 1 | X) and reports the fraction of patients whose propensity sits inside the common-support interval [ε, 1 − ε].
Plotting the distributions of e(X) per arm makes any non-overlap immediately visible.
info = ep.tl.positivity_check(edata, treatment, covariates=covariates)
print(f"Support fraction: {info['support_fraction']:.3f}")
ep.pl.propensity_overlap(info)
Support fraction: 0.933
About 93% of patients have a propensity score in [0.05, 0.95] — the bulk of the cohort lies in the region where both arms are represented, so estimators should be reasonably stable.
Notice, however, the long right tail of the treated-arm propensity distribution running close to 1.0: a non-trivial subset of patients are almost certain to receive an A-line given their severity profile.
These observations will receive very large IPTW weights and we will need to revisit them if any single estimator behaves strangely later.
Covariate balance before adjustment#
The standardised mean difference (SMD) summarises how unevenly each covariate is distributed across the two arms.
Anything with |SMD| > 0.1 is the clinically conventional threshold for ‘imbalanced’.
Severe imbalance tells us how much work the adjustment is going to have to do.
bal = ep.tl.covariate_balance(edata, treatment, covariates=covariates)
bal.sort_values("smd_unweighted", key=abs, ascending=False)
| smd_unweighted | smd_weighted | var_ratio_unweighted | var_ratio_weighted | |
|---|---|---|---|---|
| sofa_first | 0.818499 | -0.127626 | 1.135252 | 0.470310 |
| icu_los_day | 0.714340 | -0.389266 | 4.480389 | 0.297693 |
| sapsi_first | 0.627175 | -0.088683 | 1.112275 | 0.900613 |
| age | 0.116635 | 0.011368 | 0.895828 | 1.092501 |
| weight_first | 0.071699 | 0.080477 | 1.023488 | 1.098264 |
| gender_num | 0.045530 | 0.144127 | 0.987032 | 0.975377 |
| bmi | -0.019164 | -0.080353 | 0.704852 | 0.632581 |
| sepsis_flg | 0.000000 | 0.000000 | NaN | NaN |
The pattern is exactly what we expected: SOFA, ICU LOS, and SAPS I have unweighted SMDs above 0.6, with treated patients sicker by every severity measure. The naïve mortality difference between groups is therefore largely driven by selection of A-lines toward sicker patients, not by the line itself. This is the indication-bias problem in EHR observational research in its purest form, and is what the next four estimators are designed to address.
Inverse probability of treatment weighting#
IPTW is the most direct response to indication bias. We re-weight every patient by the inverse probability that they received the treatment they actually received; in the resulting pseudo-population, the covariate distribution is the same in both arms, so a simple weighted mortality difference recovers the ATE.
How IPTW works
Fit e(X) = P(T = 1 | X).
Form weights w_i = T_i / e(X_i) + (1 − T_i) / (1 − e(X_i)).
By construction, E[T · 1{X = x} · w] = P(X = x) and the same for 1 − T, so the weighted sample mimics a randomised trial in expectation.
The Hájek difference of weighted outcome means estimates the ATE.
Two refinements matter in practice: stabilisation multiplies the weights by the marginal treatment probability P(T = 1) to shrink high-leverage observations without changing the target estimand, and clipping (typically to [0.01, 0.99]) caps the influence of near-extreme propensity scores.
Both are on by default.
est_iptw = ep.tl.iptw(edata, treatment, outcome, covariates=covariates, n_bootstrap=200, random_state=0)
print(est_iptw.summary())
Causal effect of 'aline_flg' on 'day_28_flg'
method: iptw_stabilized
ATE: -0.0493
SE: 0.0435
95% CI: [-0.1500, 0.0249]
n: 1776
The point estimate flips the naïve picture: after re-weighting, A-line patients have lower 28-day mortality by about 5 percentage points, with a 95% bootstrap CI that just crosses zero. That sign reversal is the signature of indication bias being resolved: in the raw cohort, sicker patients (more likely to be wired up) were also more likely to die; once we re-weight to match the covariate distributions, the line itself looks protective — though not unambiguously so.
Did the weights actually balance the design?
Re-running covariate_balance with the IPTW weights tells us.
w_full = np.full(edata.n_obs, np.nan)
w_full[edata.obs.index.get_indexer(est_iptw.params["index"])] = est_iptw.params["weights"]
bal_after = ep.tl.covariate_balance(edata, treatment, covariates=covariates, weights=w_full)
ep.pl.love_plot(bal_after)
All weighted SMDs are inside ±0.1, so the propensity model succeeded at the balancing task.
If any covariate had stayed above 0.1 after weighting we would re-examine the propensity model — either by adding interaction terms or by switching to a more flexible learner like gradient boosting.
G-computation#
IPTW asks ‘what would the marginal mortality look like if every patient had been assigned to each arm with equal probability?’.
G-computation answers the same question from the opposite direction: fit an outcome model μ(T, X) directly and average its counterfactual predictions.
The estimators are complementary — they fail in different ways — so it’s standard to run both and compare.
How g-computation works
Fit μ(T, X) = E[Y | T, X] on the observed data using any sklearn-compatible regressor.
For each row i, predict μ(1, X_i) and μ(0, X_i) — the model’s best guess for mortality if that patient had and had not received the treatment, holding their covariates fixed.
The ATE is the average of those individual contrasts: mean(μ(1, X)) − mean(μ(0, X)).
G-computation is more efficient than IPTW when the outcome model is correctly specified, but is biased if the model is misspecified outside the support of either arm. It pairs naturally with AIPW (below), which is robust to misspecification of either the propensity or outcome model.
est_g = ep.tl.g_computation(edata, treatment, outcome, covariates=covariates, n_bootstrap=200, random_state=0)
print(est_g.summary())
Causal effect of 'aline_flg' on 'day_28_flg'
method: g_computation
ATE: -0.0257
SE: 0.0175
95% CI: [-0.0579, 0.0075]
n: 1776
The g-computation ATE is about half the IPTW ATE. Both estimators agree on the sign — A-line is associated with lower mortality after adjustment — but their magnitudes differ. Disagreement of this size is normal and usually reflects how the two methods extrapolate differently in regions of poor overlap. It’s exactly the kind of methodological tension that AIPW is designed to resolve.
AIPW (doubly robust)#
AIPW combines the propensity and outcome models so that the resulting estimator is consistent if either one is correct, not necessarily both. When both are correct it attains the semi-parametric efficiency bound, which is why it’s the default workhorse in modern causal pipelines.
The AIPW influence function
For every observation define
ψ_i = μ_1(X_i) − μ_0(X_i)
+ (T_i / e_i) (Y_i − μ_1(X_i))
− ((1 − T_i) / (1 − e_i)) (Y_i − μ_0(X_i))
The first row is the g-computation contrast.
The next two rows are weighted residual corrections: they add back the part of the outcome that the model failed to predict, weighted by how unlikely the observed treatment was.
When the outcome model is wrong, the residual terms absorb the bias; when the propensity is wrong, the g-formula does.
The empirical standard error of ψ divided by √n is the analytic SE, no bootstrap required.
est_aipw = ep.tl.aipw(edata, treatment, outcome, covariates=covariates)
print(est_aipw.summary())
Causal effect of 'aline_flg' on 'day_28_flg'
method: aipw
ATE: -0.0347
SE: 0.0444
95% CI: [-0.1216, 0.0522]
n: 1776
AIPW lands in between IPTW and g-computation as expected. Its analytic SE is comparable to the bootstrap SEs we got from the other two estimators, so we’re not paying a precision penalty for the doubly-robust guarantee. If a future analyst replaces the propensity model with something fancier (a stacked learner, a tree ensemble) AIPW will protect us against propensity misspecification, and if they replace the outcome model it will protect against outcome misspecification — that’s the whole point.
Propensity score matching#
Matching is the most intuitive of the four methods and historically the dominant one in clinical epidemiology. Each treated patient is paired with the closest untreated patient in propensity score space, and the effect is the mean within-pair outcome difference.
How matching pairs treated and control patients
Fit a propensity model and transform every patient’s propensity to the logit scale (because logit-distances are more linear in the propensity than raw-probability distances).
For each treated patient, find the k untreated patients with the smallest absolute logit-propensity distance — with optional caliper (max-distance cutoff) and replacement (whether a control patient can be reused).
Treated patients whose nearest match falls outside the caliper are dropped, which is the price of guaranteeing every retained pair is genuinely comparable.
With target="att", the estimand is the average effect among the treated; with target="ate", controls are also matched to treated patients to recover the population-average effect.
est_psm = ep.tl.propensity_score_matching(edata, treatment, outcome, covariates=covariates, n_bootstrap=200, random_state=0)
print(est_psm.summary())
Causal effect of 'aline_flg' on 'day_28_flg'
method: propensity_score_matching_att
ATE: -0.0352
SE: 0.0359
95% CI: [-0.1406, 0.0138]
n: 1776
Matching gives a point estimate close to AIPW.
The matched-pair count is reported in est_psm.params["matches"]["n_matched_pairs"] — worth checking that we haven’t dropped too many treated patients due to the caliper, especially in EHR cohorts where extreme-propensity patients exist.
Side-by-side comparison#
Putting the four estimators on one axis is the clinical-epidemiology equivalent of a sensitivity analysis: if a finding is robust, it should not depend on which estimator we chose to run.
ep.pl.causal_effect(
est_aipw,
other={"iptw": est_iptw, "g-comp": est_g, "psm": est_psm},
)
All four point estimates fall between roughly −0.05 and −0.03 — a small protective association of A-line placement on 28-day mortality, with the IPTW estimate slightly more negative than the others.
Every confidence interval crosses zero; we can’t claim statistical significance from this dataset alone.
The sign reversal relative to the naïve +0.03 mortality gap is, however, the headline result.
Two sober takeaways:
Severity confounding in MIMIC-II is large — large enough to flip the apparent direction of the effect. Any observational analysis on this dataset that does not adjust for SOFA/SAPS/ICU LOS will get the sign wrong.
A 1,776-patient cohort is too small to definitively answer this question. The point estimate is consistent with a small protective effect, but the CIs do not exclude no effect. The published RCT literature on indwelling arterial lines reaches the same equivocal conclusion.
Where to go from here#
We’ve estimated the average effect over the entire cohort. If A-line placement helps sicker patients more (or only sicker patients) and hurts milder patients, that pattern is invisible in the ATE. The companion notebook, Heterogeneous treatment effects on MIMIC-II, uses meta-learners to estimate the per-patient effect and asks whether the answer depends on baseline severity.