MIMIC-II Treatment Effect Estimation with ehrapy

MIMIC-II Treatment Effect Estimation with ehrapy#

Here, we estimate treatment effects using survival analysis techniques.

Case Study Data#

This tutorial explores the MIMIC-II IAC dataset. It was created for the purpose of a case study in the book: Secondary Analysis of Electronic Health Records, published by Springer in 2016. In particular, the dataset was used throughout Chapter 16 (Data Analysis) by Raffa J. et al. to investigate the effectiveness of indwelling arterial catheters in hemodynamically stable patients with respiratory failure for mortality outcomes. The dataset is derived from MIMIC-II, the publicly accessible critical care database. It contains summary clinical data and outcomes for 1,776 patients.

References:

[1] Critical Data, M.I.T., 2016. Secondary analysis of electronic health records (p. 244). Springer Nature. (https://link.springer.com/book/10.1007/978-3-319-43742-2)

[2] https://github.com/MIT-LCP/critical-data-book/tree/master/part_ii/chapter_16/jupyter

import ehrapy as ep
import ehrdata as ed
import pandas as pd
edata = ed.dt.mimic_2()
! File ehrapy_data/ehrapy_mimic2.csv already exists! Using already downloaded dataset...
ed.infer_feature_types(edata)
ed.replace_feature_types(edata, ["day_icu_intime_num", "hour_icu_intime"], "numeric")
! Features 'aline_flg', 'gender_num', 'service_num', 'day_icu_intime_num', 'hour_icu_intime', 'hosp_exp_flg', 'icu_exp_flg', 'day_28_flg', 'censor_flg', 'sepsis_flg', 'chf_flg', 'afib_flg', 'renal_flg', 'liver_flg', 'copd_flg', 'cad_flg', 'stroke_flg', 'mal_flg', 'resp_flg' were detected as categorical features stored numerically.Please verify and adjust if necessary using `ed.replace_feature_types`.
 Detected feature types for EHRData object with 1776 obs and 46 vars
╠══ πŸ“… Date features
╠══ πŸ“ Numerical features
β•‘   ╠══ abg_count
β•‘   ╠══ age
β•‘   ╠══ bmi
β•‘   ╠══ bun_first
β•‘   ╠══ chloride_first
β•‘   ╠══ creatinine_first
β•‘   ╠══ hgb_first
β•‘   ╠══ hospital_los_day
β•‘   ╠══ hr_1st
β•‘   ╠══ icu_los_day
β•‘   ╠══ iv_day_1
β•‘   ╠══ map_1st
β•‘   ╠══ mort_day_censored
β•‘   ╠══ pco2_first
β•‘   ╠══ platelet_first
β•‘   ╠══ po2_first
β•‘   ╠══ potassium_first
β•‘   ╠══ sapsi_first
β•‘   ╠══ sodium_first
β•‘   ╠══ sofa_first
β•‘   ╠══ spo2_1st
β•‘   ╠══ tco2_first
β•‘   ╠══ temp_1st
β•‘   ╠══ wbc_first
β•‘   β•šβ•β• weight_first
β•šβ•β• πŸ—‚οΈ Categorical features
    ╠══ afib_flg (2 categories)
    ╠══ aline_flg (2 categories)
    ╠══ cad_flg (2 categories)
    ╠══ censor_flg (2 categories)
    ╠══ chf_flg (2 categories)
    ╠══ copd_flg (2 categories)
    ╠══ day_28_flg (2 categories)
    ╠══ day_icu_intime (7 categories)
    ╠══ day_icu_intime_num (7 categories)
    ╠══ gender_num (2 categories)
    ╠══ hosp_exp_flg (2 categories)
    ╠══ hour_icu_intime (24 categories)
    ╠══ icu_exp_flg (2 categories)
    ╠══ liver_flg (2 categories)
    ╠══ mal_flg (2 categories)
    ╠══ renal_flg (2 categories)
    ╠══ resp_flg (2 categories)
    ╠══ sepsis_flg (1 categories)
    ╠══ service_num (2 categories)
    ╠══ service_unit (3 categories)
    β•šβ•β• stroke_flg (2 categories)
continuous_vars = edata.var.index[edata.var["feature_type"] == "numeric"]
categorical_vars = edata.var.index[edata.var["feature_type"] == "categorical"]

continuous_vars = continuous_vars.tolist()
categorical_vars = categorical_vars.tolist()
all_vars = continuous_vars + categorical_vars

Case Study and Summary#

Our Objective for this case study is as follows:


To estimate the effect that administration of IAC (Indwelling Arterial Catheter) during an ICU (Intensive Care Unit) admission has on 28 day mortality in patients without sepsis who received mechanical ventilation within MIMIC II, while adjusting for age, gender, severity of illness and comorbidities.



We will also not want to include the sepsis_flg variable as a covariate in any of our models, as there are no patients with sepsis within this study to estimate the effect of sepsis.

Usually, we will want to see how the population differs for different values of the covariate of interest. In our case study, if the treated group (IAC) differed substantially from the untreated group (no IAC), then this may account for any effect we demonstrate. We can do this by summarizing the two groups using the tableone package.

import tableone

table = tableone.TableOne(edata.to_df(), all_vars, categorical=categorical_vars, groupby="aline_flg")
table.tabulate(tablefmt="html")
Missing Overall 0 1
n 1776 792 984
icu_los_day, mean (SD) 0 3.3 (3.4) 2.1 (1.9) 4.3 (3.9)
hospital_los_day, mean (SD) 0 8.1 (8.2) 5.4 (5.4) 10.3 (9.3)
age, mean (SD) 0 54.4 (21.1) 53.0 (21.7) 55.5 (20.5)
weight_first, mean (SD) 110 80.1 (22.5) 79.2 (22.6) 80.7 (22.4)
bmi, mean (SD) 466 27.8 (8.2) 28.0 (9.1) 27.7 (7.5)
sapsi_first, mean (SD) 85 14.1 (4.1) 12.7 (3.8) 15.2 (4.0)
sofa_first, mean (SD) 6 5.8 (2.3) 4.8 (2.1) 6.6 (2.2)
day_icu_intime_num, mean (SD) 0 4.1 (2.0) 4.0 (2.0) 4.1 (2.0)
hour_icu_intime, mean (SD) 0 10.6 (7.9) 9.9 (7.7) 11.2 (8.1)
mort_day_censored, mean (SD) 0 614.3 (403.1) 619.1 (388.3) 610.5 (414.8)
map_1st, mean (SD) 0 88.2 (17.6) 87.5 (15.9) 88.9 (18.8)
hr_1st, mean (SD) 0 87.9 (18.8) 88.4 (18.8) 87.5 (18.7)
temp_1st, mean (SD) 3 97.8 (4.5) 97.9 (3.8) 97.7 (5.1)
spo2_1st, mean (SD) 0 98.4 (5.5) 98.4 (5.7) 98.5 (5.4)
abg_count, mean (SD) 0 6.0 (8.7) 1.4 (1.6) 9.7 (10.2)
wbc_first, mean (SD) 8 12.3 (6.6) 11.7 (6.5) 12.8 (6.6)
hgb_first, mean (SD) 8 12.6 (2.2) 12.7 (2.2) 12.4 (2.2)
platelet_first, mean (SD) 8 246.1 (99.9) 254.3 (104.5) 239.5 (95.6)
sodium_first, mean (SD) 5 139.6 (4.7) 139.8 (4.8) 139.4 (4.7)
potassium_first, mean (SD) 5 4.1 (0.8) 4.1 (0.8) 4.1 (0.8)
tco2_first, mean (SD) 5 24.4 (5.0) 24.7 (4.9) 24.2 (5.1)
chloride_first, mean (SD) 5 103.8 (5.7) 103.3 (5.4) 104.3 (5.9)
bun_first, mean (SD) 5 19.3 (14.4) 18.9 (14.5) 19.6 (14.3)
creatinine_first, mean (SD) 6 1.1 (1.1) 1.1 (1.2) 1.1 (1.0)
po2_first, mean (SD) 186 227.6 (144.9) 223.8 (152.9) 230.1 (139.6)
pco2_first, mean (SD) 186 43.4 (14.0) 44.9 (15.9) 42.5 (12.5)
iv_day_1, mean (SD) 143 1622.9 (1677.1)1364.2 (1406.8)1808.4 (1825.0)
aline_flg, n (%) 0 792 (44.6) 792 (100.0)
1 984 (55.4) 984 (100.0)
gender_num, n (%) 0.0 750 (42.2) 344 (43.4) 406 (41.3)
1.0 1025 (57.7) 447 (56.4) 578 (58.7)
None 1 (0.1) 1 (0.1)
service_unit, n (%) FICU 62 (3.5) 24 (3.0) 38 (3.9)
MICU 732 (41.2) 480 (60.6) 252 (25.6)
SICU 982 (55.3) 288 (36.4) 694 (70.5)
service_num, n (%) 0 794 (44.7) 504 (63.6) 290 (29.5)
1 982 (55.3) 288 (36.4) 694 (70.5)
day_icu_intime, n (%) Friday 234 (13.2) 87 (11.0) 147 (14.9)
Monday 260 (14.6) 116 (14.6) 144 (14.6)
Saturday 278 (15.7) 138 (17.4) 140 (14.2)
Sunday 230 (13.0) 114 (14.4) 116 (11.8)
Thursday 261 (14.7) 110 (13.9) 151 (15.3)
Tuesday 257 (14.5) 118 (14.9) 139 (14.1)
Wednesday 256 (14.4) 109 (13.8) 147 (14.9)
hosp_exp_flg, n (%) 0 1532 (86.3) 702 (88.6) 830 (84.3)
1 244 (13.7) 90 (11.4) 154 (15.7)
icu_exp_flg, n (%) 0 1606 (90.4) 734 (92.7) 872 (88.6)
1 170 (9.6) 58 (7.3) 112 (11.4)
day_28_flg, n (%) 0 1493 (84.1) 679 (85.7) 814 (82.7)
1 283 (15.9) 113 (14.3) 170 (17.3)
censor_flg, n (%) 0 497 (28.0) 213 (26.9) 284 (28.9)
1 1279 (72.0) 579 (73.1) 700 (71.1)
sepsis_flg, n (%) 0 1776 (100.0) 792 (100.0) 984 (100.0)
chf_flg, n (%) 0 1563 (88.0) 695 (87.8) 868 (88.2)
1 213 (12.0) 97 (12.2) 116 (11.8)
afib_flg, n (%) 0 1569 (88.3) 710 (89.6) 859 (87.3)
1 207 (11.7) 82 (10.4) 125 (12.7)
renal_flg, n (%) 0 1716 (96.6) 764 (96.5) 952 (96.7)
1 60 (3.4) 28 (3.5) 32 (3.3)
liver_flg, n (%) 0 1677 (94.4) 754 (95.2) 923 (93.8)
1 99 (5.6) 38 (4.8) 61 (6.2)
copd_flg, n (%) 0 1619 (91.2) 711 (89.8) 908 (92.3)
1 157 (8.8) 81 (10.2) 76 (7.7)
cad_flg, n (%) 0 1653 (93.1) 741 (93.6) 912 (92.7)
1 123 (6.9) 51 (6.4) 72 (7.3)
stroke_flg, n (%) 0 1554 (87.5) 722 (91.2) 832 (84.6)
1 222 (12.5) 70 (8.8) 152 (15.4)
mal_flg, n (%) 0 1520 (85.6) 700 (88.4) 820 (83.3)
1 256 (14.4) 92 (11.6) 164 (16.7)
resp_flg, n (%) 0 1211 (68.2) 514 (64.9) 697 (70.8)
1 565 (31.8) 278 (35.1) 287 (29.2)

The IAC group differs in many respects to the non-IAC group. Patients who were given IAC tended to have higher severity of illness at baseline (sapsi_first and sofa_first), slightly older, less likely to be from the MICU, and have slightly different comorbidity profiles when compared to the non-IAC group.

Next, we can see how the covariates are distributed among the different outcomes (death within 28 days versus alive at 28 days). This will give us an idea of which covariates may be important for affecting the outcome.

table = tableone.TableOne(edata.to_df(), all_vars, categorical=categorical_vars, groupby="day_28_flg")
table.tabulate(tablefmt="html")
Missing Overall 0 1
n 1776 1493 283
icu_los_day, mean (SD) 0 3.3 (3.4) 3.2 (3.2) 4.0 (4.0)
hospital_los_day, mean (SD) 0 8.1 (8.2) 8.4 (8.4) 6.4 (6.4)
age, mean (SD) 0 54.4 (21.1) 50.8 (20.1) 73.3 (15.3)
weight_first, mean (SD) 110 80.1 (22.5) 81.4 (22.7) 72.4 (19.9)
bmi, mean (SD) 466 27.8 (8.2) 28.2 (8.3) 26.0 (7.2)
sapsi_first, mean (SD) 85 14.1 (4.1) 13.6 (3.9) 17.3 (3.8)
sofa_first, mean (SD) 6 5.8 (2.3) 5.7 (2.3) 6.6 (2.4)
day_icu_intime_num, mean (SD) 0 4.1 (2.0) 4.0 (2.0) 4.1 (2.0)
hour_icu_intime, mean (SD) 0 10.6 (7.9) 10.5 (7.9) 11.0 (8.0)
mort_day_censored, mean (SD) 0 614.3 (403.1) 729.6 (331.4) 6.1 (6.4)
map_1st, mean (SD) 0 88.2 (17.6) 88.2 (17.5) 88.3 (17.9)
hr_1st, mean (SD) 0 87.9 (18.8) 88.3 (18.4) 85.8 (20.6)
temp_1st, mean (SD) 3 97.8 (4.5) 97.8 (4.6) 97.7 (4.5)
spo2_1st, mean (SD) 0 98.4 (5.5) 98.6 (5.0) 97.8 (7.6)
abg_count, mean (SD) 0 6.0 (8.7) 5.7 (7.7) 7.5 (12.5)
wbc_first, mean (SD) 8 12.3 (6.6) 12.2 (6.4) 12.7 (7.5)
hgb_first, mean (SD) 8 12.6 (2.2) 12.7 (2.2) 11.9 (2.1)
platelet_first, mean (SD) 8 246.1 (99.9) 246.8 (97.3) 242.1 (112.6)
sodium_first, mean (SD) 5 139.6 (4.7) 139.6 (4.6) 139.1 (5.4)
potassium_first, mean (SD) 5 4.1 (0.8) 4.1 (0.8) 4.2 (0.9)
tco2_first, mean (SD) 5 24.4 (5.0) 24.3 (4.8) 25.0 (5.8)
chloride_first, mean (SD) 5 103.8 (5.7) 104.1 (5.6) 102.6 (6.4)
bun_first, mean (SD) 5 19.3 (14.4) 18.0 (12.9) 26.2 (19.0)
creatinine_first, mean (SD) 6 1.1 (1.1) 1.1 (1.1) 1.2 (0.9)
po2_first, mean (SD) 186 227.6 (144.9) 231.3 (146.3) 207.9 (135.8)
pco2_first, mean (SD) 186 43.4 (14.0) 43.3 (12.9) 43.8 (18.6)
iv_day_1, mean (SD) 143 1622.9 (1677.1)1694.2 (1709.5)1258.0 (1449.4)
aline_flg, n (%) 0 792 (44.6) 679 (45.5) 113 (39.9)
1 984 (55.4) 814 (54.5) 170 (60.1)
gender_num, n (%) 0.0 750 (42.2) 606 (40.6) 144 (50.9)
1.0 1025 (57.7) 886 (59.3) 139 (49.1)
None 1 (0.1) 1 (0.1)
service_unit, n (%) FICU 62 (3.5) 59 (4.0) 3 (1.1)
MICU 732 (41.2) 605 (40.5) 127 (44.9)
SICU 982 (55.3) 829 (55.5) 153 (54.1)
service_num, n (%) 0 794 (44.7) 664 (44.5) 130 (45.9)
1 982 (55.3) 829 (55.5) 153 (54.1)
day_icu_intime, n (%) Friday 234 (13.2) 192 (12.9) 42 (14.8)
Monday 260 (14.6) 221 (14.8) 39 (13.8)
Saturday 278 (15.7) 235 (15.7) 43 (15.2)
Sunday 230 (13.0) 198 (13.3) 32 (11.3)
Thursday 261 (14.7) 223 (14.9) 38 (13.4)
Tuesday 257 (14.5) 211 (14.1) 46 (16.3)
Wednesday 256 (14.4) 213 (14.3) 43 (15.2)
hosp_exp_flg, n (%) 0 1532 (86.3) 1490 (99.8) 42 (14.8)
1 244 (13.7) 3 (0.2) 241 (85.2)
icu_exp_flg, n (%) 0 1606 (90.4) 1493 (100.0) 113 (39.9)
1 170 (9.6) 170 (60.1)
day_28_flg, n (%) 0 1493 (84.1) 1493 (100.0)
1 283 (15.9) 283 (100.0)
censor_flg, n (%) 0 497 (28.0) 214 (14.3) 283 (100.0)
1 1279 (72.0) 1279 (85.7)
sepsis_flg, n (%) 0 1776 (100.0) 1493 (100.0) 283 (100.0)
chf_flg, n (%) 0 1563 (88.0) 1348 (90.3) 215 (76.0)
1 213 (12.0) 145 (9.7) 68 (24.0)
afib_flg, n (%) 0 1569 (88.3) 1372 (91.9) 197 (69.6)
1 207 (11.7) 121 (8.1) 86 (30.4)
renal_flg, n (%) 0 1716 (96.6) 1447 (96.9) 269 (95.1)
1 60 (3.4) 46 (3.1) 14 (4.9)
liver_flg, n (%) 0 1677 (94.4) 1413 (94.6) 264 (93.3)
1 99 (5.6) 80 (5.4) 19 (6.7)
copd_flg, n (%) 0 1619 (91.2) 1377 (92.2) 242 (85.5)
1 157 (8.8) 116 (7.8) 41 (14.5)
cad_flg, n (%) 0 1653 (93.1) 1403 (94.0) 250 (88.3)
1 123 (6.9) 90 (6.0) 33 (11.7)
stroke_flg, n (%) 0 1554 (87.5) 1386 (92.8) 168 (59.4)
1 222 (12.5) 107 (7.2) 115 (40.6)
mal_flg, n (%) 0 1520 (85.6) 1294 (86.7) 226 (79.9)
1 256 (14.4) 199 (13.3) 57 (20.1)
resp_flg, n (%) 0 1211 (68.2) 1056 (70.7) 155 (54.8)
1 565 (31.8) 437 (29.3) 128 (45.2)

Of the 984 subjects receiving IAC, 170 (17.2 %) died within 28 days, whereas 113 of 792 (14.2 %) died in the no-IAC group. In a univariate analysis we can assess if the lower rate of mortality is statistically significant, by fitting a single covariate aline_flg logistic regression.

To do this, we have to pay attention to our feature types. The target of the GLM model has to be numerical (https://en.wikipedia.org/wiki/Generalized_linear_model), so we have to convert the day_28_flg to a numerical value.

ep.ad.replace_feature_types(edata, ["day_28_flg"], "numeric")
glm_model = ep.tl.glm(
    edata,
    formula="day_28_flg ~ aline_flg",
    var_names=["day_28_flg", "aline_flg"],
    family="Binomial",
    use_feature_types=True,
)
res = glm_model.fit()
res.summary()
Generalized Linear Model Regression Results
Dep. Variable: day_28_flg No. Observations: 1776
Model: GLM Df Residuals: 1774
Model Family: Binomial Df Model: 1
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -777.43
Date: Fri, 05 Sep 2025 Deviance: 1554.9
Time: 17:16:58 Pearson chi2: 1.78e+03
No. Iterations: 4 Pseudo R-squ. (CS): 0.001680
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept -1.7932 0.102 -17.650 0.000 -1.992 -1.594
aline_flg[T.1] 0.2271 0.132 1.720 0.085 -0.032 0.486

Indeed, the p-value for aline_flg is about 0.09. There are likely several important covariates that differed among those who received IAC and those who did not. These may serve as confounders, and the possible association we observed in the univariate analysis may be stronger, nonexistent or in the opposite direction (i.e., IAC having lower rates of mortality) depending on the situation. Our next step would be to adjust for these confounders. A common approach is to fit all univariate models (one covariate at a time, as we did with aline_flg, but separately for each covariate and without aline_flg), and perform a hypothesis test on each model. Any variables which had statistical significance under the univariate models would then be included in a multivariable model. Another approach begins with the model we just fit (which only has aline_flg as a covariate), and then sequentially adds variables one at a time. This approach is often called stepwise forward selection.

In our stepwise backwards elimination procedure, we are going to fit a full model and eliminate one variable at a time, until we are left with a model with only statistically significant covariates.

Because aline_flg is the covariate of interest, it will remain in the model regardless of its statistical significance. At each step, we need to define a criterion for eliminating variables. One approach is to perform a hypothesis test for each covariate and remove the covariate with the largest p-value, unless all p-values are < 0.05 or the largest p-value corresponds to aline_flg.

Most covariates are binary or categorical, and we’ve already converted them to factors. The disease severity scores (SAPS and SOFA) are continuous. Adding them directly assumes a linear trend in mortality odds as scores change, which may not be appropriate. To address this, we can discretize the scores into quantiles using cut2.

We will apply this function to discretize the sapsi_first and sofa_first scores into quantiles. These discretized scores can then be added to the survival model to better capture nonlinear relationships with mortality risk.

def cut2(edata: ed.EHRData, column_name: str, bins: int = 5):
    """Cut a continuous variable into bins and add it as a new categorical variable to the EHRData object."""
    created_col_name = f"{column_name}_{bins}_bins"
    if created_col_name in edata.obs.columns:
        return edata
    df = ed.io.to_pandas(edata)
    x = df[column_name]

    # Cut the data into bins
    bin_labels = [f"{column_name}_bin_{i + 1}" for i in range(bins)]
    binned = pd.qcut(x, q=bins, labels=bin_labels)
    binned.name = created_col_name

    df = pd.concat([df, binned], axis=1)

    new_obs = edata.obs.copy()
    new_obs = pd.concat([new_obs, binned], axis=1)

    new_var = edata.var.copy()
    new_var.loc[created_col_name] = "categorical"

    new_edata = ed.EHRData(
        X=df.values,
        obs=new_obs,
        var=new_var,
    )

    return new_edata
cut2_edata = cut2(edata, "sapsi_first")
cut2_edata = cut2(cut2_edata, "sofa_first")
cut2_edata.var.index
Index(['aline_flg', 'icu_los_day', 'hospital_los_day', 'age', 'gender_num',
       'weight_first', 'bmi', 'sapsi_first', 'sofa_first', 'service_unit',
       'service_num', 'day_icu_intime', 'day_icu_intime_num',
       'hour_icu_intime', 'hosp_exp_flg', 'icu_exp_flg', 'day_28_flg',
       'mort_day_censored', 'censor_flg', 'sepsis_flg', 'chf_flg', 'afib_flg',
       'renal_flg', 'liver_flg', 'copd_flg', 'cad_flg', 'stroke_flg',
       'mal_flg', 'resp_flg', 'map_1st', 'hr_1st', 'temp_1st', 'spo2_1st',
       'abg_count', 'wbc_first', 'hgb_first', 'platelet_first', 'sodium_first',
       'potassium_first', 'tco2_first', 'chloride_first', 'bun_first',
       'creatinine_first', 'po2_first', 'pco2_first', 'iv_day_1',
       'sapsi_first_5_bins', 'sofa_first_5_bins'],
      dtype='object')

We added a new column in the EHRData object for each of the sapsi_first and sofa_first bins. Now every observation has a value for sapsi_first_bin and sofa_first_bin which is the quartile that the observation falls into. We can now add these to the model, instead of the continuous sapsi_first and sofa_first scores.

dependent_var = "day_28_flg"
not_dependent_vars = [
    "aline_flg",
    "age",
    "gender_num",
    "sapsi_first_5_bins",
    "sofa_first_5_bins",
    "service_unit",
    "chf_flg",
    "afib_flg",
    "renal_flg",
    "liver_flg",
    "copd_flg",
    "cad_flg",
    "stroke_flg",
    "mal_flg",
    "resp_flg",
]
formula = f"{dependent_var} ~ {' + '.join(not_dependent_vars)}"
var_names = not_dependent_vars + [dependent_var]
full_model = ep.tl.glm(
    cut2_edata,
    var_names,
    formula,
    missing="drop",
    family="Binomial",
    use_feature_types=True,
)
full_model_result = full_model.fit()
full_model_result.summary()
Generalized Linear Model Regression Results
Dep. Variable: day_28_flg No. Observations: 1684
Model: GLM Df Residuals: 1661
Model Family: Binomial Df Model: 22
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -479.06
Date: Fri, 05 Sep 2025 Deviance: 958.12
Time: 17:16:58 Pearson chi2: 1.35e+03
No. Iterations: 7 Pseudo R-squ. (CS): 0.2311
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept -7.4832 0.849 -8.809 0.000 -9.148 -5.818
aline_flg[T.1] 0.0327 0.204 0.160 0.873 -0.367 0.433
gender_num[T.1.0] 0.1422 0.172 0.824 0.410 -0.196 0.480
sapsi_first_5_bins[T.sapsi_first_bin_2] 0.0390 0.342 0.114 0.909 -0.631 0.709
sapsi_first_5_bins[T.sapsi_first_bin_3] 0.6417 0.293 2.190 0.028 0.067 1.216
sapsi_first_5_bins[T.sapsi_first_bin_4] 0.5174 0.293 1.769 0.077 -0.056 1.091
sapsi_first_5_bins[T.sapsi_first_bin_5] 1.3695 0.295 4.637 0.000 0.791 1.948
sofa_first_5_bins[T.sofa_first_bin_2] 0.5223 0.302 1.729 0.084 -0.070 1.114
sofa_first_5_bins[T.sofa_first_bin_3] 0.6462 0.301 2.144 0.032 0.055 1.237
sofa_first_5_bins[T.sofa_first_bin_4] 0.7943 0.290 2.735 0.006 0.225 1.363
sofa_first_5_bins[T.sofa_first_bin_5] 0.8867 0.336 2.639 0.008 0.228 1.545
service_unit[T.MICU] 1.0609 0.681 1.559 0.119 -0.273 2.395
service_unit[T.SICU] 0.6666 0.674 0.990 0.322 -0.654 1.987
chf_flg[T.1] 0.2347 0.234 1.005 0.315 -0.223 0.693
afib_flg[T.1] 0.5544 0.211 2.628 0.009 0.141 0.968
renal_flg[T.1] -0.7834 0.409 -1.917 0.055 -1.584 0.018
liver_flg[T.1] 0.4643 0.341 1.360 0.174 -0.205 1.133
copd_flg[T.1] 0.2533 0.244 1.037 0.300 -0.225 0.732
cad_flg[T.1] -0.2444 0.288 -0.849 0.396 -0.808 0.319
stroke_flg[T.1] 2.0563 0.220 9.348 0.000 1.625 2.487
mal_flg[T.1] 0.4879 0.209 2.334 0.020 0.078 0.898
resp_flg[T.1] 0.7172 0.192 3.735 0.000 0.341 1.094
age 0.0425 0.006 6.841 0.000 0.030 0.055

Some of the covariates are statistically, while others may be expendable. Ideally, we would like as simple of a model as possible that can explain as much of the variation in the outcome as possible. We will attempt to remove our first covariate by the procedure we outlined above. For each of the variables we consider removing, we could fit a logistic regression model without that covariate, and then test it against the current model.

The drop1 function will take the data, the dependent variable and a list of independent variables and return the p-value for each variable. We can then choose to remove the variable with the largest p-value, this process would then be repeated until all p-values are less than 0.05. As this is an example, we will only do one iteration.

from scipy.stats import chi2


def drop1(
    edata: ed.EHRData,
    dependent_var: str,
    not_dependent_vars: list[str],
    missing: str = "drop",
):
    """Python implementation of R's drop1 function using ehrapy.tl.glm.

    Args:
        edata: The EHRData object for the OLS model.
        dependent_var: The name of the dependent variable.
        independent_vars: A list of independent variables.
        missing: How to handle missing data. Default is "drop".

    Returns:
        pd.DataFrame: A DataFrame containing AIC, BIC, and LRT p-values for each model with one term dropped.
    """
    # Fit the full model
    formula = f"{dependent_var} ~ {' + '.join(not_dependent_vars)}"
    var_names = not_dependent_vars + [dependent_var]
    full_model = ep.tl.glm(
        edata,
        var_names=var_names,
        formula=formula,
        missing=missing,
        family="Binomial",
        use_feature_types=True,
    )
    full_model_result = full_model.fit()

    results = []

    # Drop one term at a time and compare models
    for term in not_dependent_vars:
        reduced_formula = (
            f"{formula.split('~')[0].strip()} ~ {' + '.join([t for t in not_dependent_vars if t != term])}"
        )
        reduced_model = ep.tl.glm(
            edata,
            var_names=var_names,
            formula=reduced_formula,
            missing=missing,
            family="Binomial",
            use_feature_types=True,
        )
        reduced_model_result = reduced_model.fit()

        df = full_model.df_model - reduced_model.df_model
        aic = reduced_model_result.aic
        deviance = reduced_model_result.deviance

        lrt_stat = -2 * (reduced_model_result.llf - full_model_result.llf)
        lrt_pvalue = chi2.sf(lrt_stat, df=df)  # p-value using chi-squared distribution

        results.append(
            {
                "Dropped Term": term,
                "Deviance": deviance,
                "AIC": aic,
                "LRT Statistic": lrt_stat,
                "LRT p-value": lrt_pvalue,
            }
        )

    results_df = pd.DataFrame(results)
    return results_df
drop1(cut2_edata, dependent_var, not_dependent_vars, missing="drop")
Dropped Term Deviance AIC LRT Statistic LRT p-value
0 aline_flg 958.146491 1002.146491 0.025627 8.728142e-01
1 age 1010.700508 1054.700508 52.579644 4.131493e-13
2 gender_num 958.832055 1002.832055 0.711191 3.990487e-01
3 sapsi_first_5_bins 1069.795683 1107.795683 111.674819 3.197094e-23
4 sofa_first_5_bins 978.050947 1016.050947 19.930083 5.155226e-04
5 service_unit 963.069661 1005.069661 4.948797 8.421362e-02
6 chf_flg 959.122961 1003.122961 1.002098 3.168034e-01
7 afib_flg 964.939191 1008.939191 6.818327 9.022704e-03
8 renal_flg 962.106642 1006.106642 3.985778 4.588591e-02
9 liver_flg 959.886794 1003.886794 1.765931 1.838865e-01
10 copd_flg 959.180901 1003.180901 1.060038 3.032066e-01
11 cad_flg 958.857519 1002.857519 0.736655 3.907345e-01
12 stroke_flg 1049.584017 1093.584017 91.463154 1.136895e-21
13 mal_flg 963.410697 1007.410697 5.289834 2.145027e-02
14 resp_flg 972.184704 1016.184704 14.063840 1.767086e-04

aline_flg has the largest p-value, but we stipulated in our model selection plan that we would retain this covariate as it’s our covariate of interest. We will then go to the next largest p-value which is the gender_num variable. From here we can update our model, and repeat the backwards elimination step on the updated model.

reduced_vars = [ind_var for ind_var in not_dependent_vars if ind_var != "gender_num"]
drop1(cut2_edata, dependent_var, reduced_vars, missing="drop")
Dropped Term Deviance AIC LRT Statistic LRT p-value
0 aline_flg 958.853545 1000.853545 0.021491 8.834499e-01
1 age 1010.759880 1052.759880 51.927825 5.757865e-13
2 sapsi_first_5_bins 1070.187660 1106.187660 111.355606 3.739818e-23
3 sofa_first_5_bins 979.692198 1015.692198 20.860144 3.375408e-04
4 service_unit 963.599162 1003.599162 4.767108 9.222224e-02
5 chf_flg 959.775853 1001.775853 0.943799 3.313029e-01
6 afib_flg 965.669262 1007.669262 6.837207 8.927832e-03
7 renal_flg 962.629266 1004.629266 3.797212 5.133801e-02
8 liver_flg 960.558143 1002.558143 1.726089 1.889112e-01
9 copd_flg 959.933978 1001.933978 1.101923 2.938444e-01
10 cad_flg 959.490514 1001.490514 0.658459 4.171044e-01
11 stroke_flg 1049.620719 1091.620719 90.788664 1.598666e-21
12 mal_flg 964.027821 1006.027821 5.195767 2.264197e-02
13 resp_flg 972.926783 1014.926783 14.094729 1.738302e-04

Conclusion#

We see that aline_flg still has the largest p-value, but cad_flg has the second largest, so we would choose to remove it next.