
MIMIC-II IAC Introduction

This tutorial explores the Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) Indwelling Arterial Catheters (IAC) dataset, a subset derived from MIMIC-II, the publicly accessible critical care database. The dataset was created for a case study in the book “Secondary Analysis of Electronic Health Records”, published by Springer in 2016. In particular, the MIMIC-II IAC dataset was used throughout Chapter 16 (Data Analysis) by Raffa J. et al. to investigate the effect of indwelling arterial catheters on mortality outcomes in hemodynamically stable patients with respiratory failure.

More details on the dataset such as all included features and their description can be found here.

[56]:
from IPython.display import Image

Image(filename="images/MIMIC-II-database-structure.png", width=400)
[56]:
[figure: MIMIC-II database structure]

In this tutorial we want to explore the MIMIC-II IAC dataset using ehrapy to identify patient groups and their associated features.

The major steps of an analysis with ehrapy include:

  1. Preprocessing and quality control (QC)

  2. Dimensionality reduction

  3. Batch effect identification

  4. Clustering

  5. Additional downstream analysis

Before we start with the analysis of the MIMIC-II IAC dataset, we set up our environment including the import of packages and preparation of the dataset.


Environment setup

Ensure that the latest version of ehrapy is installed. A list of all dependency versions can be found at the end of this tutorial.

[57]:
import warnings
warnings.filterwarnings("ignore")

import ehrapy as ep
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rcParams

MIMIC-II IAC dataset loading

ehrapy offers several datasets in AnnData format that can be used out of the box. In this tutorial we use the MIMIC-II IAC dataset with unencoded features; if loaded with encoded features instead, ehrapy applies a simple one-hot encoding by default. More details on encoding follow in the next step.

[58]:
adata = ep.dt.mimic_2(encoded=False)
adata
[58]:
AnnData object with n_obs × n_vars = 1776 × 46
    var: 'ehrapy_column_type'
    layers: 'original'

The MIMIC-II IAC dataset contains 1776 patients with 46 features. Now that we have our AnnData object ready, we can start the analysis using ehrapy; the first step is preprocessing the dataset.
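Before preprocessing, it can help to look at the raw values as a table. A hedged sketch, assuming ehrapy's anndata_to_df helper converts an AnnData object into a pandas DataFrame:

# convert the AnnData matrix plus annotations into a DataFrame for quick inspection
df = ep.ad.anndata_to_df(adata)
df.head()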

Analysis using ehrapy

Preprocessing

The dataset contains 46 features, as previously mentioned. We distinguish between numerical, categorical, and date features. First, we need to clarify which feature falls into which category. ehrapy offers the method infer_feature_types for this, which guesses the type of each feature. It is important to always check that the guessed types are correct and, if not, to correct them.

[59]:
ep.ad.infer_feature_types(adata)
💡 Stored feature types in adata.var['feature_type']. Please verify and adjust if necessary using adata.var['feature_type']['feature1']='corrected_type'.
 Detected feature types for AnnData object with 1776 obs and 46 vars
╠══ 📅 Date features
╠══ 📐 Numerical features
║   ╠══ abg_count
║   ╠══ age
║   ╠══ bmi
║   ╠══ bun_first
║   ╠══ chloride_first
║   ╠══ creatinine_first
║   ╠══ hgb_first
║   ╠══ hospital_los_day
║   ╠══ hr_1st
║   ╠══ icu_los_day
║   ╠══ iv_day_1
║   ╠══ map_1st
║   ╠══ mort_day_censored
║   ╠══ pco2_first
║   ╠══ platelet_first
║   ╠══ po2_first
║   ╠══ potassium_first
║   ╠══ sapsi_first
║   ╠══ sodium_first
║   ╠══ sofa_first
║   ╠══ spo2_1st
║   ╠══ tco2_first
║   ╠══ temp_1st
║   ╠══ wbc_first
║   ╚══ weight_first
╚══ 🗂️ Categorical features
    ╠══ afib_flg (2 categories)
    ╠══ aline_flg (2 categories)
    ╠══ cad_flg (2 categories)
    ╠══ censor_flg (2 categories)
    ╠══ chf_flg (2 categories)
    ╠══ copd_flg (2 categories)
    ╠══ day_28_flg (2 categories)
    ╠══ day_icu_intime (7 categories)
    ╠══ day_icu_intime_num (7 categories)
    ╠══ gender_num (2 categories)
    ╠══ hosp_exp_flg (2 categories)
    ╠══ hour_icu_intime (24 categories)
    ╠══ icu_exp_flg (2 categories)
    ╠══ liver_flg (2 categories)
    ╠══ mal_flg (2 categories)
    ╠══ renal_flg (2 categories)
    ╠══ resp_flg (2 categories)
    ╠══ sepsis_flg (1 categories)
    ╠══ service_num (2 categories)
    ╠══ service_unit (3 categories)
    ╚══ stroke_flg (2 categories)

When checking the detected feature types above, we find that for this dataset, all features were detected correctly. If this is not the case for your dataset, you can correct a feature type using adata.var['feature_type']['feature1'] = 'correct_type'.
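For illustration, suppose hour_icu_intime had been misdetected as categorical although we want to treat it as numerical (a hypothetical scenario here); the correction could then look like this, using plain pandas indexing:

# hypothetical fix: overwrite the guessed type for a single feature
adata.var.loc["hour_icu_intime", "feature_type"] = "numeric"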

Categorical features may either already be stored numerically (e.g. as 0/1 flags) or as another type such as strings. Categorical features stored non-numerically need an encoding. We first identify those features in our dataset:

[60]:
adata.var[adata.var["ehrapy_column_type"] == "non_numeric"]
[60]:
ehrapy_column_type feature_type
service_unit non_numeric categorical
day_icu_intime non_numeric categorical

We identified service_unit and day_icu_intime as categorical features stored non-numerically. We will therefore encode them with one-hot encoding, which ensures that no artificial ordering is introduced for these features. ehrapy also offers other encoding functions (a sketch follows below).

[61]:
adata = ep.pp.encode(adata, encodings={"one-hot": ["service_unit", "day_icu_intime"]})
💡 Updated the original layer after encoding.
💡 The original categorical values `['service_unit', 'day_icu_intime']` were added to obs.
[62]:
adata
[62]:
AnnData object with n_obs × n_vars = 1776 × 54
    obs: 'service_unit', 'day_icu_intime'
    var: 'ehrapy_column_type', 'feature_type'
    uns: 'encoding_to_var', 'original_values_categoricals', 'var_to_encoding'
    layers: 'original'

After one-hot encoding the two features, we have expanded our matrix from 46 to 54 features.
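As noted above, other encodings are available. As a hedged sketch, assuming ehrapy's encode also accepts a "label" mode, label encoding would map each category to a single integer column; because this implies an ordering, it is usually less suitable for nominal features such as service_unit:

# hypothetical alternative on the unencoded object; label encoding implies an order
adata_raw = ep.dt.mimic_2(encoded=False)
adata_label = ep.pp.encode(adata_raw, encodings={"label": ["service_unit", "day_icu_intime"]})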

To verify that we have encoded all features correctly and are ready to proceed, we can use the type_overview() function.

[63]:
ep.ad.type_overview(adata)
Variable names for AnnData object with 1776 obs and 54 vars
╠══ 🔐 Encoded variables
║   ╠══ service_unit -> 3 categories; one-hot encoded; original data type: string
║   ╚══ day_icu_intime -> 7 categories; one-hot encoded; original data type: string
╚══ 🔓 Unencoded variables
    ╠══ aline_flg -> data type: floating
    ╠══ icu_los_day -> data type: floating
    ╠══ hospital_los_day -> data type: floating
    ╠══ age -> data type: floating
    ╠══ gender_num -> data type: floating
    ╠══ weight_first -> data type: floating
    ╠══ bmi -> data type: floating
    ╠══ sapsi_first -> data type: floating
    ╠══ sofa_first -> data type: floating
    ╠══ service_num -> data type: floating
    ╠══ day_icu_intime_num -> data type: floating
    ╠══ hour_icu_intime -> data type: floating
    ╠══ hosp_exp_flg -> data type: floating
    ╠══ icu_exp_flg -> data type: floating
    ╠══ day_28_flg -> data type: floating
    ╠══ mort_day_censored -> data type: floating
    ╠══ censor_flg -> data type: floating
    ╠══ sepsis_flg -> data type: floating
    ╠══ chf_flg -> data type: floating
    ╠══ afib_flg -> data type: floating
    ╠══ renal_flg -> data type: floating
    ╠══ liver_flg -> data type: floating
    ╠══ copd_flg -> data type: floating
    ╠══ cad_flg -> data type: floating
    ╠══ stroke_flg -> data type: floating
    ╠══ mal_flg -> data type: floating
    ╠══ resp_flg -> data type: floating
    ╠══ map_1st -> data type: floating
    ╠══ hr_1st -> data type: floating
    ╠══ temp_1st -> data type: floating
    ╠══ spo2_1st -> data type: floating
    ╠══ abg_count -> data type: floating
    ╠══ wbc_first -> data type: floating
    ╠══ hgb_first -> data type: floating
    ╠══ platelet_first -> data type: floating
    ╠══ sodium_first -> data type: floating
    ╠══ potassium_first -> data type: floating
    ╠══ tco2_first -> data type: floating
    ╠══ chloride_first -> data type: floating
    ╠══ bun_first -> data type: floating
    ╠══ creatinine_first -> data type: floating
    ╠══ po2_first -> data type: floating
    ╠══ pco2_first -> data type: floating
    ╚══ iv_day_1 -> data type: floating

Quality Control (QC)

Demographics distribution

To see whether there are strong differences across demographics, we can inspect the respective features in a violin plot.

[64]:
ep.settings.set_figure_params(figsize=(4, 3), dpi=100)
ep.pl.violin(adata, keys=["age"], groupby="service_unit")
[figure: violin plot of age grouped by service unit]

Missing values

ehrapy’s pp.qc_metrics() function calculates several useful metrics, such as the absolute number and percentage of missing values, as well as properties like the mean/median/min/max of all features. The percentage of missing values is important, as features with too many missing values should not be included.

[65]:
obs_metric, var_metrics = ep.pp.qc_metrics(adata)
[66]:
obs_metric
[66]:
missing_values_abs missing_values_pct
0 0 0.000000
1 12 22.222222
2 0 0.000000
3 3 5.555556
4 0 0.000000
... ... ...
1771 1 1.851852
1772 1 1.851852
1773 3 5.555556
1774 1 1.851852
1775 1 1.851852

1776 rows × 2 columns

[67]:
var_metrics
[67]:
missing_values_abs missing_values_pct mean median standard_deviation min max iqr_outliers
ehrapycat_service_unit_FICU 0 0.000000 NaN NaN NaN NaN NaN False
ehrapycat_service_unit_MICU 0 0.000000 NaN NaN NaN NaN NaN False
ehrapycat_service_unit_SICU 0 0.000000 NaN NaN NaN NaN NaN False
ehrapycat_day_icu_intime_Friday 0 0.000000 NaN NaN NaN NaN NaN False
ehrapycat_day_icu_intime_Monday 0 0.000000 NaN NaN NaN NaN NaN False
ehrapycat_day_icu_intime_Saturday 0 0.000000 NaN NaN NaN NaN NaN False
ehrapycat_day_icu_intime_Sunday 0 0.000000 NaN NaN NaN NaN NaN False
ehrapycat_day_icu_intime_Thursday 0 0.000000 NaN NaN NaN NaN NaN False
ehrapycat_day_icu_intime_Tuesday 0 0.000000 NaN NaN NaN NaN NaN False
ehrapycat_day_icu_intime_Wednesday 0 0.000000 NaN NaN NaN NaN NaN False
aline_flg 0 0.000000 0.554054 1.000000 0.497070 0.000000 1.000000 False
icu_los_day 0 0.000000 3.346498 2.185000 3.355316 0.500000 28.240000 True
hospital_los_day 0 0.000000 8.110923 6.000000 8.154862 1.000000 112.000000 True
age 0 0.000000 54.379660 53.678585 21.056924 15.180230 99.110950 False
gender_num 1 0.056306 0.577465 1.000000 0.493963 0.000000 1.000000 False
weight_first 110 6.193694 80.075948 77.000000 22.483765 30.000000 257.600000 True
bmi 466 26.238739 27.827316 26.324846 8.206940 12.784877 98.797134 True
sapsi_first 85 4.786036 14.136606 14.000000 4.113085 3.000000 32.000000 True
sofa_first 6 0.337838 5.820904 6.000000 2.334006 0.000000 17.000000 True
service_num 0 0.000000 0.552928 1.000000 0.497191 0.000000 1.000000 False
day_icu_intime_num 0 0.000000 4.054054 4.000000 1.993911 1.000000 7.000000 False
hour_icu_intime 0 0.000000 10.585586 9.000000 7.922733 0.000000 23.000000 False
hosp_exp_flg 0 0.000000 0.137387 0.000000 0.344256 0.000000 1.000000 True
icu_exp_flg 0 0.000000 0.095721 0.000000 0.294208 0.000000 1.000000 True
day_28_flg 0 0.000000 0.159347 0.000000 0.365999 0.000000 1.000000 True
mort_day_censored 0 0.000000 614.329825 731.000000 402.996047 0.000000 3094.080000 True
censor_flg 0 0.000000 0.720158 1.000000 0.448922 0.000000 1.000000 False
sepsis_flg 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 False
chf_flg 0 0.000000 0.119932 0.000000 0.324883 0.000000 1.000000 True
afib_flg 0 0.000000 0.116554 0.000000 0.320888 0.000000 1.000000 True
renal_flg 0 0.000000 0.033784 0.000000 0.180672 0.000000 1.000000 True
liver_flg 0 0.000000 0.055743 0.000000 0.229425 0.000000 1.000000 True
copd_flg 0 0.000000 0.088401 0.000000 0.283877 0.000000 1.000000 True
cad_flg 0 0.000000 0.069257 0.000000 0.253890 0.000000 1.000000 True
stroke_flg 0 0.000000 0.125000 0.000000 0.330719 0.000000 1.000000 True
mal_flg 0 0.000000 0.144144 0.000000 0.351236 0.000000 1.000000 True
resp_flg 0 0.000000 0.318131 0.000000 0.465751 0.000000 1.000000 False
map_1st 0 0.000000 88.246998 87.000000 17.590711 5.000000 195.000000 True
hr_1st 0 0.000000 87.914977 87.000000 18.753561 30.000000 158.000000 True
temp_1st 3 0.168919 97.792194 98.099998 4.539520 32.000000 104.800003 True
spo2_1st 0 0.000000 98.432995 100.000000 5.510842 4.000000 100.000000 True
abg_count 0 0.000000 5.984797 3.000000 8.681962 0.000000 115.000000 True
wbc_first 8 0.450450 12.320396 11.300000 6.597979 0.170000 109.800000 True
hgb_first 8 0.450450 12.551584 12.700000 2.200953 2.000000 19.000000 True
platelet_first 8 0.450450 246.083145 239.000000 99.837223 7.000000 988.000000 True
sodium_first 5 0.281532 139.559006 140.000000 4.724875 105.000000 165.000000 True
potassium_first 5 0.281532 4.107623 4.000000 0.794499 1.900000 9.800000 True
tco2_first 5 0.281532 24.416657 24.000000 4.990763 2.000000 62.000000 True
chloride_first 5 0.281532 103.839074 104.000000 5.732664 78.000000 133.000000 True
bun_first 5 0.281532 19.277809 15.000000 14.362833 2.000000 139.000000 True
creatinine_first 6 0.337838 1.095706 0.900000 1.083171 0.000000 18.300000 True
po2_first 186 10.472973 227.623270 195.000000 144.817841 22.000000 634.000000 False
pco2_first 186 10.472973 43.413836 41.000000 13.976388 8.000000 158.000000 True
iv_day_1 143 8.051802 1622.907946 1081.529175 1676.615567 0.000000 13910.000000 True

All metrics are added to the respective AnnData slots: per-patient metrics can be found in obs, while per-feature metrics are stored in var. When inspecting both, we see that our QC metrics were added for each feature where possible.

[68]:
adata.obs.head(4)
[68]:
service_unit day_icu_intime missing_values_abs missing_values_pct
0 SICU Friday 0 0.000000
1 MICU Saturday 12 22.222222
2 MICU Friday 0 0.000000
3 SICU Saturday 3 5.555556
[69]:
adata.var.tail(4)
[69]:
ehrapy_column_type feature_type missing_values_abs missing_values_pct mean median standard_deviation min max iqr_outliers
creatinine_first numeric numeric 6 0.337838 1.095706 0.900000 1.083171 0.0 18.3 True
po2_first numeric numeric 186 10.472973 227.623270 195.000000 144.817841 22.0 634.0 False
pco2_first numeric numeric 186 10.472973 43.413836 41.000000 13.976388 8.0 158.0 True
iv_day_1 numeric numeric 143 8.051802 1622.907946 1081.529175 1676.615567 0.0 13910.0 True

We can also visualize the missing values in a histogram for both obs and var.

[70]:
axd = plt.figure(constrained_layout=True, figsize=(8, 4), dpi=100).subplot_mosaic(
    """
    AB
    """
)

sns.histplot(
    adata.obs["missing_values_pct"], ax=axd["A"], bins=30, color="#54C285"
).set(title="pct of missing values: obs")
sns.histplot(
    adata.var["missing_values_pct"], ax=axd["B"], bins=30, color="#1FA6C9"
).set(title="pct of missing values: var")
[70]:
[Text(0.5, 1.0, 'pct of missing values: var')]
[figure: histograms of missing-value percentages in obs and var]

We can also check which entries in obs and var have the highest percentage of missing values.

[71]:
adata.obs.loc[
    adata.obs["missing_values_pct"] == adata.obs["missing_values_pct"].max(), :
]
[71]:
service_unit day_icu_intime missing_values_abs missing_values_pct
1732 SICU Thursday 14 25.925926
1751 MICU Tuesday 14 25.925926
[72]:
adata.var.loc[
    adata.var["missing_values_pct"] == adata.var["missing_values_pct"].max(), :
]
[72]:
ehrapy_column_type feature_type missing_values_abs missing_values_pct mean median standard_deviation min max iqr_outliers
bmi numeric numeric 466 26.238739 27.827316 26.324846 8.20694 12.784877 98.797134 True

Overall, the percentage of missing values across features is rather low; however, some features are still incomplete.

Features with missing values can introduce a bias into the data, making processing and analysis challenging. To prevent the loss of information from dropping multiple features, we can fill in the missing values by imputation, inferring them from the existing part of the data.

For imputation to work well, we suggest dropping features whose percentage of missing values is very high (>60%). In our data, there is no need to drop any feature, since none exceeds 27% missing values (bmi, in var).
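If some features did exceed such a threshold, a minimal sketch for removing them, assuming the QC metrics computed above are present in adata.var, could be:

# keep only features with at most 60% missing values
adata = adata[:, adata.var["missing_values_pct"] <= 60].copy()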

Missing data imputation

ehrapy offers many options to impute missing values in an AnnData object. Here, we use KNN imputation with 5 neighbours (n_neighbours=5, the default value). The KNN algorithm uses proximity to predict a feature's missing values: it finds the k nearest neighbours of the affected observation and imputes the missing value based on the non-missing values in that neighbourhood.

[73]:
ep.pp.knn_impute(adata, n_neighbours=5)
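Other strategies are available as well. As a hedged alternative, assuming ep.pp.simple_impute is available as in current ehrapy versions, a per-feature median imputation could look like this:

# hypothetical alternative: replace missing entries with each feature's median
ep.pp.simple_impute(adata, strategy="median")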

After recalculating the QC metrics, we can check the percentage of missing values again.

[74]:
_ = ep.pp.qc_metrics(adata)
[75]:
axd = plt.figure(constrained_layout=True, figsize=(8, 3), dpi=100).subplot_mosaic(
    """
    AB
    """
)

sns.histplot(adata.obs["missing_values_pct"], ax=axd["A"], bins=5, color="#54C285").set(
    title="pct of missing values: obs", xlim=(0, 30)
)
sns.histplot(adata.var["missing_values_pct"], ax=axd["B"], bins=5, color="#1FA6C9").set(
    title="pct of missing values: var", xlim=(0, 30)
)
[75]:
[Text(0.5, 1.0, 'pct of missing values: var'), (0.0, 30.0)]
[figure: missing-value percentages in obs and var after imputation]

Data distribution

Depending on what is measured and in which unit, the value ranges of features can differ enormously. Clustering and differential comparisons in particular can be strongly influenced by exceptionally large values.

[76]:
axd = plt.figure(constrained_layout=True, figsize=(8, 3), dpi=100).subplot_mosaic(
    """
    AB
    """
)

sns.histplot(adata.var["min"], ax=axd["A"], bins=30, color="#54C285").set(
    title="minimum values"
)
sns.histplot(adata.var["max"], ax=axd["B"], bins=30, color="#1FA6C9").set(
    title="maximum values"
)
[76]:
[Text(0.5, 1.0, 'maximum values')]
[figure: histograms of feature minimum and maximum values]

Moreover, features with a very high coefficient of variation (the standard deviation divided by the mean, here expressed as a percentage) can strongly influence dimensionality reduction. However, since the coefficient of variation is unreliable for features with small means, we only consider features whose mean is not small.

[77]:
adata.var["coefficient.variation"] = (
    adata.var["standard_deviation"] / adata.var["mean"]
) * 100
[78]:
adata.var.loc[(adata.var["coefficient.variation"] > 50) & (adata.var["mean"] > 50),]
[78]:
ehrapy_column_type feature_type missing_values_abs missing_values_pct mean median standard_deviation min max iqr_outliers coefficient.variation
mort_day_censored numeric numeric 0 0.0 614.329825 731.000000 402.996046 0.0 3094.080078 True 65.599297
po2_first numeric numeric 0 0.0 227.499346 200.040001 138.524999 22.0 634.000000 True 60.890285
iv_day_1 numeric numeric 0 0.0 1571.128036 1027.599976 1624.982216 0.0 13910.000000 True 103.427740

The standard deviations and coefficients of variation of the features iv_day_1 (input fluids by IV on day 1 in mL) and po2_first (first PaO2 in mmHg) are very high, with a strong spread between minimum and maximum values. These features require normalization.

Normalization

ehrapy offers several options to normalize data. While it is possible to normalize all numerical values at once with the same normalization function, normalizing only the features with high spread, here iv_day_1 and po2_first, can be sufficient. Log normalization with an offset of 1, which adds a pseudocount, seems appropriate.

Note: if features with negative values are to be normalized, the pp.offset_negative_values() function has to be applied prior to normalization.
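A hedged sketch of that workflow (hypothetical here, since this dataset contains no negative values):

# shift affected features so that their minimum becomes 0, then log-normalize
ep.pp.offset_negative_values(adata)
ep.pp.log_norm(adata, offset=1)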

[79]:
ep.pp.log_norm(adata, vars=["iv_day_1", "po2_first"], offset=1)
[79]:
AnnData object with n_obs × n_vars = 1776 × 54
    obs: 'service_unit', 'day_icu_intime', 'missing_values_abs', 'missing_values_pct'
    var: 'ehrapy_column_type', 'feature_type', 'missing_values_abs', 'missing_values_pct', 'mean', 'median', 'standard_deviation', 'min', 'max', 'iqr_outliers', 'coefficient.variation'
    uns: 'encoding_to_var', 'original_values_categoricals', 'var_to_encoding', 'service_unit_colors', 'normalization'
    layers: 'original', 'raw_norm'

After normalization, we can calculate the QC metrics again and check the distribution.

[80]:
_ = ep.pp.qc_metrics(adata)
[81]:
adata.var["coefficient.variation"] = (
    adata.var["standard_deviation"] / adata.var["mean"]
) * 100
adata.var.loc[(adata.var["coefficient.variation"] > 50) & (adata.var["mean"] > 50),]
[81]:
ehrapy_column_type feature_type missing_values_abs missing_values_pct mean median standard_deviation min max iqr_outliers coefficient.variation
mort_day_censored numeric numeric 0 0.0 614.329825 731.0 402.996046 0.0 3094.080078 True 65.599297

The strong spread of iv_day_1 and po2_first was successfully removed. Now that we have reduced the influence of these features, we can continue with dimensionality reduction.

Dimensionality reduction reduces the number of features (dimensions) by projecting the data into a lower-dimensional latent space while retaining as much information as possible. This is very useful for high-dimensional data, since it reduces complexity and facilitates visualization.

Dimensionality reduction

Principal Component Analysis (PCA)

As a next step, we reduce the dimensionality of the dataset with principal component analysis (PCA). We can also visualize selected principal components with ehrapy using the components argument.

[82]:
ep.pp.pca(adata)
[83]:
ep.pl.pca(adata, color="service_unit", components=["1,2", "3,4"])
ep.pl.pca(adata, color="service_unit", components=["5,6", "7,8"])
[figures: PCA embeddings colored by service unit for components 1/2, 3/4, 5/6, and 7/8]

To inspect certain PCs further, we can look at the PC loadings, which highlight the features that contribute most strongly to the selected PCs.

[84]:
ep.pl.pca_loadings(adata, components="1, 2")
ep.pl.pca_loadings(adata, components="3, 4")
[figures: PCA loadings for components 1-4]

Uniform Manifold Approximation and Projection (UMAP)

The reduced representation can then be used as input for the neighbour graph calculation, which in turn serves as input for advanced embeddings and visualizations such as Uniform Manifold Approximation and Projection (UMAP).
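How many principal components to pass to the neighbour graph (n_pcs) is a judgment call. As a hedged aid, assuming ehrapy exposes scanpy's variance-ratio plot under the same name, the explained variance per component can be inspected like this:

# elbow plot of explained variance to guide the choice of n_pcs
ep.pl.pca_variance_ratio(adata, n_pcs=30)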

[85]:
ep.pp.neighbors(adata, n_pcs=10)
[86]:
ep.tl.umap(adata)

Checking for Batch effects

Before exploring the data further, we need to check for batch effects, which can arise, e.g., from different collection units or collection days. To check whether our data contains batch effects for these features, we visualize service_unit and day_icu_intime.

If a batch effect is present, we can use the pp.combat() function to remove such effects.
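As a hedged sketch, assuming ehrapy's combat mirrors scanpy's signature with a key argument naming the batch column, a correction for the service unit could look like this (followed by recomputing the neighbour graph and UMAP):

# hypothetical batch correction using the service unit as the batch key
ep.pp.combat(adata, key="service_unit")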

[87]:
ep.settings.set_figure_params(figsize=(6, 5))
ep.pl.umap(
    adata,
    color=[
        "service_unit",
        "day_icu_intime",
    ],
    wspace=0.5,
    size=20,
    title=["Service unit", "Day of ICU admission"],
)
[figure: UMAP colored by service unit and day of ICU admission]

The embeddings suggest that there’s no strong effect by the aforementioned potential confounders.

Selected features on UMAP

Now we can also highlight other relevant features on the UMAP. Interesting features could be demographics, hospital statistics and lab parameters.

Demographics

[88]:
ep.pl.umap(
    adata,
    color=["gender_num", "age"],
    wspace=0.5,
    size=20,
    title=["sex (1 = male; 0=female)", "age"],
)
[figure: UMAP colored by sex and age]

Hospital statistics

[89]:
ep.pl.umap(
    adata,
    color=["icu_los_day", "hosp_exp_flg"],
    wspace=0.5,
    size=20,
    cmap=ep.pl.Colormaps.grey_red.value,
    title=["length of stay in ICU (days)", "death in hospital (1 = yes, 0 = no)"],
)
[figure: UMAP colored by ICU length of stay and in-hospital death]

Comorbidities

[90]:
ep.pl.umap(
    adata,
    color=["liver_flg", "stroke_flg"],
    cmap=ep.pl.Colormaps.grey_red.value,
    title=["Liver disease", "Stroke"],
    ncols=2,
    size=20,
)
[figure: UMAP colored by liver disease and stroke]

Lab parameters

[91]:
ep.pl.umap(
    adata,
    color=["hr_1st", "platelet_first", "po2_first", "pco2_first"],
    wspace=0.5,
    ncols=2,
    size=20,
    title=["Heart Rate", "Platelets (K/u)", "PaO2 (mmHg)", "PaCO2 (mmHg)"],
)
[figure: UMAP colored by heart rate, platelets, PaO2, and PaCO2]

Cluster analysis

To make more sense of the embedding, it is oftentimes useful to determine clusters, e.g. through community detection as implemented in the Leiden algorithm. Moreover, clustering allows for the unbiased detection of features that differ between clusters and are therefore interesting to us.

Cluster identification

The implementation in ehrapy allows setting a resolution, which influences the number of clusters found. It is oftentimes useful to experiment with this parameter.

[92]:
ep.tl.leiden(adata, resolution=0.3, key_added="leiden_0_3")
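To compare several granularities side by side, one could, as a sketch, compute multiple resolutions and store each result under its own obs key:

# store one clustering per resolution, e.g. leiden_0_1, leiden_0_3, leiden_0_5
for res in (0.1, 0.3, 0.5):
    ep.tl.leiden(adata, resolution=res, key_added=f"leiden_{str(res).replace('.', '_')}")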

The Leiden algorithm added a key to obs (leiden_0_3) that stores the clusters. These can subsequently be visualized in the UMAP embedding.

[93]:
adata.obs.head(4)
[93]:
service_unit day_icu_intime missing_values_abs missing_values_pct leiden_0_3
0 SICU Friday 0 0.0 1
1 MICU Saturday 0 0.0 0
2 MICU Friday 0 0.0 3
3 SICU Saturday 0 0.0 1
[94]:
ep.pl.umap(adata, color=["leiden_0_3"], title="Leiden 0.3", size=20)
[figure: UMAP colored by Leiden clusters at resolution 0.3]

Next, we can explore features that are characteristic of certain clusters and could therefore be used for annotation.

Cluster features

To identify cluster-specific markers, ehrapy provides the ep.tl.rank_features_groups() function, which applies statistical tests between the cluster groups to determine significantly increased or decreased values.

[95]:
ep.tl.rank_features_groups(adata, groupby="leiden_0_3")
[96]:
ep.settings.set_figure_params(figsize=(4, 4), dpi=100)
ep.pl.rank_features_groups(adata, key="rank_features_groups", ncols=2)
[figure: top ranked features per Leiden cluster]

We can also get the top features per cluster as a DataFrame.

[97]:
df = ep.ad.get_rank_features_df(adata, group=["0", "1", "2", "3", "4", "5"])
df = df.loc[(df["logfoldchanges"] > 0) & (df["pvals_adj"] < 0.05),]

For example, we can check the top markers of cluster 2.

[98]:
df.loc[df["group"] == "2",]
[98]:
group names scores logfoldchanges pvals pvals_adj
109 2 censor_flg 23.874987 0.858043 1.792385e-109 4.839440e-108
113 2 mort_day_censored 12.332548 inf 2.671621e-33 2.404459e-32
114 2 sofa_first 7.506701 1.623888 3.115968e-13 2.403747e-12
116 2 liver_flg 5.300070 2.029637 1.963495e-07 1.178097e-06
118 2 chloride_first 4.767494 2.487876 2.480180e-06 1.217543e-05
119 2 gender_num 4.483568 0.403081 9.012352e-06 4.055558e-05
120 2 weight_first 4.470621 9.405868 9.859004e-06 4.095278e-05
127 2 po2_first 2.615416 0.151288 9.169249e-03 2.475697e-02
128 2 aline_flg 2.580327 0.251710 1.014821e-02 2.609539e-02
129 2 ehrapycat_service_unit_MICU 6.132956 1.000000 1.326848e-02 3.256809e-02

From this table we can also extract the top features of every cluster and highlight them either on the UMAP or as violin plots grouped by cluster.

[99]:
top_features = df.groupby("group").head(5)
top_features = pd.Series(top_features["names"].unique())
top_features
[99]:
0                      censor_flg
1               mort_day_censored
2                       hgb_first
3                  chloride_first
4                    sodium_first
5                      day_28_flg
6                    hosp_exp_flg
7                             age
8                     icu_exp_flg
9                     sapsi_first
10                     sofa_first
11                      liver_flg
12                 platelet_first
13                      wbc_first
14                         hr_1st
15                        mal_flg
16                       copd_flg
17                        chf_flg
18    ehrapycat_service_unit_SICU
19                     tco2_first
dtype: object
[100]:
ep.settings.set_figure_params(figsize=(3.8, 2), dpi=100)
ep.pl.violin(adata, keys=["censor_flg", "mort_day_censored"], groupby="leiden_0_3")
ep.pl.violin(adata, keys=["platelet_first", "age"], groupby="leiden_0_3")
ep.pl.violin(adata, keys=["sapsi_first", "copd_flg"], groupby="leiden_0_3")
ep.pl.violin(adata, keys=["sofa_first", "liver_flg"], groupby="leiden_0_3")
[figures: violin plots of top cluster features grouped by Leiden clusters]

Cluster annotation

With the knowledge of the cluster features, together with the UMAP plots from above, we can annotate the clusters.

[101]:
adata.obs["annotation"] = "NA"
[102]:
annotation = {
    "0": "liver+/sofa+",
    "1": "weight+",
    "2": "age+/stroke+/deceased+",
    "3": "platelet+",
    "4": "age+/malignancy+/copd+/deceased+",
    "5": "age+",
}
[103]:
adata.obs["annotation"] = [
    annotation[l] if l in annotation.keys() else l for l in adata.obs["leiden_0_3"]
]
[104]:
ep.settings.set_figure_params(figsize=(4, 3), dpi=100)
ep.pl.umap(
    adata,
    color="annotation",
    size=20,
    palette={
        "weight+": "#007742",
        "platelet+": "#54C285",
        "age+/stroke+/deceased+": "#087A96",
        "age+/malignancy+/copd+/deceased+": "#1FA6C9",
        "age+": "#F4CC47",
        "liver+/sofa+": "#57C8B9",
        "platelet+/heart_rate+": "#ABEC7D",
    },
)
[figure: UMAP colored by cluster annotation]

Additional downstream analysis

After these basic ehrapy analysis steps, additional downstream analyses can be performed (see also the other tutorials).

PAGA

It might also be of interest to infer trajectories to learn about dynamic processes and stage transitions. ehrapy offers several trajectory inference algorithms for this purpose, one of which is partition-based graph abstraction (PAGA).

[105]:
ep.tl.paga(adata, groups="leiden_0_3")
[106]:
ep.pl.paga(
    adata,
    color=["leiden_0_3", "day_28_flg"],
    cmap=ep.pl.Colormaps.grey_red.value,
    title=["Leiden 0.3", "Died in less than 28 days"],
)
[figure: PAGA graph colored by Leiden clusters and 28-day mortality]
[107]:
ep.tl.umap(adata, init_pos="paga")
ep.pl.umap(adata, color=["annotation"])
[figure: UMAP initialized with PAGA, colored by annotation]
[108]:
ep.tl.draw_graph(adata, init_pos="paga")
WARNING: Package 'fa2' is not installed, falling back to layout 'fr'.To use the faster and better ForceAtlas2 layout, install package 'fa2' (`pip install fa2`).
[109]:
ep.tl.draw_graph(adata, init_pos="paga")
ep.pl.draw_graph(adata, color=["leiden_0_3", "day_28_flg"], legend_loc="on data")
WARNING: Package 'fa2' is not installed, falling back to layout 'fr'.To use the faster and better ForceAtlas2 layout, install package 'fa2' (`pip install fa2`).
[figure: force-directed graph colored by Leiden clusters and 28-day mortality]

Exporting results

We save all of our computations and the final state of the AnnData object into an .h5ad file, which can later be read again using the ep.io.read() function:

[110]:
ep.io.write("mimic_2.h5ad", adata)
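To restore the saved object later, a sketch using the reader named above:

adata = ep.io.read("mimic_2.h5ad")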

Conclusion

The MIMIC-II IAC dataset comprises electronic health records (EHR) summarized in 46 features from 1776 individuals. Such high-dimensional data is not easy to interpret, and many interesting, previously unknown patterns can be overlooked when focusing only on selected well-defined features. To overcome this hurdle, we applied ehrapy to the MIMIC-II IAC dataset.

ehrapy builds on the AnnData data structure and the scanpy pipeline to allow for efficient analysis. We used the built-in functions to preprocess the data, perform QC with imputation of missing values, and reduce the dimensionality, resulting in PCA and UMAP embeddings. After performing all these steps, we explored the data by visualizing multiple features on the UMAP embedding, giving a first glance at the patient structure. To identify patient groups in an unbiased fashion, we clustered the data using the Leiden algorithm, resulting in six patient clusters. Calculating cluster-specific features allowed us to annotate the clusters according to their most prominent markers. We saw a clear separation between patients of higher age who died and had severe comorbidities such as stroke (cluster 2) or malignancy and COPD (cluster 4), and patients with milder features such as increased weight (cluster 1) or platelets (cluster 3). The remaining clusters were characterized by liver disease with high SOFA scores (cluster 0) and higher age (cluster 5), indicating potential patient trajectories.

To explore the patient fate and survival in more detail, continue with our other tutorials or go back to our tutorial overview page.


References

  • Raffa, J. (2016). Clinical data from the MIMIC-II database for a case study on indwelling arterial catheters (version 1.0). PhysioNet. https://doi.org/10.13026/C2NC7F.

  • Raffa J.D., Ghassemi M., Naumann T., Feng M., Hsu D. (2016) Data Analysis. In: Secondary Analysis of Electronic Health Records. Springer, Cham

  • Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., … & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

  • McInnes et al., (2018). UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software, 3(29), 861, https://doi.org/10.21105/joss.00861

  • Traag, V.A., Waltman, L. & van Eck, N.J. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9, 5233 (2019). https://doi.org/10.1038/s41598-019-41695-z

  • Wolf, F.A., Hamey, F.K., Plass, M. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol 20, 59 (2019). https://doi.org/10.1186/s13059-019-1663-x

Package versions

[111]:
ep.print_versions()
-----
ehrapy              0.7.0
rich                NA
scanpy              1.10.1
session_info        1.0.0
-----
PIL                         10.3.0
anndata                     0.10.3
anyio                       NA
arrow                       1.3.0
astor                       0.8.1
asttokens                   NA
attr                        23.1.0
attrs                       23.1.0
autograd                    NA
autograd_gamma              NA
babel                       2.14.0
bs4                         4.12.2
cachetools                  5.3.2
category_encoders           2.6.3
causallearn                 NA
certifi                     2023.11.17
cffi                        1.16.0
chardet                     5.2.0
charset_normalizer          3.3.2
comm                        0.2.1
cvxopt                      1.3.2
cycler                      0.12.1
cython_runtime              NA
dateutil                    2.8.2
db_dtypes                   1.2.0
debugpy                     1.8.0
decorator                   5.1.1
deep_translator             1.9.1
deepl                       1.16.1
defusedxml                  0.7.1
dill                        0.3.7
dot_parser                  NA
dowhy                       0.11
executing                   2.0.1
faiss                       1.8.0
fastjsonschema              NA
fhiry                       3.2.2
fknni                       1.1.0
formulaic                   0.6.6
fqdn                        NA
future                      0.18.3
google                      NA
h5py                        3.10.0
idna                        3.6
igraph                      0.10.8
imblearn                    0.12.2
interface_meta              1.3.0
ipykernel                   6.28.0
ipywidgets                  8.1.2
isoduration                 NA
jedi                        0.19.1
jinja2                      3.0.3
joblib                      1.4.0
json5                       0.9.24
jsonpointer                 2.4
jsonschema                  4.21.1
jsonschema_specifications   NA
jupyter_events              0.10.0
jupyter_server              2.13.0
jupyterlab_server           2.26.0
kiwisolver                  1.4.5
lamin_utils                 0.13.1
legacy_api_wrap             NA
leidenalg                   0.10.1
lifelines                   0.27.8
llvmlite                    0.41.1
markupsafe                  2.1.3
matplotlib                  3.8.4
matplotlib_inline           0.1.6
missingno                   0.5.2
mpl_toolkits                NA
mpmath                      1.3.0
natsort                     8.4.0
nbformat                    5.10.4
networkx                    3.2.1
numba                       0.58.1
numpy                       1.26.4
overrides                   NA
packaging                   23.2
pandas                      2.2.2
parso                       0.8.3
patsy                       0.5.4
pexpect                     4.9.0
pickleshare                 0.7.5
platformdirs                4.1.0
prometheus_client           NA
prompt_toolkit              3.0.43
psutil                      5.9.7
ptyprocess                  0.7.0
pure_eval                   0.2.2
pyarrow                     14.0.2
pyasn1                      0.5.1
pyasn1_modules              0.3.0
pycparser                   2.22
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.9.5
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pydot                       1.4.2
pygments                    2.17.2
pynndescent                 0.5.11
pyparsing                   3.1.2
pythonjsonlogger            NA
pytz                        2024.1
rapidfuzz                   3.5.2
referencing                 NA
requests                    2.31.0
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
rpds                        NA
rsa                         4.9
scipy                       1.13.0
seaborn                     0.13.2
send2trash                  NA
setuptools                  68.2.2
six                         1.16.0
sklearn                     1.4.2
sniffio                     1.3.0
soupsieve                   2.5
sparse                      0.14.0
stack_data                  0.6.3
statsmodels                 0.14.1
swig_runtime_data4          NA
sympy                       1.12
tableone                    0.8.0
tabulate                    0.9.0
texttable                   1.7.0
thefuzz                     0.20.0
threadpoolctl               3.4.0
torch                       2.1.2+cu121
torchgen                    NA
tornado                     6.4
tqdm                        4.66.1
traitlets                   5.14.1
typing_extensions           NA
umap                        0.5.5
uri_template                NA
urllib3                     2.0.7
vscode                      NA
wcwidth                     0.2.13
webcolors                   1.13
websocket                   1.7.0
wrapt                       1.16.0
yaml                        6.0.1
zmq                         25.1.2
-----
IPython             8.20.0
jupyter_client      8.6.0
jupyter_core        5.7.1
jupyterlab          4.1.6
notebook            7.1.2
-----
Python 3.11.7 | packaged by conda-forge | (main, Dec 15 2023, 08:38:37) [GCC 12.3.0]
Linux-6.8.7-arch1-1-x86_64-with-glibc2.39
-----
Session information updated at 2024-04-25 23:35