Note

This page was generated from mimic_2_fate.ipynb. Some tutorial content may look better in light mode.

MIMIC-II IAC Patient Fate

In the previous introduction tutorial, we explored the MIMIC-II IAC dataset, comprising electronic health records (EHR) of 1776 patients with 46 features, and identified patient group-specific clusters using ehrapy. Please go through the MIMIC-II IAC introduction before starting this tutorial to get familiar with the dataset.

As a next step, we want to determine patient fate. The goal is to detect terminal states and the corresponding origins based on pseudotime. Real time very rarely reflects the actual progression of a disease. When measurements are taken on a certain day, some patients show no sign of disease (e.g. healthy or recovered), some are at the onset of a specific disease, and some are in a more severe stage or even at its peak. For an appropriate analysis, we are interested in a continuous transition of states, such as from healthy to diseased to death, for which real time is therefore not informative. Transition states can be identified by defining source states (e.g. healthy) and then calculating pseudotime from these states. Based on Markov chain modelling, we uncover patient dynamics using CellRank. For more details, please read CellRank paper 1 and CellRank paper 2.

In this tutorial we will be using CellRank to:

  1. Simulate patient trajectories with random walks.

  2. Compute patient macrostates and infer fate probabilities towards predicted terminal states.

  3. Identify potential driver features for each identified trajectory.

  4. Visualize feature trends along specific patient states, while accounting for the continuous nature of fate determination.

Before starting this tutorial, we highly recommend reading the extensive and well-written CellRank documentation; the general tutorial chapter is especially useful. If you are not familiar with single-cell data, do not be afraid: simply replace cells with patient visits and genes with features in your mind.

This tutorial requires cellrank to be installed. As this package is not a dependency of ehrapy, it must be installed separately.

[1]:
%%capture --no-display
!pip install cellrank

Before we start with the patient fate analysis of the MIMIC-II IAC dataset, we set up our environment including the import of packages and preparation of the dataset.


Environment setup

Ensure that the latest version of ehrapy is installed. A list of all dependency versions can be found at the end of this tutorial.

[2]:
import ehrapy as ep
import cellrank as cr
import numpy as np

We are ignoring a few warnings for readability reasons.

[3]:
import warnings

warnings.filterwarnings("ignore")

Getting and preprocessing the MIMIC-II dataset

This tutorial is based on the MIMIC-II IAC dataset, which was previously introduced in the MIMIC-II IAC introduction tutorial. We load the unencoded version of the dataset as an AnnData object (encoded=False) and encode it ourselves during preprocessing. ehrapy’s default encoding is a simple one-hot encoding in this case.

[4]:
adata = ep.dt.mimic_2(encoded=False)
adata
[4]:
AnnData object with n_obs × n_vars = 1776 × 46
    var: 'ehrapy_column_type'
    layers: 'original'

The MIMIC-II dataset has 1776 patients with 46 features. Now that we have our AnnData object ready, we repeat the standard preprocessing steps from the introduction tutorial before we can use ehrapy and CellRank for patient fate analysis.

[5]:
ep.ad.infer_feature_types(adata)
2024-04-25 21:47:36,455 - root INFO - Stored feature types in adata.var['feature_type']. Please verify and adjust if necessary using adata.var['feature_type']['feature1']='corrected_type'.
 Detected feature types for AnnData object with 1776 obs and 46 vars
╠══ 📅 Date features
╠══ 📐 Numerical features
║   ╠══ abg_count
║   ╠══ age
║   ╠══ bmi
║   ╠══ bun_first
║   ╠══ chloride_first
║   ╠══ creatinine_first
║   ╠══ hgb_first
║   ╠══ hospital_los_day
║   ╠══ hr_1st
║   ╠══ icu_los_day
║   ╠══ iv_day_1
║   ╠══ map_1st
║   ╠══ mort_day_censored
║   ╠══ pco2_first
║   ╠══ platelet_first
║   ╠══ po2_first
║   ╠══ potassium_first
║   ╠══ sapsi_first
║   ╠══ sodium_first
║   ╠══ sofa_first
║   ╠══ spo2_1st
║   ╠══ tco2_first
║   ╠══ temp_1st
║   ╠══ wbc_first
║   ╚══ weight_first
╚══ 🗂️ Categorical features
    ╠══ afib_flg (2 categories)
    ╠══ aline_flg (2 categories)
    ╠══ cad_flg (2 categories)
    ╠══ censor_flg (2 categories)
    ╠══ chf_flg (2 categories)
    ╠══ copd_flg (2 categories)
    ╠══ day_28_flg (2 categories)
    ╠══ day_icu_intime (7 categories)
    ╠══ day_icu_intime_num (7 categories)
    ╠══ gender_num (2 categories)
    ╠══ hosp_exp_flg (2 categories)
    ╠══ hour_icu_intime (24 categories)
    ╠══ icu_exp_flg (2 categories)
    ╠══ liver_flg (2 categories)
    ╠══ mal_flg (2 categories)
    ╠══ renal_flg (2 categories)
    ╠══ resp_flg (2 categories)
    ╠══ sepsis_flg (1 categories)
    ╠══ service_num (2 categories)
    ╠══ service_unit (3 categories)
    ╚══ stroke_flg (2 categories)
[6]:
%%capture
adata = ep.pp.encode(adata, autodetect=True)
ep.pp.knn_impute(adata, n_neighbours=5)
ep.pp.log_norm(adata, vars=["iv_day_1", "po2_first"], offset=1)
ep.pp.pca(adata)
ep.pp.neighbors(adata, n_pcs=10)
ep.tl.umap(adata)
ep.tl.leiden(adata, resolution=0.3, key_added="leiden_0_3")
2024-04-25 21:47:36,541 - root INFO - The original categorical values `['service_unit', 'day_icu_intime']` were added to uns.
2024-04-25 21:47:36,589 - root INFO - Updated the original layer after encoding.
2024-04-25 21:47:36,606 - root INFO - The original categorical values `['service_unit', 'day_icu_intime']` were added to obs.
[7]:
ep.settings.set_figure_params(figsize=(5, 4), dpi=100)
ep.pl.umap(adata, color=["leiden_0_3"], title="Leiden 0.3", size=20)
../../_images/tutorials_notebooks_mimic_2_fate_18_0.png

This UMAP embedding is exactly the same as the one previously computed in the MIMIC-II IAC introduction tutorial. Now we continue with the patient fate analysis.


Analysis using ehrapy and CellRank

Depending on the data it may not always be possible to clearly define a cluster or specific patient visits as the origin or terminus of a trajectory. Working with single-cell data simplifies matters since the detection of stem cells generally signifies the start of cell differentiation.

In this tutorial, we will define a patient cluster as the origin (root cluster) and explore possible terminal states.

Pseudotime calculation

As the root cluster for pseudotime calculation we choose cluster 0, since patients in that cluster do not yet show very severe comorbidities and features. Then we calculate the Diffusion Pseudotime with the ep.tl.dpt() function.

[8]:
adata.uns["iroot"] = np.flatnonzero(adata.obs["leiden_0_3"] == "0")[0]
ep.tl.dpt(adata)
WARNING: Trying to run `tl.dpt` without prior call of `tl.diffmap`. Falling back to `tl.diffmap` with default parameters.
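
The root index stored in adata.uns["iroot"] is simply the position of the first observation that belongs to cluster 0. The following minimal sketch illustrates the indexing step on its own; the label array is hypothetical and not taken from the dataset.

```python
import numpy as np

# Hypothetical cluster labels for six patient visits (strings, like Leiden labels).
labels = np.array(["1", "0", "2", "0", "5", "1"])

# Positions of all observations that belong to cluster "0".
cluster_zero_ixs = np.flatnonzero(labels == "0")

# The first of them is used as the root observation.
iroot = int(cluster_zero_ixs[0])
print(cluster_zero_ixs, iroot)
```

Picking the first visit of the root cluster is a convenience; the resulting pseudotime may change if a different root observation is chosen.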

Now we define the kernel, compute the transition matrix and plot a projection onto the UMAP.

Determining patient fate with a PseudotimeKernel

The PseudotimeKernel computes directed transition probabilities based on a KNN graph and pseudotime.

The KNN graph contains information about the (undirected) conductivities among observations (here patients), reflecting their similarity. Pseudotime can be used to either remove edges that point against the direction of increasing pseudotime, or to downweight them.
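
The edge-removal idea can be sketched with plain NumPy. This is a simplified illustration under stated assumptions, not CellRank's implementation: the connectivities and pseudotime values are hypothetical, every backward edge is dropped, and refinements of the actual kernel are omitted.

```python
import numpy as np

# Hypothetical symmetric KNN connectivities between four patient visits.
conn = np.array([
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
])
# Hypothetical pseudotime per visit (visit 0 is the earliest).
pseudotime = np.array([0.0, 0.3, 0.6, 1.0])

# Keep edge i -> j only if j is not earlier than i in pseudotime.
forward = pseudotime[None, :] >= pseudotime[:, None]
directed = conn * forward

# Visits without any forward edge get a self-loop so that every row can be normalized.
no_out = np.flatnonzero(directed.sum(axis=1) == 0)
directed[no_out, no_out] = 1.0

# Row-normalize to obtain transition probabilities.
T = directed / directed.sum(axis=1, keepdims=True)
print(T.round(2))
```

Downweighting instead of removing backward edges would replace the boolean mask with weights between 0 and 1.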

[9]:
from cellrank.kernels import PseudotimeKernel

pk = PseudotimeKernel(adata, time_key="dpt_pseudotime")
pk.compute_transition_matrix()
[9]:
PseudotimeKernel[n=1776, dnorm=False, scheme='hard', frac_to_keep=0.3]
[10]:
ep.settings.set_figure_params(figsize=(5, 4), dpi=100)
pk.plot_projection(basis="umap", color="leiden_0_3")
../../_images/tutorials_notebooks_mimic_2_fate_30_0.png

We observe two main trajectories originating from cluster 0 going to clusters 2 and 5. Let’s check the metadata again.

[11]:
ep.pl.umap(
    adata, color="censor_flg", title="Censored or Death (0 = death, 1 = censored)"
)
ep.pl.umap(
    adata,
    color="mort_day_censored",
    title="Day post ICU admission of censoring or death",
)
../../_images/tutorials_notebooks_mimic_2_fate_32_0.png
../../_images/tutorials_notebooks_mimic_2_fate_32_1.png

Cluster 2 consists of patients who died and had severe comorbidities, while cluster 5 includes patients with a high number of days between ICU admission and censoring or death.

Simulating transitions with random walks

CellRank makes it easy to simulate random walks starting from specific clusters. This allows us to visualize not only where the patients end up, but also roughly how many of them end up in which cluster after a defined number of iterations. We can either just start walking…

[12]:
pk.plot_random_walks(
    seed=0,
    n_sims=100,
    start_ixs={"leiden_0_3": ["0"]},
    legend_loc="right",
    dpi=100,
    show_progress_bar=False,
)
../../_images/tutorials_notebooks_mimic_2_fate_36_0.png

… or set a number of required hits in one or more terminal clusters. Here, a walk stops once it has hit cluster 2 or 5 on 50 successive steps.

[13]:
pk.plot_random_walks(
    seed=0,
    n_sims=100,
    start_ixs={"leiden_0_3": ["0"]},
    stop_ixs={"leiden_0_3": ["2", "5"]},
    successive_hits=50,
    legend_loc="right",
    dpi=100,
    show_progress_bar=False,
)
../../_images/tutorials_notebooks_mimic_2_fate_38_0.png

Black and yellow dots indicate random walk start and terminal patient visits, respectively.
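
The idea behind such a simulation can be sketched as follows. This is a minimal illustration on a hypothetical three-state chain, not CellRank's code; the function name random_walk and its parameters are assumptions made for this example.

```python
import numpy as np


def random_walk(T, start, max_iter, stop_states=(), successive_hits=0, seed=0):
    """Simulate one walk on the row-stochastic matrix T and return the visited states."""
    rng = np.random.default_rng(seed)
    path = [start]
    hits = 0
    for _ in range(max_iter):
        # Draw the next state from the transition probabilities of the current state.
        nxt = int(rng.choice(T.shape[0], p=T[path[-1]]))
        path.append(nxt)
        # Count consecutive visits to the stop states; reset when the walk leaves them.
        hits = hits + 1 if nxt in stop_states else 0
        if successive_hits and hits >= successive_hits:
            break
    return path


# Hypothetical chain: state 0 is the origin, state 2 is absorbing.
T = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])
path = random_walk(T, start=0, max_iter=200, stop_states={2}, successive_hits=5)
print(path[0], path[-1], len(path))
```

Repeating the walk with different seeds and counting the end states gives a rough picture of how many walks end up in which state.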

Determining macrostates and terminal states

To find the terminal states of cluster 0, we will use an estimator to predict the patient fates using the transition matrix calculated above. The main objective is to decompose the patient state space into a set of macrostates that represent the slow-time scale dynamics of the process, and to predict terminal states. Here, we will use a Generalized Perron Cluster Cluster Analysis (GPCCA) estimator.

As a first step we try to identify macrostates in the data using the fit() function.

[14]:
g = cr.estimators.GPCCA(pk)
g.fit(cluster_key="leiden_0_3")
g.macrostates_memberships
WARNING: Unable to import `petsc4py` or `slepc4py`. Using `method='brandts'`
WARNING: For `method='brandts'`, dense matrix is required. Densifying
[14]:
     1_1       1_2         5
0.258838  0.741059  0.000103
0.013380  0.539914  0.446706
0.026256  0.284458  0.689286
0.068509  0.931475  0.000016
0.018231  0.952886  0.028882
0.001933  0.888493  0.109574
0.009556  0.939496  0.050948
0.058656  0.919373  0.021971
0.013090  0.537058  0.449851
0.018596  0.981403  0.000001
     ...       ...       ...
0.075631  0.923467  0.000902
0.020713  0.979285  0.000002
0.004825  0.132661  0.862514
0.028327  0.139397  0.832277
0.135310  0.864639  0.000051
0.002995  0.895569  0.101435
0.099512  0.900452  0.000036
0.252466  0.747432  0.000102
0.012353  0.987621  0.000026
0.250594  0.743554  0.005853

1776 cells x 3 lineages
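
The intuition behind slow-time scale dynamics can be illustrated with the eigenvalues of a small transition matrix: groups of states that are rarely left produce eigenvalues close to 1. The sketch below uses a simple eigengap heuristic on a hypothetical matrix. It is not the GPCCA algorithm and only conveys why a certain number of macrostates can be a natural choice.

```python
import numpy as np

# Hypothetical transition matrix with two weakly coupled groups: {0, 1} and {2, 3}.
T = np.array([
    [0.90, 0.09, 0.01, 0.00],
    [0.09, 0.90, 0.00, 0.01],
    [0.01, 0.00, 0.90, 0.09],
    [0.00, 0.01, 0.09, 0.90],
])

# Eigenvalues sorted from largest to smallest (real here because T is symmetric).
eigvals = np.sort(np.linalg.eigvalsh(T))[::-1]

# The largest gap between consecutive eigenvalues suggests the number of macrostates.
gaps = eigvals[:-1] - eigvals[1:]
n_macrostates = int(np.argmax(gaps)) + 1
print(eigvals.round(2), n_macrostates)
```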

[15]:
g.predict_terminal_states()
g.plot_macrostates(which="terminal")
../../_images/tutorials_notebooks_mimic_2_fate_43_0.png
[16]:
g.plot_macrostates(which="terminal", same_plot=False)
../../_images/tutorials_notebooks_mimic_2_fate_44_0.png

As a next step we will calculate the fate probabilities. For each patient visit, this computes the probability of being absorbed in any of the terminal states by aggregating over all random walks that start in a given patient visit and end in some terminal population.

[17]:
g.compute_fate_probabilities(preconditioner="ilu", tol=1e-15)
g.plot_fate_probabilities()
WARNING: Unable to import petsc4py. For installation, please refer to: https://petsc4py.readthedocs.io/en/stable/install.html.
Defaulting to `'gmres'` solver.
WARNING: `3` solution(s) did not converge
../../_images/tutorials_notebooks_mimic_2_fate_46_3.png

The plot above combines fate probabilities towards all terminal states. Each patient visit is colored according to its most likely fate, and the color intensity reflects the degree of fate priming.

[18]:
g.plot_fate_probabilities(same_plot=False)
../../_images/tutorials_notebooks_mimic_2_fate_48_0.png
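
Fate probabilities of this kind are absorption probabilities of a Markov chain. For a small chain they can be computed directly by solving a linear system, as in the minimal sketch below. The transition matrix is hypothetical, and the direct solve is used only for illustration; as the output above indicates, CellRank uses an iterative solver.

```python
import numpy as np

# Hypothetical chain: states 0 and 1 are transient, states 2 and 3 are terminal.
T = np.array([
    [0.5, 0.3, 0.2, 0.0],
    [0.2, 0.4, 0.1, 0.3],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
transient, terminal = [0, 1], [2, 3]

Q = T[np.ix_(transient, transient)]  # transitions among transient states
R = T[np.ix_(transient, terminal)]   # transitions from transient to terminal states

# Absorption probabilities A solve the linear system (I - Q) A = R.
fate_probs = np.linalg.solve(np.eye(len(transient)) - Q, R)
print(fate_probs.round(3))
```

Each row belongs to one transient state and sums to 1, because every walk is eventually absorbed in one of the terminal states.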

We can also visualize the fate probabilities jointly in a circular projection, where each dot represents a patient visit colored by its cluster label. Patient visits are arranged inside the circle according to their fate probabilities: fate-biased visits are placed next to their corresponding corner, while visits with undetermined fate are placed in the middle.

[19]:
cr.pl.circular_projection(adata, keys="leiden_0_3", legend_loc="right")
../../_images/tutorials_notebooks_mimic_2_fate_50_0.png
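
The placement rule of such a projection can be sketched as a probability-weighted average of corner positions. This is a simplified illustration with hypothetical fate probabilities; CellRank's actual layout, including the ordering of the corners, may differ.

```python
import numpy as np

# Hypothetical fate probabilities of four visits towards three terminal states.
fate_probs = np.array([
    [1.0, 0.0, 0.0],        # fully committed to the first terminal state
    [0.0, 1.0, 0.0],        # fully committed to the second terminal state
    [0.0, 0.0, 1.0],        # fully committed to the third terminal state
    [1 / 3, 1 / 3, 1 / 3],  # undetermined fate
])

# Place one corner per terminal state, evenly spaced on the unit circle.
n_states = fate_probs.shape[1]
angles = 2 * np.pi * np.arange(n_states) / n_states
corners = np.column_stack([np.cos(angles), np.sin(angles)])

# Each visit is placed at the probability-weighted average of the corners.
coords = fate_probs @ corners
print(coords.round(3))
```

Committed visits land exactly on their corner, and the undetermined visit lands in the center of the circle.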

Identification of driver features

We uncover putative driver features by correlating fate probabilities with features using the compute_lineage_drivers() method. In other words, if a feature is systematically higher or lower in patient visits that are more or less likely to progress towards a given terminal state, respectively, then we call this feature a putative driver feature.
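
The underlying idea can be sketched as a plain Pearson correlation between each feature and the fate probability towards one terminal state. The data below are simulated and the feature names are hypothetical; this is an illustration of the principle, not CellRank's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visits = 200

# Hypothetical fate probabilities towards one terminal state.
fate_prob = rng.uniform(0, 1, size=n_visits)

# Hypothetical features: one rises with the fate probability, one falls, one is noise.
features = {
    "rising": fate_prob + rng.normal(0, 0.1, n_visits),
    "falling": -fate_prob + rng.normal(0, 0.1, n_visits),
    "noise": rng.normal(0, 1, n_visits),
}

# Pearson correlation of each feature with the fate probability.
corr = {
    name: float(np.corrcoef(values, fate_prob)[0, 1])
    for name, values in features.items()
}

# Rank features from most positively to most negatively correlated.
ranked = sorted(corr, key=corr.get, reverse=True)
print(ranked)
```

Features at both ends of the ranking are candidates: strongly positive correlations mark features that increase towards the terminal state, strongly negative ones mark features that decrease.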

We calculate these driver features for our three lineages 1_1, 1_2 and 5.

[20]:
%%capture
ep.settings.set_figure_params(figsize=(3, 3), dpi=100)
[21]:
drivers_1_1 = g.compute_lineage_drivers(lineages="1_1")
adata.obs["fate_probs_1_1"] = g.fate_probabilities["1_1"].X.flatten()

ep.pl.umap(
    adata,
    color=["fate_probs_1_1"] + list(drivers_1_1.index[:8]),
    color_map="viridis",
    s=50,
    ncols=3,
    vmax="p96",
)
../../_images/tutorials_notebooks_mimic_2_fate_54_0.png
[22]:
drivers_1_2 = g.compute_lineage_drivers(lineages="1_2")
adata.obs["fate_probs_1_2"] = g.fate_probabilities["1_2"].X.flatten()

ep.pl.umap(
    adata,
    color=["fate_probs_1_2"] + list(drivers_1_2.index[:8]),
    color_map="viridis",
    s=50,
    ncols=3,
    vmax="p96",
)
../../_images/tutorials_notebooks_mimic_2_fate_55_0.png
[23]:
drivers_5 = g.compute_lineage_drivers(lineages="5")
adata.obs["fate_probs_5"] = g.fate_probabilities["5"].X.flatten()

ep.pl.umap(
    adata,
    color=["fate_probs_5"] + list(drivers_5.index[:8]),
    color_map="viridis",
    s=50,
    ncols=3,
    vmax="p96",
)
../../_images/tutorials_notebooks_mimic_2_fate_56_0.png

Lineage 1_1 seems to contain many patients who died in hospital, were of high age and had a high platelet measurement. Lineage 1_2 consists of patients who died in hospital and had high first SAPS I and SOFA scores. Lineage 5 consists of patients with a high number of days after ICU release.

Conclusion

In this tutorial we applied CellRank and ehrapy to identify patient visit trajectories from selected root clusters, computed macrostates of clusters and pointed out features that are driving those trajectories. Following that, we visualized feature trends across the pseudotime for the patient trajectories. In particular, we inspected trajectories of patients originating from cluster 0, which was defined by less severe features, and identified 3 major trajectories. Two trajectories (1_1 and 1_2) terminated in a bad outcome cluster and were driven by severity features such as age, death and comorbidities.

As a next step, we suggest having a closer look at our survival analysis tutorial or going back to our tutorial overview page.


References

  • Raffa, J. (2016). Clinical data from the MIMIC-II database for a case study on indwelling arterial catheters (version 1.0). PhysioNet. https://doi.org/10.13026/C2NC7F.

  • Raffa J.D., Ghassemi M., Naumann T., Feng M., Hsu D. (2016) Data Analysis. In: Secondary Analysis of Electronic Health Records. Springer, Cham

  • Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., … & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

  • Marius Lange, Volker Bergen, Michal Klein, Manu Setty, Bernhard Reuter, Mostafa Bakhti, Heiko Lickert, Meshal Ansari, Janine Schniering, Herbert B. Schiller, Dana Pe’er, and Fabian J. Theis. Cellrank for directed single-cell fate mapping. Nat. Methods, 2022. doi:10.1038/s41592-021-01346-6.

  • Lars Velten, Simon F. Haas, Simon Raffel, Sandra Blaszkiewicz, Saiful Islam, Bianca P. Hennig, Christoph Hirche, Christoph Lutz, Eike C. Buss, Daniel Nowak, Tobias Boch, Wolf-Karsten Hofmann, Anthony D. Ho, Wolfgang Huber, Andreas Trumpp, Marieke A. G. Essers, and Lars M. Steinmetz. Human haematopoietic stem cell lineage commitment is a continuous process. Nature Cell Biology, 19(4):271–281, 2017. doi:10.1038/ncb3493.

  • Bergen, V., Lange, M., Peidli, S. et al. Generalizing RNA velocity to transient cell states through dynamical modeling. Nat Biotechnol 38, 1408–1414 (2020). https://doi.org/10.1038/s41587-020-0591-3

  • Haghverdi, L., Büttner, M., Wolf, F. et al. Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods 13, 845–848 (2016). https://doi.org/10.1038/nmeth.3971


Package versions

[26]:
ep.print_versions()
-----
ehrapy              0.7.0
rich                NA
session_info        1.0.0
-----
Cython                      3.0.6
PIL                         10.3.0
absl                        NA
aiohttp                     3.9.1
aiosignal                   1.3.1
anndata                     0.10.3
annotated_types             0.6.0
anyio                       NA
arrow                       1.3.0
astor                       0.8.1
asttokens                   NA
attr                        23.1.0
attrs                       23.1.0
autograd                    NA
autograd_gamma              NA
babel                       2.14.0
backoff                     2.2.1
bs4                         4.12.2
cachetools                  5.3.2
category_encoders           2.6.3
causallearn                 NA
cellrank                    2.0.4
certifi                     2023.11.17
cffi                        1.16.0
chardet                     5.2.0
charset_normalizer          3.3.2
chex                        0.1.86
click                       8.1.7
comm                        0.2.1
contextlib2                 NA
croniter                    NA
cvxopt                      1.3.2
cycler                      0.12.1
cython                      3.0.6
cython_runtime              NA
dateutil                    2.8.2
db_dtypes                   1.2.0
debugpy                     1.8.0
decorator                   5.1.1
deep_translator             1.9.1
deepdiff                    6.7.1
deepl                       1.16.1
defusedxml                  0.7.1
dill                        0.3.7
docrep                      0.3.2
dot_parser                  NA
dowhy                       0.11
etils                       1.6.0
executing                   2.0.1
faiss                       1.8.0
fastapi                     0.105.0
fastjsonschema              NA
fhiry                       3.2.2
fknni                       1.1.0
flax                        0.8.2
formulaic                   0.6.6
fqdn                        NA
frozenlist                  1.4.1
fsspec                      2023.12.2
future                      0.18.3
google                      NA
graphviz                    0.20.1
h5py                        3.10.0
idna                        3.6
igraph                      0.10.8
imblearn                    0.12.2
importlib_resources         NA
interface_meta              1.3.0
ipykernel                   6.28.0
ipywidgets                  8.1.2
isoduration                 NA
jax                         0.4.26
jaxlib                      0.4.26
jedi                        0.19.1
jinja2                      3.0.3
joblib                      1.4.0
json5                       0.9.24
jsonpointer                 2.4
jsonschema                  4.21.1
jsonschema_specifications   NA
jupyter_events              0.10.0
jupyter_server              2.13.0
jupyterlab_server           2.26.0
kiwisolver                  1.4.5
legacy_api_wrap             NA
leidenalg                   0.10.1
lifelines                   0.27.8
lightning                   2.0.9.post0
lightning_cloud             0.5.57
lightning_utilities         0.10.0
llvmlite                    0.41.1
markupsafe                  2.1.3
matplotlib                  3.8.4
matplotlib_inline           0.1.6
missingno                   0.5.2
ml_collections              NA
ml_dtypes                   0.3.1
mpl_toolkits                NA
mpmath                      1.3.0
msgpack                     1.0.7
mudata                      0.2.3
multidict                   6.0.4
multipart                   0.0.6
multipledispatch            0.6.0
natsort                     8.4.0
nbformat                    5.10.4
networkx                    3.2.1
numba                       0.58.1
numpy                       1.26.4
numpyro                     0.13.2
opt_einsum                  v3.3.0
optax                       0.1.7
ordered_set                 4.1.0
overrides                   NA
packaging                   23.2
pandas                      2.2.2
parso                       0.8.3
patsy                       0.5.4
pexpect                     4.9.0
pickleshare                 0.7.5
pkg_resources               NA
platformdirs                4.1.0
progressbar                 4.3.2
prometheus_client           NA
prompt_toolkit              3.0.43
psutil                      5.9.7
ptyprocess                  0.7.0
pure_eval                   0.2.2
pyarrow                     14.0.2
pyasn1                      0.5.1
pyasn1_modules              0.3.0
pycparser                   2.22
pydantic                    2.1.1
pydantic_core               2.4.0
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.9.5
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pydot                       1.4.2
pygam                       0.8.0
pygments                    2.17.2
pygpcca                     1.0.4
pynndescent                 0.5.11
pyparsing                   3.1.2
pyro                        1.8.6
python_utils                NA
pythonjsonlogger            NA
pytz                        2024.1
rapidfuzz                   3.5.2
referencing                 NA
requests                    2.31.0
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
rpds                        NA
rsa                         4.9
scanpy                      1.10.1
scipy                       1.13.0
scvelo                      0.3.2
scvi                        1.1.2
seaborn                     0.13.2
send2trash                  NA
setuptools                  68.2.2
six                         1.16.0
sklearn                     1.4.2
sniffio                     1.3.0
soupsieve                   2.5
sparse                      0.14.0
stack_data                  0.6.3
starlette                   0.27.0
statsmodels                 0.14.1
swig_runtime_data4          NA
sympy                       1.12
tableone                    0.8.0
tabulate                    0.9.0
texttable                   1.7.0
thefuzz                     0.20.0
threadpoolctl               3.4.0
toolz                       0.12.0
torch                       2.1.2+cu121
torchgen                    NA
torchmetrics                1.2.1
tornado                     6.4
tqdm                        4.66.1
traitlets                   5.14.1
typing_extensions           NA
umap                        0.5.5
uri_template                NA
urllib3                     2.0.7
uvicorn                     0.24.0.post1
vscode                      NA
wcwidth                     0.2.13
webcolors                   1.13
websocket                   1.7.0
websockets                  12.0
wrapt                       1.16.0
yaml                        6.0.1
yarl                        1.9.4
zmq                         25.1.2
-----
IPython             8.20.0
jupyter_client      8.6.0
jupyter_core        5.7.1
jupyterlab          4.1.6
notebook            7.1.2
-----
Python 3.11.7 | packaged by conda-forge | (main, Dec 15 2023, 08:38:37) [GCC 12.3.0]
Linux-6.8.7-arch1-1-x86_64-with-glibc2.39
-----
Session information updated at 2024-04-25 21:59