Note
This page was generated from ehrapy_introduction.ipynb. Some tutorial content may look better in light mode.
Introduction to ehrapy
[1]:
from IPython.display import Image
Welcome to ehrapy!
ehrapy is a framework for the exploratory and targeted end-to-end analysis of complex electronic health record (EHR) datasets, inspired by the biological omics world. In this framework, data points are not necessarily treated as complete patients, but as patient visits that represent snapshots of the underlying system. The goal of an exploratory analysis is not necessarily to predict or classify a specific state, but to understand the system underlying the data manifold.
ehrapy is not a pure machine learning library or a pure statistics library, but a framework providing simplified access to fundamental algorithms to preprocess, visualize and analyze EHR data.
Fundamental Principles
One of the main advantages of ehrapy is that EHR datasets can be analyzed from beginning to end with a clear, but flexible, order of operations.
[2]:
Image(filename="images/ehrapy_overview.png", width=800)
[2]:
ehrapy borrows a lot from the single-cell world and the scverse ecosystem. Notably, ehrapy uses the same data structure (AnnData) and many of the same fundamental algorithms (scanpy). Both are briefly introduced in the following subsections.
AnnData
AnnData is short for Annotated Data and is the primary data structure used within ehrapy. Technically, it is a Python package for handling annotated data matrices in memory and on disk, positioned between Pandas and xarray. AnnData offers a broad range of computationally efficient features including, among others, sparse data support, lazy operations, and a PyTorch interface. From a user's perspective, it is based on the idea of a primary 2D matrix X of, for example, dimensions n_patient_visits x n_features. The patient visits are then the observations (obs) and the features are the variables (var). AnnData allows us to annotate this matrix with respect to either the observations or the variables. Furthermore, AnnData allows graph-like structures (obsp, varp), further structured matrices (obsm, varm) and unstructured data (uns) to be saved within the same object. These can then be readily used by various machine learning algorithms.
Visualized, it looks like this:
[3]:
Image(filename="images/anndata_schema.jpg", width=800)
[3]:
Let us create an example AnnData object as it would be used in ehrapy.
[4]:
import anndata as ad
import pandas as pd
import numpy as np
After importing the required packages, we create an example dataset with a patient_visit_id column and some feature columns such as age, b12_level and d3_level. We further add a service_unit column that we do not want to include as data for our algorithms, but only as annotations.
[5]:
data = {
    "patient_visit_id": [0, 1, 2],
    "age": [59, 24, 64],
    "b12_level": [560, 201, 450],
    "d3_level": [25, 19, 50],
    "service_unit": ["NY", "NY", "BO"],
}
df = pd.DataFrame(data)
[6]:
df
[6]:
 | patient_visit_id | age | b12_level | d3_level | service_unit |
---|---|---|---|---|---|
0 | 0 | 59 | 560 | 25 | NY |
1 | 1 | 24 | 201 | 19 | NY |
2 | 2 | 64 | 450 | 50 | BO |
Next, we import ehrapy and create an AnnData object from this Pandas DataFrame. Usually, EHR data comes in the form of csv/tsv tables that can be read directly into ehrapy using ep.io.read_csv(). For the sake of this example, we transform an existing Pandas DataFrame into an AnnData object using the df_to_anndata function. Note that it has an index_column parameter to set the index and a columns_obs_only parameter that denotes features which should not be part of the X matrix but of the obs annotations. This allows us to, for example, color plots by service_unit without using these values in algorithms.
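The split described above can be sketched with pandas alone. This is only an illustration of the idea behind index_column and columns_obs_only, not ehrapy's implementation; the names indexed, obs_only, obs_sketch and X_sketch are assumptions.

```python
import pandas as pd

df_sketch = pd.DataFrame(
    {
        "patient_visit_id": [0, 1, 2],
        "age": [59, 24, 64],
        "b12_level": [560, 201, 450],
        "d3_level": [25, 19, 50],
        "service_unit": ["NY", "NY", "BO"],
    }
)

# Analogue of index_column: the visit id labels the observations.
indexed = df_sketch.set_index("patient_visit_id")

# Analogue of columns_obs_only: annotation columns kept out of the matrix.
obs_only = ["service_unit"]
obs_sketch = indexed[obs_only]

# Everything else becomes the numeric matrix and its feature names.
features = indexed.drop(columns=obs_only)
X_sketch = features.to_numpy(dtype=float)
feature_names = features.columns.tolist()

print(feature_names)
print(X_sketch.shape)
```

Keeping service_unit outside the matrix means it stays available for coloring or grouping while numeric algorithms only ever see age, b12_level and d3_level.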
[7]:
import ehrapy as ep
[8]:
adata = ep.ad.df_to_anndata(
    df, index_column="patient_visit_id", columns_obs_only=["service_unit"]
)
[9]:
adata
[9]:
AnnData object with n_obs × n_vars = 3 × 3
    obs: 'service_unit'
    var: 'ehrapy_column_type'
    layers: 'original'
When examining our AnnData object, we notice that it has a matrix of size 3 x 3, which corresponds to the age, B12 and D3 measurements of our three patient visits.
[10]:
adata.obs
[10]:
patient_visit_id | service_unit |
---|---|
0 | NY |
1 | NY |
2 | BO |
Furthermore, our obs has the service unit as expected. The AnnData object also has data in the uns (unstructured) slot that denotes which columns are numerical and which ones are not. This may be required for specific algorithms.
[11]:
adata.uns
[11]:
Finally, the layers slot of our object saves all original values, before any modifications, in original. When using ehrapy, the X matrix is constantly modified as algorithms (e.g. scaling) are applied to the object. This layer is a copy of our original X, which allows us to, for example, scale the age but use the original values when coloring a UMAP plot.
[12]:
adata.layers["original"]
[12]:
array([[ 59., 560.,  25.],
       [ 24., 201.,  19.],
       [ 64., 450.,  50.]])
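Why a copy of the original values is useful can be sketched with NumPy alone. The plain dict layers_sketch stands in for adata.layers, and z-score scaling is chosen here only as one example of a modification to X; both are assumptions for illustration.

```python
import numpy as np

X_values = np.array(
    [[59.0, 560.0, 25.0], [24.0, 201.0, 19.0], [64.0, 450.0, 50.0]]
)

# Keep an untouched copy before modifying anything.
layers_sketch = {"original": X_values.copy()}

# Z-score scaling per feature (column): zero mean, unit standard deviation.
X_scaled = (X_values - X_values.mean(axis=0)) / X_values.std(axis=0)

# The scaled ages are no longer interpretable as years ...
print(X_scaled[:, 0])
# ... but the original ages are still available, e.g. for coloring a plot.
print(layers_sketch["original"][:, 0])
```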
For more details please examine the AnnData documentation and the AnnData paper.
scanpy
scanpy is a framework for the analysis of single-cell data and ehrapy heavily builds upon it. While some of the implemented algorithms are single-cell specific (e.g. the highly_variable_genes function), many can be applied to any data (e.g. PCA or UMAP). ehrapy may also implement equivalents of single-cell specific functions that are EHR specific (e.g. the highly_variable_features function). All useful scanpy functions are wrapped in ehrapy to ensure that they are easily accessible and implemented in a fast and scalable way.
[13]:
Image(filename="images/scanpy.jpg", width=800)
[13]:
Like scanpy, ehrapy organizes its API into preprocessing (pp), tools (tl) and plots (pl). Hence, scanpy functions such as scanpy.tl.umap can be used from ehrapy in a similar fashion: ep.tl.umap.
The documentation of ehrapy tries to hide as many details from the single-cell world as possible, but you may see the terms cell, gene or expression pop up somewhere. However, the tight integration of AnnData and scanpy into ehrapy also allows for the joint analysis of omics data and EHR data. We will provide a vignette for this in the future.
To learn more about scanpy please read the scanpy documentation and the scanpy paper.
ehrapy
Now that we’ve covered the basics of AnnData and scanpy and we have an example dataset, we can apply some of ehrapy’s tools to it. We will start by calculating and visualizing a PCA on our data.
[14]:
ep.pp.pca(adata)
[15]:
ep.pl.pca(adata, color="service_unit")
This is of course not a useful analysis since we only have three visits.
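To see why three visits are too few, PCA can be sketched with a singular value decomposition in NumPy. This is an illustrative sketch and is not claimed to match what ep.pp.pca does internally; the variable names are assumptions. After centering, three observations span at most two directions, so the third component carries no variance.

```python
import numpy as np

X_visits = np.array(
    [[59.0, 560.0, 25.0], [24.0, 201.0, 19.0], [64.0, 450.0, 50.0]]
)

# Center each feature; PCA describes variation around the mean.
X_centered = X_visits - X_visits.mean(axis=0)

# SVD of the centered matrix: rows of Vt are the principal axes.
U, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance explained by each component and its share of the total.
explained_variance = singular_values**2 / (X_visits.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Coordinates of each visit in principal component space.
pc_scores = U * singular_values

print(explained_variance_ratio)
```

The last entry of explained_variance_ratio is zero up to floating point error, so any plot of these three visits is fully determined by two components.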
Making your dataset ready for ehrapy
Data types
ehrapy requires data to be in a two-dimensional, vectorized format, meaning that anything that could be stored in a single Pandas DataFrame is suitable. It does not matter whether the data originally came from a database or several CSV files. Each column should hold values of a single type, as the following two examples illustrate.
[16]:
# This is NOT okay
data = {
    "MixedColumn1": ["Apple", 10, "Banana", 20],
    "MixedColumn2": [15, "Cherry", 5, "Date"],
}
df = pd.DataFrame(data)
df
[16]:
 | MixedColumn1 | MixedColumn2 |
---|---|---|
0 | Apple | 15 |
1 | 10 | Cherry |
2 | Banana | 5 |
3 | 20 | Date |
[17]:
# This is okay
data = {
    "Column1": ["Apple", "Banana", "Cherry", "Date"],
    "Column2": [10, 20, 15, 5],
    "Column3": [True, False, True, False],
    "Column4": [
        pd.Timestamp("2023-08-01"),
        pd.Timestamp("2023-08-15"),
        pd.Timestamp("2023-08-10"),
        pd.Timestamp("2023-08-05"),
    ],
    "Column5": pd.Categorical(["dead", "alive", "dead", "dead"]),
}
df = pd.DataFrame(data)
df
[17]:
 | Column1 | Column2 | Column3 | Column4 | Column5 |
---|---|---|---|---|---|
0 | Apple | 10 | True | 2023-08-01 | dead |
1 | Banana | 20 | False | 2023-08-15 | alive |
2 | Cherry | 15 | True | 2023-08-10 | dead |
3 | Date | 5 | False | 2023-08-05 | dead |
[18]:
adata = ep.ad.df_to_anndata(df)
adata
[18]:
AnnData object with n_obs × n_vars = 4 × 5
    var: 'ehrapy_column_type'
    layers: 'original'
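Mixed-type columns such as those in the first example can be detected before conversion. The helper below is a hypothetical convenience function written for this sketch; it is not part of ehrapy or pandas.

```python
import pandas as pd


def find_mixed_type_columns(frame):
    """Map each column holding more than one Python type to those type names."""
    mixed = {}
    for column in frame.columns:
        # Collect the Python type of every non-missing value in the column.
        value_types = {
            type(value).__name__ for value in frame[column].dropna().tolist()
        }
        if len(value_types) > 1:
            mixed[column] = sorted(value_types)
    return mixed


not_okay = pd.DataFrame(
    {
        "MixedColumn1": ["Apple", 10, "Banana", 20],
        "MixedColumn2": [15, "Cherry", 5, "Date"],
    }
)
okay = pd.DataFrame(
    {"Column1": ["Apple", "Banana", "Cherry", "Date"], "Column2": [10, 20, 15, 5]}
)

print(find_mixed_type_columns(not_okay))
print(find_mixed_type_columns(okay))
```

An empty result suggests that every column has a consistent type and is likely safe to pass on.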
Feature groups
For many analyses with ehrapy, it is useful to group together features that belong to the same data modality. Examples are high-level groups such as demographics, lab measurements or vital signs. This allows for simpler groupbys or the creation of subsets:
[19]:
data = {
    "gender": pd.Categorical(["male", "female", "female", "male"]),
    "age": [10, 20, 15, 5],
    "b12": [300, 600, 800, 500],
    "d3": [25, 30, 28, 21],
}
df = pd.DataFrame(data)
df
[19]:
 | gender | age | b12 | d3 |
---|---|---|---|---|
0 | male | 10 | 300 | 25 |
1 | female | 20 | 600 | 30 |
2 | female | 15 | 800 | 28 |
3 | male | 5 | 500 | 21 |
[20]:
adata = ep.ad.df_to_anndata(df)
adata.var
[20]:
 | ehrapy_column_type |
---|---|
gender | non_numeric |
age | numeric |
b12 | numeric |
d3 | numeric |
[21]:
demographics_features = ["age", "gender"]
lab_measurements_features = ["b12", "d3"]
# Assign the measurement groups to features in .var
measurement_group = []
for feature in adata.var_names:
    if feature in demographics_features:
        measurement_group.append("demographics")
    elif feature in lab_measurements_features:
        measurement_group.append("lab_measurements")
adata.var["measurement_group"] = measurement_group
[22]:
adata_demographics = adata[:, adata.var["measurement_group"] == "demographics"]
adata_demographics
[22]:
View of AnnData object with n_obs × n_vars = 4 × 2
    var: 'ehrapy_column_type', 'measurement_group'
    layers: 'original'
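The loop above only produces a list of the right length when every feature belongs to one of the groups. A dictionary lookup with a default avoids that pitfall. The sketch below uses a plain pandas DataFrame as a stand-in for adata.var, and the label "unassigned" is an assumption made for this example.

```python
import pandas as pd

# Stand-in for adata.var: one row per feature.
var_sketch = pd.DataFrame(
    {"ehrapy_column_type": ["non_numeric", "numeric", "numeric", "numeric"]},
    index=["gender", "age", "b12", "d3"],
)

# Feature-to-group mapping; "d3" is left out on purpose to show the default.
feature_to_group = {
    "age": "demographics",
    "gender": "demographics",
    "b12": "lab_measurements",
}

# Every feature gets a label, so the column always matches the index length.
var_sketch["measurement_group"] = [
    feature_to_group.get(feature, "unassigned") for feature in var_sketch.index
]

# Select the feature names of one group, e.g. for subsetting columns.
is_demographic = var_sketch["measurement_group"] == "demographics"
demographic_features = var_sketch.index[is_demographic].tolist()

print(var_sketch)
print(demographic_features)
```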
Units
EHR measurements are recorded in specific units that are ideally stored with the measurements:
[23]:
data = {
    "gender [categorical]": pd.Categorical(["male", "female", "female", "male"]),
    "age [years]": [10, 20, 15, 5],
    "b12 [pg/mL]": [300, 600, 800, 500],
    "d3 [ng/mL]": [25, 30, 28, 21],
}
df = pd.DataFrame(data)
df
[23]:
 | gender [categorical] | age [years] | b12 [pg/mL] | d3 [ng/mL] |
---|---|---|---|---|
0 | male | 10 | 300 | 25 |
1 | female | 20 | 600 | 30 |
2 | female | 15 | 800 | 28 |
3 | male | 5 | 500 | 21 |
[24]:
adata = ep.ad.df_to_anndata(df)
# Extract feature names and units from var_names and store separately
feature_names = [var_name.split("[")[0].strip() for var_name in adata.var_names]
unit_annotations = [
    var_name.split("[")[-1][:-1] if "[" in var_name else ""
    for var_name in adata.var_names
]
# Update .var with feature names and units separately
adata.var_names = feature_names
adata.var["units"] = unit_annotations
[25]:
adata.var
[25]:
 | ehrapy_column_type | units |
---|---|---|
gender | non_numeric | categorical |
age | numeric | years |
b12 | numeric | pg/mL |
d3 | numeric | ng/mL |
[26]:
d3_unit = adata.var["units"]["d3"]
print(f"Unit of 'd3': {d3_unit}")
Unit of 'd3': ng/mL
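The string splitting used above assumes that every unit is the last thing in the column name. A regular expression makes that assumption explicit and tolerates stray whitespace. The function split_name_and_unit and the unit-less column "weight" are hypothetical and only serve this sketch.

```python
import re

# Matches names such as "b12 [pg/mL]"; the bracketed unit is optional.
NAME_WITH_UNIT = re.compile(
    r"^(?P<name>[^\[]+?)\s*(?:\[(?P<unit>[^\]]*)\])?\s*$"
)


def split_name_and_unit(column_name):
    """Return (feature_name, unit); unit is "" when no bracket is present."""
    match = NAME_WITH_UNIT.match(column_name)
    if match is None:
        raise ValueError(f"Cannot parse column name: {column_name!r}")
    return match.group("name").strip(), (match.group("unit") or "").strip()


columns = [
    "gender [categorical]",
    "age [years]",
    "b12 [pg/mL]",
    "d3 [ng/mL]",
    "weight",  # hypothetical column without a unit
]
parsed = [split_name_and_unit(column) for column in columns]

for name, unit in parsed:
    print(f"{name}: {unit or 'no unit'}")
```

Column names that do not fit the pattern raise an error instead of silently producing a truncated unit.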
Conclusion
To get started, check out the MIMIC-II introduction tutorial, where you will learn to apply ehrapy to a real dataset over multiple notebooks to investigate the effect of indwelling arterial catheters on patient survival.
Please also consider consulting the ehrapy API documentation.