Usage#

Import the ehrapy API as follows:

import ehrapy as ep

You can then access the respective modules like:

ep.pl.cool_fancy_plot()

Reading and writing#

io.read_csv

Reads or downloads a desired directory of csv/tsv files or a single csv/tsv file.

io.read_h5ad

Reads or downloads a desired directory of h5ad files or a single h5ad file.

io.read_fhir

Reads one or multiple FHIR files using fhiry.

io.write

Write AnnData objects to file.

Data#

data.mimic_2

Loads the MIMIC-II dataset.

data.mimic_2_preprocessed

Loads the preprocessed MIMIC-II dataset.

data.mimic_3_demo

Loads the MIMIC-III demo dataset as a dictionary of Pandas DataFrames.

data.diabetes_130

Loads the Diabetes-130 dataset.

data.heart_failure

Loads the heart failure dataset.

data.chronic_kidney_disease

Loads the Chronic Kidney Disease dataset.

data.breast_tissue

Loads the Breast Tissue Data Set.

data.cervical_cancer_risk_factors

Loads the Cervical cancer (Risk Factors) Data Set.

data.dermatology

Loads the Dermatology Data Set.

data.echocardiogram

Loads the Echocardiogram Data Set.

data.heart_disease

Loads the Heart Disease Data Set.

data.hepatitis

Loads the Hepatitis Data Set.

data.statlog_heart

Loads the Statlog (Heart) Data Set.

data.thyroid

Loads the Thyroid Data Set.

data.breast_cancer_coimbra

Loads the Breast Cancer Coimbra Data Set.

data.parkinson_dataset_with_replicated_acoustic_features

Loads the Parkinson Dataset with replicated acoustic features Data Set.

data.parkinsons

Loads the Parkinsons Data Set.

data.parkinsons_disease_classification

Loads the Parkinson's Disease Classification Data Set.

data.parkinsons_telemonitoring

Loads the Parkinsons Telemonitoring Data Set.

Preprocessing#

Any transformation of the data matrix that is not a tool. Unlike tools, preprocessing steps usually don’t return an easily interpretable annotation but perform a basic transformation on the data matrix.

Basic preprocessing#

preprocessing.pca

Computes a principal component analysis.

preprocessing.regress_out

Regress out (mostly) unwanted sources of variation.

preprocessing.subsample

Subsample to a fraction of the number of observations.

preprocessing.highly_variable_features

Annotate highly variable features.

preprocessing.winsorize

Returns a Winsorized version of the input array.

preprocessing.clip_quantile

Clips (limits) features.
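To make "limiting" values concrete, the sketch below clips a list of measurements to bounds taken from its own quantiles, which is the general idea behind winsorizing and quantile clipping. It is a pure-Python illustration, not ehrapy's implementation: the helper name `clip_to_quantiles`, the nearest-rank quantile rule, and the example heart rates are all assumptions made for this example.

```python
import math


def clip_to_quantiles(values, lower=0.05, upper=0.95):
    """Limit values to the [lower, upper] quantile range (hypothetical helper)."""
    if not values:
        return []
    ordered = sorted(values)
    n = len(ordered)

    def quantile(q):
        # Nearest-rank rule: smallest value with at least q * n values at or below it.
        rank = max(1, math.ceil(q * n))
        return ordered[rank - 1]

    low, high = quantile(lower), quantile(upper)
    # Values outside the bounds are replaced by the bound, not dropped.
    return [min(max(v, low), high) for v in values]


# Made-up heart rates with two implausible entries (300 and 5).
heart_rates = [62, 70, 68, 75, 300, 71, 66, 5, 73, 69]
clipped = clip_to_quantiles(heart_rates, lower=0.2, upper=0.8)
```

The number of observations is unchanged; only the extreme values are pulled in to the chosen bounds.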

Quality control#

preprocessing.qc_metrics

Calculates various quality control metrics.

preprocessing.qc_lab_measurements

Examines lab measurements for reference ranges and outliers.
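This page does not list which quality control metrics are computed. As one typical example, the sketch below counts missing values per feature and reports them as a percentage. It is a standalone illustration, not ehrapy's code: the helper `missing_value_summary`, the use of `None` as the missing marker, and the sample rows are assumptions.

```python
def missing_value_summary(rows, feature_names):
    """Return {feature: (missing_count, missing_percent)} for row-major data."""
    n_obs = len(rows)
    summary = {}
    for j, name in enumerate(feature_names):
        # Count observations where this feature was not recorded.
        missing = sum(1 for row in rows if row[j] is None)
        percent = 100.0 * missing / n_obs if n_obs else 0.0
        summary[name] = (missing, percent)
    return summary


# Made-up rows: one observation per row, one feature per column.
rows = [
    [54, 7.1, None],
    [61, None, None],
    [47, 6.4, 1.2],
    [70, 6.9, None],
]
qc = missing_value_summary(rows, ["age", "hba1c", "lactate"])
```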

Imputation#

preprocessing.explicit_impute

Replaces all missing values in all columns or a subset of columns specified by the user with the passed replacement value.

preprocessing.simple_impute

Impute missing values in numerical data using mean/median/most frequent imputation.

preprocessing.knn_impute

Imputes missing values in the input AnnData object using K-nearest neighbor imputation.

preprocessing.miss_forest_impute

Impute data using the MissForest strategy.

preprocessing.soft_impute

Impute data using SoftImpute.

preprocessing.iterative_svd_impute

Impute missing values in an AnnData object using the IterativeSVD algorithm.

preprocessing.matrix_factorization_impute

Impute data using MatrixFactorization.

preprocessing.nuclear_norm_minimization_impute

Impute data using NuclearNormMinimization.

preprocessing.mice_forest_impute

Impute data using miceforest.
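The simplest of the strategies above, mean imputation, can be sketched in a few lines. This is a conceptual illustration in plain Python rather than ehrapy's implementation: the helper `mean_impute`, the use of `None` for missing entries, and the example values are assumptions.

```python
from statistics import mean


def mean_impute(column):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in column]


# Made-up creatinine values with two missing measurements.
creatinine = [1.0, None, 1.4, 0.9, None]
imputed = mean_impute(creatinine)
```

Observed values are left untouched; only the gaps are filled, so the column mean is preserved while its variance shrinks.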

Encoding#

preprocessing.encode

Encode categoricals of an AnnData object.

preprocessing.undo_encoding

Undo the current encodings applied to all columns in X.
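As an illustration of what encoding and undoing an encoding mean, the sketch below one-hot encodes a categorical column and then reverses the step. One-hot encoding is picked here as a common example. The helpers `one_hot_encode` and `undo_one_hot` are hypothetical and do not reflect how ehrapy stores encodings internally.

```python
def one_hot_encode(values):
    """Return (categories, rows) where each row has a single 1 marking its category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def undo_one_hot(categories, rows):
    """Recover the original labels from one-hot rows."""
    return [categories[row.index(1)] for row in rows]


# Made-up categorical feature.
admission_type = ["elective", "emergency", "urgent", "emergency"]
categories, encoded = one_hot_encode(admission_type)
restored = undo_one_hot(categories, encoded)
```

Keeping the category list next to the encoded rows is what makes the step reversible.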

Normalization#

preprocessing.log_norm

Apply log normalization.

preprocessing.maxabs_norm

Apply max-abs normalization.

preprocessing.minmax_norm

Apply min-max normalization.

preprocessing.power_norm

Apply power transformation normalization.

preprocessing.quantile_norm

Apply quantile normalization.

preprocessing.robust_scale_norm

Apply robust scaling normalization.

preprocessing.scale_norm

Apply scaling normalization.

preprocessing.sqrt_norm

Apply square root normalization.

preprocessing.offset_negative_values

Offsets negative values into positive ones with the lowest negative value becoming 0.
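Two of the transformations above are simple enough to write out by hand. The sketch below shows min-max scaling to the range [0, 1] and the offset of negative values so that the lowest one becomes 0. It is a plain-Python illustration of the arithmetic only. The helper names `minmax_scale_sketch` and `shift_to_non_negative` are hypothetical, and ehrapy's own functions operate on AnnData objects rather than lists.

```python
def minmax_scale_sketch(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        # A constant feature carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def shift_to_non_negative(values):
    """Shift values so the lowest negative value becomes 0; leave others as-is."""
    lowest = min(values)
    if lowest >= 0:
        return list(values)
    return [v - lowest for v in values]


scaled = minmax_scale_sketch([10, 20, 30])
shifted = shift_to_non_negative([-3, 0, 2])
```

Shifting negative values first is useful before transformations such as a logarithm or square root, which are not defined for negative inputs.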

Dataset Shift Correction#

Partially overlaps with dataset integration. Note that a simple batch correction method is available via pp.regress_out().

preprocessing.combat

ComBat function for batch effect correction [Johnson07] [Leek12] [Pedersen12].

Neighbors#

preprocessing.neighbors

Compute a neighborhood graph of observations [McInnes18].
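Conceptually, a neighborhood graph links each observation to the observations most similar to it. The sketch below builds an exact k-nearest-neighbor graph with Euclidean distance on a handful of points. It is a simplified stand-in: the method cited above uses approximate search and weighted connectivities, and the helper `knn_graph` and the sample points are assumptions for illustration.

```python
import math


def knn_graph(points, k):
    """Return {index: [indices of the k nearest other points]} by Euclidean distance."""
    graph = {}
    for i, p in enumerate(points):
        # Distance from point i to every other point, paired with that point's index.
        others = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        graph[i] = [j for _, j in sorted(others)[:k]]
    return graph


# Made-up observations in a two-dimensional feature space.
patients = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.2)]
neighbors = knn_graph(patients, k=2)
```

Embedding and clustering tools work on a graph of this kind rather than on the raw data matrix.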

Tools#

Any transformation of the data matrix that is not preprocessing. In contrast to a preprocessing function, a tool usually adds an easily interpretable annotation to the data matrix, which can then be visualized with a corresponding plotting function.

Embeddings#

tools.pca

Computes a principal component analysis.

tools.tsne

Calculates t-SNE [Maaten08] [Amir13] [Pedregosa11].

tools.umap

Embed the neighborhood graph using UMAP [McInnes18].

tools.draw_graph

Force-directed graph drawing [Islam11] [Jacomy14] [Chippada18].

tools.diffmap

Diffusion Maps [Coifman05] [Haghverdi15] [Wolf18].

tools.embedding_density

Calculate the density of observations in an embedding (per condition).

Clustering and trajectory inference#

tools.leiden

Cluster observations into subgroups [Traag18].

tools.louvain

Cluster observations into subgroups [Blondel08] [Levine15] [Traag17].

tools.dendrogram

Computes a hierarchical clustering for the given groupby categories.

tools.dpt

Infer progression of observations through geodesic distance along the graph [Haghverdi16] [Wolf19].

tools.paga

Mapping out the coarse-grained connectivity structures of complex manifolds [Wolf19].

Group comparison#

tools.rank_features_groups

Rank features for characterizing groups.

tools.filter_rank_features_groups

Filters out features based on fold change and the fraction of observations containing the feature within and outside the groupby categories.

tools.marker_feature_overlap

Calculate an overlap score between data-derived features and provided marker features.

Dataset integration#

tools.ingest

Map labels and embeddings from reference data to new data.

Natural language processing#

tools.Translator

Class providing an interface to all translation functions.

tools.MedCAT

Wrapper class for MedCAT.

tools.mc.run_unsupervised_training

Performs MedCAT unsupervised training on a provided text column.

tools.mc.annotate_text

Annotate the original free text data.

tools.mc.get_annotation_overview

Provide an overview for the annotation results.

Survival Analysis#

tools.ols

Create an Ordinary Least Squares (OLS) Model from a formula and AnnData.

tools.glm

Create a Generalized Linear Model (GLM) from a formula, a distribution, and AnnData.

tools.kmf

Fit the Kaplan-Meier estimate for the survival function.

tools.test_kmf_logrank

Calculates the p-value for the logrank test comparing the survival functions of two groups.

tools.test_nested_f_statistic

Given two fitted GLMs, the larger of which contains the parameter space of the smaller, returns the p-value for whether the larger model adds explanatory power.
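For readers unfamiliar with the Kaplan-Meier estimate, the sketch below computes it by hand: at each time with at least one observed event, the running survival probability is multiplied by `1 - events / at_risk`. This is a textbook illustration in plain Python, not the fitter used by ehrapy. The helper `kaplan_meier` and the toy durations are assumptions.

```python
def kaplan_meier(durations, event_observed):
    """Return [(time, survival_probability)] at each distinct event time."""
    pairs = sorted(zip(durations, event_observed))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        events = 0
        leaving = 0
        # Group all subjects whose follow-up ends at time t.
        while i < len(pairs) and pairs[i][0] == t:
            events += 1 if pairs[i][1] else 0
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without changing the estimate.
        at_risk -= leaving
    return curve


# Made-up follow-up times; False marks a censored observation.
durations = [2, 3, 3, 5, 8]
events = [True, True, False, True, False]
curve = kaplan_meier(durations, events)
```

Censored observations still count toward the number at risk until they drop out, which is why they cannot simply be discarded before fitting.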

Causal Inference#

tools.causal_inference

Performs causal inference on an AnnData object using the specified causal model and returns a tuple containing the causal estimate and the results of any refutation tests.

Plotting#

The plotting module ehrapy.pl.* largely parallels the tl.* and a few of the pp.* functions. For most tools and for some preprocessing functions, you will find a plotting function with the same name.

Generic#

plot.scatter

Scatter plot along observations or variables axes.

plot.heatmap

Heatmap of the feature values.

plot.dotplot

Makes a dot plot of the count values of var_names.

plot.tracksplot

Plots a filled line plot.

plot.violin

Violin plot.

plot.stacked_violin

Stacked violin plots.

plot.matrixplot

Creates a heatmap of the mean count per group of each var_names.

plot.clustermap

Hierarchically-clustered heatmap.

plot.ranking

Plot rankings.

plot.dendrogram

Plots a dendrogram of the categories defined in groupby.

Quality Control and missing values#

plot.qc_metrics

Plots the calculated quality control metrics for var of adata.

plot.missing_values_matrix

A matrix visualization of the nullity of the given AnnData object.

plot.missing_values_barplot

A bar chart visualization of the nullity of the given AnnData object.

plot.missing_values_heatmap

Presents a seaborn heatmap visualization of nullity correlation in the given AnnData object.

plot.missing_values_dendrogram

Fits a scipy hierarchical clustering algorithm to the given AnnData object's var and visualizes the results as a scipy dendrogram.

Classes#

Please refer to Scanpy’s plotting classes documentation.

Tools#

Methods that extract and visualize tool-specific annotation in an AnnData object. For any method in module tl, there is a method with the same name in pl.

plot.pca

Scatter plot in PCA coordinates.

plot.pca_loadings

Rank features according to contributions to PCs.

plot.pca_variance_ratio

Plot the variance ratio.

plot.pca_overview

Plot PCA results.

Embeddings#

plot.tsne

Scatter plot in tSNE basis.

plot.umap

Scatter plot in UMAP basis.

plot.diffmap

Scatter plot in Diffusion Map basis.

plot.draw_graph

Scatter plot in graph-drawing basis.

plot.spatial

Scatter plot in spatial coordinates.

plot.embedding

Scatter plot for a user-specified embedding basis.

plot.embedding_density

Plot the density of observations in an embedding (per condition).

Branching trajectories and pseudotime, clustering#

Visualize clusters using one of the embedding methods by passing color='leiden'.

plot.dpt_groups_pseudotime

Plot groups and pseudotime.

plot.dpt_timeseries

Heatmap of pseudotime series.

plot.paga

Plot the PAGA graph through thresholding low-connectivity edges.

plot.paga_path

Feature changes along paths in the abstracted graph.

plot.paga_compare

Scatter and PAGA graph side-by-side.

Group comparison#

plot.rank_features_groups

Plot ranking of features.

plot.rank_features_groups_violin

Plot ranking of features for all tested comparisons as violin plots.

plot.rank_features_groups_stacked_violin

Plot ranking of features using stacked_violin plot.

plot.rank_features_groups_heatmap

Plot ranking of features using heatmap plot (see heatmap()).

plot.rank_features_groups_dotplot

Plot ranking of features using dotplot plot (see dotplot()).

plot.rank_features_groups_matrixplot

Plot ranking of features using matrixplot plot (see matrixplot()).

plot.rank_features_groups_tracksplot

Plot ranking of features using tracksplot plot (see tracksplot()).

Survival Analysis#

plot.ols

Plots an Ordinary Least Squares (OLS) Model result, scatter plot, and line plot.

plot.kmf

Plots a pretty figure of the fitted KaplanMeierFitter model.

Causal Inference#

plot.causal_effect

Plot the causal effect estimate.

AnnData utilities#

The ehrapy API exposes functions to transform a pandas dataframe into an AnnData object and vice versa.

anndata.df_to_anndata

Transform a given pandas dataframe into an AnnData object.

anndata.anndata_to_df

Transform an AnnData object to a pandas dataframe.

anndata.move_to_obs

Move inplace or copy features from X to obs.

anndata.delete_from_obs

Delete features from obs.

anndata.move_to_x

Move features from obs to X inplace.

anndata.get_obs_df

Return values for observations in adata.

anndata.get_var_df

Return values for variables in adata.

anndata.get_rank_features_df

ehrapy.tl.rank_features_groups() results in the form of a DataFrame.

anndata.type_overview

Prints the current state of an AnnData object in a tree format.

Settings#

A convenience object for setting some default matplotlib.rcParams and a high-resolution Jupyter display backend that is useful in notebooks.

An instance of ScanpyConfig is available as ehrapy.settings and can be used to configure ehrapy.

import ehrapy as ep
ep.settings.set_figure_params(dpi=150)

Please refer to the Scanpy settings documentation for configuration options. ehrapy will adapt these in the future and update the documentation.

Dependency Versions#

ehrapy is complex software with many dependencies. To ensure a consistent runtime environment you should save all versions that were used for an analysis. This comes in handy when trying to diagnose issues and to reproduce results.

Print the versions via:

import ehrapy as ep
ep.print_versions()
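If you also want the versions in a machine-readable form, for example to store them next to your results, the standard library can look them up. The sketch below is independent of `ep.print_versions()`. The helper `collect_versions` and the package list are assumptions, and packages that are not installed are recorded as `None`.

```python
from importlib import metadata


def collect_versions(packages):
    """Return {package: installed version or None} using package metadata."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly instead of failing the whole snapshot.
            versions[name] = None
    return versions


# The last name is deliberately bogus to show how missing packages are reported.
snapshot = collect_versions(
    ["ehrapy", "anndata", "pandas", "a-package-that-does-not-exist-xyz"]
)
```

The resulting dictionary can be written to JSON alongside the analysis output.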