Usage#

Import the ehrapy API as follows:

import ehrapy as ep

You can then access the respective modules like:

ep.pl.cool_fancy_plot()

Reading and writing#

`io.read_csv`	Reads or downloads a desired directory of csv/tsv files or a single csv/tsv file.
`io.read_h5ad`	Reads or downloads a desired directory of h5ad files or a single h5ad file.
`io.read_fhir`	Reads one or multiple FHIR files using fhiry.
`io.write`	Write `AnnData` objects to file.

Data#

`data.mimic_2`	Loads the MIMIC-II dataset.
`data.mimic_2_preprocessed`	Loads the preprocessed MIMIC-II dataset.
`data.mimic_3_demo`	Loads the MIMIC-III demo dataset as a dictionary of Pandas DataFrames.
`data.diabetes_130`	Loads the diabetes-130 dataset.
`data.heart_failure`	Loads the heart failure dataset.
`data.chronic_kidney_disease`	Loads the Chronic Kidney Disease dataset.
`data.breast_tissue`	Loads the Breast Tissue Data Set.
`data.cervical_cancer_risk_factors`	Loads the Cervical cancer (Risk Factors) Data Set.
`data.dermatology`	Loads the Dermatology Data Set.
`data.echocardiogram`	Loads the Echocardiogram Data Set.
`data.heart_disease`	Loads the Heart Disease Data Set.
`data.hepatitis`	Loads the Hepatitis Data Set.
`data.statlog_heart`	Loads the Statlog (Heart) Data Set.
`data.thyroid`	Loads the Thyroid Data Set.
`data.breast_cancer_coimbra`	Loads the Breast Cancer Coimbra Data Set.
`data.parkinson_dataset_with_replicated_acoustic_features`	Loads the Parkinson Dataset with replicated acoustic features Data Set.
`data.parkinsons`	Loads the Parkinsons Data Set.
`data.parkinsons_disease_classification`	Loads the Parkinson's Disease Classification Data Set.
`data.parkinsons_telemonitoring`	Loads the Parkinsons Telemonitoring Data Set.

Preprocessing#

Any transformation of the data matrix that is not a tool. Unlike tools, preprocessing steps usually don’t return an easily interpretable annotation, but perform a basic transformation on the data matrix.

Basic preprocessing#

`preprocessing.pca`	Computes a principal component analysis.
`preprocessing.regress_out`	Regress out (mostly) unwanted sources of variation.
`preprocessing.subsample`	Subsample to a fraction of the number of observations.
`preprocessing.highly_variable_features`	Annotate highly variable features.
`preprocessing.winsorize`	Returns a Winsorized version of the input array.
`preprocessing.clip_quantile`	Clips (limits) features.
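
To make the idea behind winsorizing concrete, the following is a minimal pure-Python sketch. It is not ehrapy's implementation: the helper name `winsorize_sketch` and its `lower`/`upper` fraction parameters are assumptions made for illustration. Values in the lowest and highest fractions of the data are replaced by the nearest value that is kept.

```python
def winsorize_sketch(values, lower=0.1, upper=0.1):
    """Clamp the lowest/highest fractions of `values` to the nearest kept value.

    Hypothetical helper for illustration only; not part of ehrapy.
    """
    n = len(values)
    if n == 0:
        return []
    ordered = sorted(values)
    low_idx = int(lower * n)            # how many values are replaced at the low end
    high_idx = n - int(upper * n) - 1   # index of the largest value that is kept
    low_val, high_val = ordered[low_idx], ordered[high_idx]
    # Clamp every value into the range [low_val, high_val], preserving order.
    return [min(max(v, low_val), high_val) for v in values]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
result = winsorize_sketch(data, lower=0.1, upper=0.1)
print(result)
```

In contrast to dropping outliers, this keeps the number of observations unchanged, which is why it is listed among the basic preprocessing steps rather than as a filter.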

Quality control#

`preprocessing.qc_metrics`	Calculates various quality control metrics.
`preprocessing.qc_lab_measurements`	Examines lab measurements for reference ranges and outliers.

Imputation#

`preprocessing.explicit_impute`	Replaces all missing values in all columns or a subset of columns specified by the user with the passed replacement value.
`preprocessing.simple_impute`	Impute missing values in numerical data using mean/median/most frequent imputation.
`preprocessing.knn_impute`	Imputes missing values in the input AnnData object using K-nearest neighbor imputation.
`preprocessing.miss_forest_impute`	Impute data using the MissForest strategy.
`preprocessing.soft_impute`	Impute data using SoftImpute.
`preprocessing.iterative_svd_impute`	Impute missing values in an AnnData object using the IterativeSVD algorithm.
`preprocessing.matrix_factorization_impute`	Impute data using MatrixFactorization.
`preprocessing.nuclear_norm_minimization_impute`	Impute data using NuclearNormMinimization.
`preprocessing.mice_forest_impute`	Impute data using miceforest.
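
As an illustration of the simplest strategy listed above (mean imputation), here is a short pure-Python sketch. It operates on a plain list rather than an `AnnData` object, and the helper name `mean_impute_sketch` is an assumption for illustration; it does not reflect how ehrapy implements `simple_impute`.

```python
import math


def mean_impute_sketch(column):
    """Replace missing entries (None or NaN) with the mean of the observed ones.

    Hypothetical helper for illustration only; not part of ehrapy.
    """
    def is_missing(v):
        return v is None or math.isnan(v)

    observed = [v for v in column if not is_missing(v)]
    if not observed:
        # Nothing to compute a mean from; return the column unchanged.
        return list(column)
    mean = sum(observed) / len(observed)
    return [mean if is_missing(v) else v for v in column]


ages = [30.0, None, 50.0, float("nan"), 40.0]
imputed = mean_impute_sketch(ages)
print(imputed)
```

The more elaborate methods in the table (KNN, MissForest, and so on) follow the same contract of filling missing entries, but estimate each replacement from other observations or features instead of a single column statistic.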

Encoding#

`preprocessing.encode`	Encode categoricals of an `AnnData` object.
`preprocessing.undo_encoding`	Undo the current encodings applied to all columns in X.
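
One common way to encode categoricals is one-hot encoding. The sketch below shows the idea, together with the reverse step, on a plain list. Which encodings ehrapy applies is not stated here, and the helper names are hypothetical, so treat this as a conceptual illustration only.

```python
def one_hot_sketch(values):
    """Return the sorted categories and one indicator row per input value.

    Hypothetical helper for illustration only; not part of ehrapy.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def undo_one_hot_sketch(categories, rows):
    """Recover the original categorical values from the indicator rows."""
    return [categories[row.index(1)] for row in rows]


original = ["male", "female", "female"]
categories, encoded = one_hot_sketch(original)
restored = undo_one_hot_sketch(categories, encoded)
print(categories, encoded, restored)
```

Keeping the category list alongside the encoded rows is what makes the encoding reversible.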

Normalization#

`preprocessing.log_norm`	Apply log normalization.
`preprocessing.maxabs_norm`	Apply max-abs normalization.
`preprocessing.minmax_norm`	Apply min-max normalization.
`preprocessing.power_norm`	Apply power transformation normalization.
`preprocessing.quantile_norm`	Apply quantile normalization.
`preprocessing.robust_scale_norm`	Apply robust scaling normalization.
`preprocessing.scale_norm`	Apply scaling normalization.
`preprocessing.sqrt_norm`	Apply square root normalization.
`preprocessing.offset_negative_values`	Offsets negative values into positive ones with the lowest negative value becoming 0.
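
Two of the transformations above are simple enough to sketch in pure Python: min-max normalization and offsetting negative values. The helper names below are assumptions for illustration and work on plain lists; they are not ehrapy's implementations.

```python
def minmax_norm_sketch(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no range; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def offset_negative_values_sketch(values):
    """Shift values so that the lowest negative value becomes 0."""
    lowest = min(values)
    if lowest >= 0:
        # No negative values, so nothing to shift.
        return list(values)
    return [v - lowest for v in values]


scaled = minmax_norm_sketch([2, 4, 6])
shifted = offset_negative_values_sketch([-3, 0, 2])
print(scaled, shifted)
```

Offsetting is useful before transformations that are undefined for negative inputs, such as a logarithm or a square root.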

Dataset Shift Correction#

Partially overlaps with dataset integration. Note that a simple batch correction method is available via pp.regress_out().

`preprocessing.combat`	ComBat function for batch effect correction [Johnson07] [Leek12] [Pedersen12].

Neighbors#

`preprocessing.neighbors`	Compute a neighborhood graph of observations [McInnes18].

Tools#

Any transformation of the data matrix that is not preprocessing. In contrast to a preprocessing function, a tool usually adds an easily interpretable annotation to the data matrix, which can then be visualized with a corresponding plotting function.

Embeddings#

`tools.pca`	Computes a principal component analysis.
`tools.tsne`	Calculates t-SNE [Maaten08] [Amir13] [Pedregosa11].
`tools.umap`	Embed the neighborhood graph using UMAP [McInnes18].
`tools.draw_graph`	Force-directed graph drawing [Islam11] [Jacomy14] [Chippada18].
`tools.diffmap`	Diffusion Maps [Coifman05] [Haghverdi15] [Wolf18].
`tools.embedding_density`	Calculate the density of observation in an embedding (per condition).

Clustering and trajectory inference#

`tools.leiden`	Cluster observations into subgroups [Traag18].
`tools.louvain`	Cluster observations into subgroups [Blondel08] [Levine15] [Traag17].
`tools.dendrogram`	Computes a hierarchical clustering for the given groupby categories.
`tools.dpt`	Infer progression of observations through geodesic distance along the graph [Haghverdi16] [Wolf19].
`tools.paga`	Mapping out the coarse-grained connectivity structures of complex manifolds [Wolf19].

Group comparison#

`tools.rank_features_groups`	Rank features for characterizing groups.
`tools.filter_rank_features_groups`	Filters out features based on fold change and the fraction of observations containing the feature within and outside the groupby categories.
`tools.marker_feature_overlap`	Calculate an overlap score between data-derived features and provided marker features.

Dataset integration#

`tools.ingest`	Map labels and embeddings from reference data to new data.

Natural language processing#

`tools.Translator`	Class providing an interface to all translation functions.
`tools.MedCAT`	Wrapper class for MedCAT.
`tools.mc.run_unsupervised_training`	Performs MedCAT unsupervised training on a provided text column.
`tools.mc.annotate_text`	Annotate the original free text data.
`tools.mc.get_annotation_overview`	Provide an overview for the annotation results.

Survival Analysis#

`tools.ols`	Create an Ordinary Least Squares (OLS) Model from a formula and AnnData.
`tools.glm`	Create a Generalized Linear Model (GLM) from a formula, a distribution, and AnnData.
`tools.kmf`	Fit the Kaplan-Meier estimate for the survival function.
`tools.test_kmf_logrank`	Calculates the p-value for the logrank test comparing the survival functions of two groups.
`tools.test_nested_f_statistic`	Given two fitted GLMs, the larger of which contains the parameter space of the smaller, return the P value corresponding to the larger model adding explanatory power.
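
For readers unfamiliar with the Kaplan-Meier estimate, the following pure-Python sketch computes it directly from durations and event indicators: at each event time the running survival probability is multiplied by one minus the number of events divided by the number of observations still at risk. The helper name and the toy data are assumptions for illustration; this is not how `tools.kmf` is implemented.

```python
def kaplan_meier_sketch(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    `events[i]` is True if the event was observed at `durations[i]`
    and False if the observation was censored at that time.
    Hypothetical helper for illustration only; not part of ehrapy.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        observed = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if observed:
            # Product-limit step: S(t) = S(t-) * (1 - d_t / n_t)
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        # Both events and censored observations leave the risk set.
        at_risk -= leaving
    return curve


durations = [1, 2, 2, 3, 4]
events = [True, True, False, True, False]
curve = kaplan_meier_sketch(durations, events)
for time, prob in curve:
    print(f"t={time}: S(t)={prob:.2f}")
```

Censored observations never lower the curve themselves; they only shrink the risk set used at later event times.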

Causal Inference#

`tools.causal_inference`	Performs causal inference on an AnnData object using the specified causal model and returns a tuple containing the causal estimate and the results of any refutation tests.

Plotting#

The plotting module ehrapy.pl.* largely parallels the tl.* and a few of the pp.* functions. For most tools and for some preprocessing functions, you will find a plotting function with the same name.

Generic#

`plot.scatter`	Scatter plot along observations or variables axes.
`plot.heatmap`	Heatmap of the feature values.
`plot.dotplot`	Makes a dot plot of the count values of var_names.
`plot.tracksplot`	Plots a filled line plot.
`plot.violin`	Violin plot.
`plot.stacked_violin`	Stacked violin plots.
`plot.matrixplot`	Creates a heatmap of the mean count per group of each var_names.
`plot.clustermap`	Hierarchically-clustered heatmap.
`plot.ranking`	Plot rankings.
`plot.dendrogram`	Plots a dendrogram of the categories defined in groupby.

Quality Control and missing values#

`plot.qc_metrics`	Plots the calculated quality control metrics for var of adata.
`plot.missing_values_matrix`	A matrix visualization of the nullity of the given AnnData object.
`plot.missing_values_barplot`	A bar chart visualization of the nullity of the given AnnData object.
`plot.missing_values_heatmap`	Presents a seaborn heatmap visualization of nullity correlation in the given AnnData object.
`plot.missing_values_dendrogram`	Fits a scipy hierarchical clustering algorithm to the given AnnData object's var and visualizes the results as a scipy dendrogram.

Classes#

Please refer to Scanpy’s plotting classes documentation.

Tools#

Methods that extract and visualize tool-specific annotation in an AnnData object. For any method in module tl, there is a method with the same name in pl.

`plot.pca`	Scatter plot in PCA coordinates.
`plot.pca_loadings`	Rank features according to contributions to PCs.
`plot.pca_variance_ratio`	Plot the variance ratio.
`plot.pca_overview`	Plot PCA results.

Embeddings#

`plot.tsne`	Scatter plot in tSNE basis.
`plot.umap`	Scatter plot in UMAP basis.
`plot.diffmap`	Scatter plot in Diffusion Map basis.
`plot.draw_graph`	Scatter plot in graph-drawing basis.
`plot.spatial`	Scatter plot in spatial coordinates.
`plot.embedding`	Scatter plot for user specified embedding basis (e.g.
`plot.embedding_density`	Plot the density of observations in an embedding (per condition).

Branching trajectories and pseudotime, clustering#

Visualize clusters by passing color='leiden' to one of the embedding plotting methods.

`plot.dpt_groups_pseudotime`	Plot groups and pseudotime.
`plot.dpt_timeseries`	Heatmap of pseudotime series.
`plot.paga`	Plot the PAGA graph through thresholding low-connectivity edges.
`plot.paga_path`	Feature changes along paths in the abstracted graph.
`plot.paga_compare`	Scatter and PAGA graph side-by-side.

Group comparison#

`plot.rank_features_groups`	Plot ranking of features.
`plot.rank_features_groups_violin`	Plot ranking of features for all tested comparisons as violin plots.
`plot.rank_features_groups_stacked_violin`	Plot ranking of features using a stacked violin plot.
`plot.rank_features_groups_heatmap`	Plot ranking of features using a heatmap (see `heatmap()`).
`plot.rank_features_groups_dotplot`	Plot ranking of features using a dot plot (see `dotplot()`).
`plot.rank_features_groups_matrixplot`	Plot ranking of features using a matrix plot (see `matrixplot()`).
`plot.rank_features_groups_tracksplot`	Plot ranking of features using a tracks plot (see `tracksplot()`).

Survival Analysis#

`plot.ols`	Plots an Ordinary Least Squares (OLS) Model result, scatter plot, and line plot.
`plot.kmf`	Plots a pretty figure of the fitted KaplanMeierFitter model.

Causal Inference#

`plot.causal_effect`	Plot the causal effect estimate.

AnnData utilities#

The ehrapy API exposes functions to transform a pandas dataframe into an AnnData object and vice versa.

`anndata.df_to_anndata`	Transform a given pandas dataframe into an AnnData object.
`anndata.anndata_to_df`	Transform an AnnData object to a pandas dataframe.
`anndata.move_to_obs`	Move inplace or copy features from X to obs.
`anndata.delete_from_obs`	Delete features from obs.
`anndata.move_to_x`	Move features from obs to X inplace.
`anndata.get_obs_df`	Return values for observations in adata.
`anndata.get_var_df`	Return values for variables in adata.
`anndata.get_rank_features_df`	`ehrapy.tl.rank_features_groups()` results in the form of a `DataFrame`.
`anndata.type_overview`	Prints the current state of an `AnnData` object in a tree format.

Settings#

A convenience object for setting some default matplotlib.rcParams and a high-resolution jupyter display backend useful for use in notebooks.

An instance of the ScanpyConfig is available as ehrapy.settings and allows configuring ehrapy.

import ehrapy as ep
ep.settings.set_figure_params(dpi=150)

Please refer to the Scanpy settings documentation for configuration options. ehrapy will adapt these in the future and update the documentation.

Dependency Versions#

ehrapy is complex software with many dependencies. To ensure a consistent runtime environment you should save all versions that were used for an analysis. This comes in handy when trying to diagnose issues and to reproduce results.

Call the function via:

import ehrapy as ep
ep.print_versions()
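
If you also want to store versions programmatically alongside your results, a small standard-library sketch such as the following can complement `ep.print_versions()`. It does not describe how `print_versions` works internally; the helper name `collect_versions` and the package list are assumptions chosen for illustration.

```python
from importlib import metadata


def collect_versions(packages):
    """Map each package name to its installed version, or None if not installed.

    Hypothetical helper for illustration only; not part of ehrapy.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The package list is an example; adjust it to the dependencies of your analysis.
versions = collect_versions(["ehrapy", "anndata", "scanpy"])
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```

The resulting dictionary can be written to a file next to the analysis output so that the environment can be reconstructed later.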