Usage

Import the ehrapy API as follows:

import ehrapy as ep

You can then access the respective modules like:

ep.pl.cool_fancy_plot()

Reading and writing

io.read_csv

Reads or downloads a desired directory of csv/tsv files or a single csv/tsv file.

io.read_h5ad

Reads or downloads a desired directory of h5ad files or a single h5ad file.

io.read_fhir

Reads one or multiple FHIR files using fhiry.

io.write

Write AnnData objects to file.

Data

data.mimic_2

Loads the MIMIC-II dataset.

data.mimic_2_preprocessed

Loads the preprocessed MIMIC-II dataset.

data.mimic_3_demo

Loads the MIMIC-III demo dataset as a dictionary of Pandas DataFrames.

data.diabetes_130_raw

Loads the raw diabetes-130 dataset.

data.diabetes_130_fairlearn

Loads the preprocessed diabetes-130 dataset by fairlearn.

data.heart_failure

Loads the heart failure dataset.

data.chronic_kidney_disease

Loads the Chronic Kidney Disease dataset.

data.breast_tissue

Loads the Breast Tissue Data Set.

data.cervical_cancer_risk_factors

Loads the Cervical cancer (Risk Factors) Data Set.

data.dermatology

Loads the Dermatology Data Set.

data.echocardiogram

Loads the Echocardiogram Data Set.

data.heart_disease

Loads the Heart Disease Data Set.

data.hepatitis

Loads the Hepatitis Data Set.

data.statlog_heart

Loads the Statlog (Heart) Data Set.

data.thyroid

Loads the Thyroid Data Set.

data.breast_cancer_coimbra

Loads the Breast Cancer Coimbra Data Set.

data.parkinson_dataset_with_replicated_acoustic_features

Loads the Parkinson Dataset with replicated acoustic features Data Set.

data.parkinsons

Loads the Parkinsons Data Set.

data.parkinsons_disease_classification

Loads the Parkinson's Disease Classification Data Set.

data.parkinsons_telemonitoring

Loads the Parkinsons Telemonitoring Data Set.

Preprocessing

Any transformation of the data matrix that is not a tool. Unlike tools, preprocessing steps usually don’t return an easily interpretable annotation; instead, they perform a basic transformation on the data matrix.

Basic preprocessing

preprocessing.pca

Computes a principal component analysis.

preprocessing.regress_out

Regress out (mostly) unwanted sources of variation.

preprocessing.subsample

Subsample to a fraction of the number of observations.

preprocessing.balanced_sample

Balances groups in the dataset.

preprocessing.highly_variable_features

Annotate highly variable features.

preprocessing.winsorize

Returns a Winsorized version of the input array.
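As an illustration of what winsorizing means, here is a minimal pure-Python sketch. It is not ehrapy's implementation: the function name `winsorize_sketch` and its `lower`/`upper` parameters are hypothetical, and the cut-offs are chosen by rank, which is only one of several common conventions.

```python
def winsorize_sketch(values, lower=0.1, upper=0.1):
    """Clamp the lowest `lower` and highest `upper` fraction of values.

    Hypothetical helper for illustration only; not part of ehrapy.
    """
    if lower < 0 or upper < 0 or lower + upper >= 1:
        raise ValueError("lower and upper must be non-negative and sum to less than 1")
    if not values:
        return []
    ordered = sorted(values)
    n = len(ordered)
    # Smallest and largest values that are kept unchanged.
    low_val = ordered[int(lower * n)]
    high_val = ordered[n - int(upper * n) - 1]
    # Extreme values are replaced by the nearest kept value instead of being dropped.
    return [min(max(v, low_val), high_val) for v in values]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(winsorize_sketch(data))
```

Unlike trimming, winsorizing keeps the number of observations unchanged, which matters when rows correspond to patients.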

preprocessing.clip_quantile

Clips (limits) features.

preprocessing.summarize_measurements

Summarizes numerical measurements into minimum, maximum and average values.
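The idea of collapsing repeated measurements into summary statistics can be sketched in plain Python as follows. The function `summarize_sketch`, the input layout (a dictionary of lists), and the `_min`/`_max`/`_avg` suffixes are assumptions made for this example and may not match ehrapy's actual output naming.

```python
from statistics import mean


def summarize_sketch(measurements):
    """Map each measurement series to its minimum, maximum and average.

    Hypothetical helper for illustration only; not part of ehrapy.
    """
    summary = {}
    for name, values in measurements.items():
        if not values:
            continue  # nothing to summarize for an empty series
        summary[f"{name}_min"] = min(values)
        summary[f"{name}_max"] = max(values)
        summary[f"{name}_avg"] = mean(values)
    return summary


# Repeated readings for one patient (made-up numbers).
print(summarize_sketch({"heart_rate": [60, 80, 100]}))
```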

Quality control

preprocessing.qc_metrics

Calculates various quality control metrics.

preprocessing.qc_lab_measurements

Examines lab measurements for reference ranges and outliers.

preprocessing.mcar_test

Statistical hypothesis test for Missing Completely At Random (MCAR).

Imputation

preprocessing.explicit_impute

Replaces all missing values in all columns or a subset of columns specified by the user with the passed replacement value.

preprocessing.simple_impute

Impute missing values in numerical data using mean/median/most frequent imputation.
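The three strategies named above can be illustrated with a small stdlib-only sketch. This is not ehrapy's code: `simple_impute_sketch` is a hypothetical helper that works on a single list and uses `None` to mark missing entries.

```python
from statistics import mean, median, mode

# Supported strategies mapped to the statistic used as the fill value.
_STRATEGIES = {"mean": mean, "median": median, "most_frequent": mode}


def simple_impute_sketch(column, strategy="mean"):
    """Replace None entries with a statistic of the observed values.

    Hypothetical helper for illustration only; not part of ehrapy.
    """
    if strategy not in _STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column without observed values")
    fill = _STRATEGIES[strategy](observed)
    return [fill if v is None else v for v in column]


print(simple_impute_sketch([1.0, None, 3.0]))
print(simple_impute_sketch([1, None, 2, 10], strategy="median"))
```

The median is usually the safer choice for skewed clinical measurements, since a single extreme value shifts the mean but not the median.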

preprocessing.knn_impute

Imputes missing values in the input AnnData object using K-nearest neighbor imputation.

preprocessing.miss_forest_impute

Impute data using the MissForest strategy.

preprocessing.soft_impute

Impute data using SoftImpute.

preprocessing.iterative_svd_impute

Impute missing values in an AnnData object using the IterativeSVD algorithm.

preprocessing.matrix_factorization_impute

Impute data using MatrixFactorization.

preprocessing.nuclear_norm_minimization_impute

Impute data using NuclearNormMinimization.

preprocessing.mice_forest_impute

Impute data using miceforest.

Encoding

preprocessing.encode

Encode categoricals of an AnnData object.

Normalization

preprocessing.log_norm

Apply log normalization.

preprocessing.maxabs_norm

Apply max-abs normalization.

preprocessing.minmax_norm

Apply min-max normalization.
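For reference, min-max normalization rescales each value to (x - min) / (max - min), so the result lies in [0, 1]. The sketch below shows this for a single feature; `minmax_sketch` is a hypothetical name, and mapping a constant feature to zeros is a choice made here, not necessarily what ehrapy does.

```python
def minmax_sketch(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1.

    Hypothetical helper for illustration only; not part of ehrapy.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: avoid division by zero (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(minmax_sketch([2, 4, 6]))
```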

preprocessing.power_norm

Apply power transformation normalization.

preprocessing.quantile_norm

Apply quantile normalization.

preprocessing.robust_scale_norm

Apply robust scaling normalization.

preprocessing.scale_norm

Apply scaling normalization.

preprocessing.sqrt_norm

Apply square root normalization.

preprocessing.offset_negative_values

Offsets negative values into positive ones with the lowest negative value becoming 0.
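The offset described above amounts to subtracting the lowest value whenever that value is negative. A minimal sketch for one feature follows; `offset_negative_sketch` is a hypothetical helper, not ehrapy's implementation.

```python
def offset_negative_sketch(values):
    """Shift values so the lowest negative value becomes 0.

    Hypothetical helper for illustration only; not part of ehrapy.
    """
    lowest = min(values)
    if lowest >= 0:
        return list(values)  # nothing negative, leave the data unchanged
    return [v - lowest for v in values]


print(offset_negative_sketch([-3, 0, 2]))
```

Such a shift is useful before transformations that are undefined for negative inputs, for example a logarithm or square root.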

Dataset Shift Correction

Partially overlaps with dataset integration. Note that a simple batch correction method is available via pp.regress_out().

preprocessing.combat

ComBat function for batch effect correction [Johnson07] [Leek12] [Pedersen12].

Neighbors

preprocessing.neighbors

Compute a neighborhood graph of observations [McInnes18].

Tools

Any transformation of the data matrix that is not preprocessing. In contrast to a preprocessing function, a tool usually adds an easily interpretable annotation to the data matrix, which can then be visualized with a corresponding plotting function.

Embeddings

tools.tsne

Calculates t-SNE [Maaten08] [Amir13] [Pedregosa11].

tools.umap

Embed the neighborhood graph using UMAP [McInnes18].

tools.draw_graph

Force-directed graph drawing [Islam11] [Jacomy14] [Chippada18].

tools.diffmap

Diffusion Maps [Coifman05] [Haghverdi15] [Wolf18].

tools.embedding_density

Calculate the density of observations in an embedding (per condition).

Clustering and trajectory inference

tools.leiden

Cluster observations into subgroups [Traag18].

tools.dendrogram

Computes a hierarchical clustering for the given groupby categories.

tools.dpt

Infer progression of observations through geodesic distance along the graph [Haghverdi16] [Wolf19].

tools.paga

Mapping out the coarse-grained connectivity structures of complex manifolds [Wolf19].

Feature Ranking

tools.rank_features_groups

Rank features for characterizing groups.

tools.filter_rank_features_groups

Filters out features based on fold change and the fraction of observations containing the feature within and outside the groupby categories.

tools.rank_features_supervised

Calculate feature importances for predicting a specified feature in adata.var.

Dataset integration

tools.ingest

Map labels and embeddings from reference data to new data.

Natural language processing

tools.Translator

Class providing an interface to all translation functions.

tools.annotate_text

Annotate the original free text data.

tools.get_medcat_annotation_overview

Provide an overview for the annotation results.

tools.add_medcat_annotation_to_obs

Add info extracted from free text as a binary column to obs.

Survival Analysis

tools.ols

Create an Ordinary Least Squares (OLS) Model from a formula and AnnData.

tools.glm

Create a Generalized Linear Model (GLM) from a formula, a distribution, and AnnData.

tools.kmf

Fit the Kaplan-Meier estimate for the survival function.
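As background, the Kaplan-Meier estimate multiplies, over all event times up to t, the factor (1 - d_i / n_i), where d_i is the number of events at time t_i and n_i the number of subjects still at risk. The stdlib-only sketch below computes this directly; `kaplan_meier_sketch` and its arguments are hypothetical and unrelated to the signature of the function documented here.

```python
def kaplan_meier_sketch(durations, events):
    """Return [(time, survival_probability)] for each distinct event time.

    `events[i]` is truthy if the event was observed and falsy if censored.
    Hypothetical helper for illustration only; not part of ehrapy.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects whose follow-up lasts at least until t are still at risk.
        at_risk = sum(1 for d in durations if d >= t)
        observed = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1 - observed / at_risk
        curve.append((t, survival))
    return curve


# Four subjects; the second one is censored at time 2 (made-up data).
print(kaplan_meier_sketch([1, 2, 3, 4], [1, 0, 1, 1]))
```

Censored subjects never trigger a drop in the curve themselves, but they leave the risk set, which makes later drops larger.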

tools.test_kmf_logrank

Calculates the p-value for the logrank test comparing the survival functions of two groups.

tools.test_nested_f_statistic

Calculate the P value indicating if a larger GLM, encompassing a smaller GLM's parameters, adds explanatory power.

tools.cox_ph

Fit the Cox proportional hazards model for the survival function.

tools.weibull_aft

Fit the Weibull accelerated failure time regression for the survival function.

tools.log_logistic_aft

Fit the log logistic accelerated failure time regression for the survival function.

tools.nelson_alen

Employ the Nelson-Aalen estimator to estimate the cumulative hazard function from censored survival data.
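For orientation, the Nelson-Aalen estimator sums d_i / n_i over all event times up to t, where d_i is the number of events at time t_i and n_i the number of subjects at risk. A minimal stdlib-only sketch is shown below; `nelson_aalen_sketch` is a hypothetical helper and does not reflect how the function documented here is called.

```python
def nelson_aalen_sketch(durations, events):
    """Return [(time, cumulative_hazard)] for each distinct event time.

    `events[i]` is truthy if the event was observed and falsy if censored.
    Hypothetical helper for illustration only; not part of ehrapy.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")
    event_times = sorted({t for t, e in zip(durations, events) if e})
    cumulative_hazard = 0.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        observed = sum(1 for d, e in zip(durations, events) if d == t and e)
        # Each event time contributes its hazard increment d_i / n_i.
        cumulative_hazard += observed / at_risk
        curve.append((t, cumulative_hazard))
    return curve


# Four subjects; the second one is censored at time 2 (made-up data).
print(nelson_aalen_sketch([1, 2, 3, 4], [1, 0, 1, 1]))
```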

tools.weibull

Employ the Weibull model in univariate survival analysis to understand event occurrence dynamics.

Causal Inference

tools.causal_inference

Performs causal inference on an AnnData object using the specified causal model and returns a tuple containing the causal estimate and the results of any refutation tests.

Cohort Tracking

tools.CohortTracker

Track cohort changes over multiple filtering or processing steps.

Plotting

The plotting module ehrapy.pl.* largely parallels the tl.* and a few of the pp.* functions. For most tools and for some preprocessing functions, you will find a plotting function with the same name.

Generic

plot.scatter

Scatter plot along observations or variables axes.

plot.heatmap

Heatmap of the feature values.

plot.dotplot

Makes a dot plot of the count values of var_names.

plot.tracksplot

Plots a filled line plot.

plot.violin

Violin plot.

plot.stacked_violin

Stacked violin plots.

plot.matrixplot

Creates a heatmap of the mean count per group of each var_names.

plot.clustermap

Hierarchically-clustered heatmap.

plot.ranking

Plot rankings.

plot.dendrogram

Plots a dendrogram of the categories defined in groupby.

Quality Control and missing values

plot.missing_values_matrix

A matrix visualization of the nullity of the given AnnData object.

plot.missing_values_barplot

A bar chart visualization of the nullity of the given AnnData object.

plot.missing_values_heatmap

Presents a seaborn heatmap visualization of nullity correlation in the given AnnData object.

plot.missing_values_dendrogram

Fits a scipy hierarchical clustering algorithm to the given AnnData object's var and visualizes the results as a scipy dendrogram.

Classes

Please refer to Scanpy’s plotting classes documentation.

Tools

Methods that extract and visualize tool-specific annotation in an AnnData object. For any method in module tl, there is a method with the same name in pl.

plot.pca

Scatter plot in PCA coordinates.

plot.pca_loadings

Rank features according to contributions to PCs.

plot.pca_variance_ratio

Plot the variance ratio.

plot.pca_overview

Plot PCA results.

Embeddings

plot.tsne

Scatter plot in tSNE basis.

plot.umap

Scatter plot in UMAP basis.

plot.diffmap

Scatter plot in Diffusion Map basis.

plot.draw_graph

Scatter plot in graph-drawing basis.

plot.embedding

Scatter plot for a user-specified embedding basis (e.g. umap or pca).

plot.embedding_density

Plot the density of observations in an embedding (per condition).

Branching trajectories and pseudotime

plot.dpt_groups_pseudotime

Plot groups and pseudotime.

plot.dpt_timeseries

Heatmap of pseudotime series.

plot.paga

Plot the PAGA graph through thresholding low-connectivity edges.

plot.paga_path

Feature changes along paths in the abstracted graph.

plot.paga_compare

Scatter and PAGA graph side-by-side.

Feature Ranking

plot.rank_features_groups

Plot ranking of features.

plot.rank_features_groups_violin

Plot ranking of features for all tested comparisons as violin plots.

plot.rank_features_groups_stacked_violin

Plot ranking of features using a stacked violin plot (see stacked_violin()).

plot.rank_features_groups_heatmap

Plot ranking of features using a heatmap (see heatmap()).

plot.rank_features_groups_dotplot

Plot ranking of features using a dot plot (see dotplot()).

plot.rank_features_groups_matrixplot

Plot ranking of features using a matrix plot (see matrixplot()).

plot.rank_features_groups_tracksplot

Plot ranking of features using a tracks plot (see tracksplot()).

plot.rank_features_supervised

Plot features with the greatest absolute importances as a barplot.

Survival Analysis

plot.ols

Plots an Ordinary Least Squares (OLS) Model result, scatter plot, and line plot.

plot.kmf

Plots a figure of the fitted KaplanMeierFitter model.

Causal Inference

plot.causal_effect

Plot the causal effect estimate.

AnnData utilities

anndata.infer_feature_types

Infer feature types from AnnData object.

anndata.df_to_anndata

Transform a given pandas dataframe into an AnnData object.

anndata.anndata_to_df

Transform an AnnData object to a Pandas DataFrame.

anndata.move_to_obs

Move inplace or copy features from X to obs.

anndata.delete_from_obs

Delete features from obs.

anndata.move_to_x

Move features from obs to X inplace.

anndata.get_obs_df

Return values for observations in adata.

anndata.get_var_df

Return values for variables in adata.

anndata.get_rank_features_df

ehrapy.tl.rank_features_groups() results in the form of a DataFrame.

anndata.type_overview

Prints the current state of an AnnData object in a tree format.

Settings

A convenience object for setting default matplotlib.rcParams and a high-resolution Jupyter display backend, useful in notebooks.

An instance of ScanpyConfig is available as ehrapy.settings and allows configuring ehrapy.

import ehrapy as ep

ep.settings.set_figure_params(dpi=150)

Please refer to the Scanpy settings documentation for configuration options. ehrapy will adapt these in the future and update the documentation.

Dependency Versions

ehrapy is complex software with many dependencies. To ensure a consistent runtime environment, you should save the versions of all packages used for an analysis. This comes in handy when diagnosing issues and reproducing results.

Call the function via:

import ehrapy as ep

ep.print_versions()