Usage¶
Import the ehrapy API as follows:
import ehrapy as ep
You can then access the respective modules like:
ep.pl.cool_fancy_plot()
Reading and writing¶
Reads or downloads a desired directory of csv/tsv files or a single csv/tsv file. |
|
Reads or downloads a desired directory of h5ad files or a single h5ad file. |
|
Reads one or multiple FHIR files using fhiry. |
|
Write |
Data¶
Loads the MIMIC-II dataset. |
|
Loads the preprocessed MIMIC-II dataset. |
|
Loads the MIMIC-III demo dataset as a dictionary of Pandas DataFrames. |
|
Loads the raw diabetes-130 dataset |
|
Loads the preprocessed diabetes-130 dataset by fairlearn |
|
Loads the heart failure dataset. |
|
Loads the Chronic Kidney Disease dataset |
|
Loads the Breast Tissue Data Set |
|
Loads the Cervical cancer (Risk Factors) Data Set |
|
Loads the Dermatology Data Set |
|
Loads the Echocardiogram Data Set |
|
Loads the Heart Disease Data Set |
|
Loads the Hepatitis Data Set |
|
Loads the Statlog (Heart) Data Set |
|
Loads the Thyroid Data Set |
|
Loads the Breast Cancer Coimbra Data Set |
|
Loads the Parkinson Dataset with replicated acoustic features Data Set |
|
Loads the Parkinsons Data Set |
|
Loads the Parkinson's Disease Classification Data Set |
|
Loads the Parkinsons Telemonitoring Data Set |
Preprocessing¶
Any transformation of the data matrix that is not a tool. Unlike tools, preprocessing steps usually don’t return an easily interpretable annotation; instead, they perform a basic transformation on the data matrix.
Basic preprocessing¶
Encode categoricals of an |
|
Computes a principal component analysis. |
|
Regress out (mostly) unwanted sources of variation. |
|
Subsample to a fraction of the number of observations. |
|
Balancing groups in the dataset. |
|
Annotate highly variable features. |
|
Returns a Winsorized version of the input array. |
|
Clips (limits) features. |
|
Summarizes numerical measurements into minimum, maximum and average values. |
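To make the last two entries concrete, here is a minimal standard-library sketch of what clipping and min/max/average summarization do to a list of measurements. It is not ehrapy's implementation; the helper names `clip` and `summarize` and the heart-rate values are hypothetical and chosen only for illustration.

```python
from statistics import mean


def clip(values, lower, upper):
    """Limit each value to the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def summarize(values):
    """Collapse repeated measurements into minimum, maximum and average."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


# Hypothetical repeated heart-rate measurements for one patient.
heart_rate = [58, 72, 190, 64]

clipped = clip(heart_rate, 60, 120)  # values outside 60..120 are pulled to the bounds
summary = summarize(clipped)
```

Clipping replaces out-of-range values with the nearest bound rather than dropping them, so the number of measurements stays the same before summarization.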
Quality control¶
Calculates various quality control metrics. |
|
Examines lab measurements for reference ranges and outliers. |
|
Statistical hypothesis test for Missing Completely At Random (MCAR). |
|
Detects biases in the data using feature correlations, standardized mean differences, and feature importances. |
Imputation¶
Replaces all missing values in all columns or a subset of columns specified by the user with the passed replacement value. |
|
Impute missing values in numerical data using mean/median/most frequent imputation. |
|
Imputes missing values in the input AnnData object using K-nearest neighbor imputation. |
|
Impute data using the MissForest strategy. |
|
Impute data using miceforest. |
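The two simplest strategies listed above, replacement with a fixed value and mean imputation, can be sketched with the standard library alone. This is an illustration of the idea, not ehrapy's code; the function names `impute_constant` and `impute_mean` and the glucose values are assumptions made for the example.

```python
import math
from statistics import mean


def impute_constant(column, replacement):
    """Replace every NaN entry with a fixed replacement value."""
    return [replacement if math.isnan(v) else v for v in column]


def impute_mean(column):
    """Replace every NaN entry with the mean of the observed entries."""
    observed = [v for v in column if not math.isnan(v)]
    fill = mean(observed)
    return [fill if math.isnan(v) else v for v in column]


# Hypothetical glucose measurements with one missing value.
glucose = [5.0, float("nan"), 7.0, 6.0]

filled_constant = impute_constant(glucose, 0.0)
filled_mean = impute_mean(glucose)
```

Mean imputation keeps the column average unchanged but shrinks its variance, which is one reason neighbor-based or model-based strategies such as those listed above are often preferred.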
Normalization¶
Apply log normalization. |
|
Apply max-abs normalization. |
|
Apply min-max normalization. |
|
Apply power transformation normalization. |
|
Apply quantile normalization. |
|
Apply robust scaling normalization. |
|
Apply scaling normalization. |
|
Offsets negative values into positive ones with the lowest negative value becoming 0. |
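As a rough illustration of two of the entries above, the following standard-library sketch applies min-max normalization and the negative-value offset to a small list. It is a sketch under stated assumptions rather than ehrapy's implementation: the function names `min_max` and `offset_negative` are hypothetical, and mapping a constant column to zeros is a choice made for this example.

```python
def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range; map it to zeros (an assumption of this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def offset_negative(values):
    """Shift all values so that the lowest negative value becomes 0."""
    lowest = min(values)
    if lowest >= 0:
        return list(values)  # nothing to offset
    return [v - lowest for v in values]


x = [-2.0, 0.0, 2.0, 6.0]

scaled = min_max(x)
shifted = offset_negative(x)
```

The offset preserves the distances between values, whereas min-max normalization also rescales them.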
Dataset Shift Correction¶
Partially overlaps with dataset integration. Note that a simple batch correction method is available via pp.regress_out().
ComBat function for batch effect correction [Johnson07] [Leek12] [Pedersen12]. |
Neighbors¶
Compute a neighborhood graph of observations [McInnes18]. |
Tools¶
Any transformation of the data matrix that is not preprocessing. In contrast to a preprocessing function, a tool usually adds an easily interpretable annotation to the data matrix, which can then be visualized with a corresponding plotting function.
Embeddings¶
Calculates t-SNE [Maaten08] [Amir13] [Pedregosa11]. |
|
Embed the neighborhood graph using UMAP [McInnes18]. |
|
Force-directed graph drawing [Islam11] [Jacomy14] [Chippada18]. |
|
Diffusion Maps [Coifman05] [Haghverdi15] [Wolf18]. |
|
Calculate the density of observations in an embedding (per condition). |
Clustering and trajectory inference¶
Cluster observations into subgroups [Traag18]. |
|
Computes a hierarchical clustering for the given groupby categories. |
|
Infer progression of observations through geodesic distance along the graph [Haghverdi16] [Wolf19]. |
|
Mapping out the coarse-grained connectivity structures of complex manifolds [Wolf19]. |
Feature Ranking¶
Rank features for characterizing groups. |
|
Filters out features based on fold change and the fraction of observations containing the feature within and outside the groupby categories. |
|
Calculate feature importances for predicting a specified feature in adata.var. |
Dataset integration¶
Map labels and embeddings from reference data to new data. |
Natural language processing¶
Annotate the original free text data. |
|
Provide an overview for the annotation results. |
|
Add info extracted from free text as a binary column to obs. |
Survival Analysis¶
Create an Ordinary Least Squares (OLS) Model from a formula and AnnData. |
|
Create a Generalized Linear Model (GLM) from a formula, a distribution, and AnnData. |
|
Fit the Kaplan-Meier estimate for the survival function. |
|
Calculates the p-value for the logrank test comparing the survival functions of two groups. |
|
Calculate the p-value indicating whether a larger GLM, encompassing a smaller GLM's parameters, adds explanatory power. |
|
Fit the Cox proportional hazards model for the survival function. |
|
Fit the Weibull accelerated failure time regression for the survival function. |
|
Fit the log logistic accelerated failure time regression for the survival function. |
|
Employ the Nelson-Aalen estimator to estimate the cumulative hazard function from censored survival data |
|
Employ the Weibull model in univariate survival analysis to understand event occurrence dynamics. |
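For readers unfamiliar with the Kaplan-Meier estimate mentioned above, here is a minimal standard-library sketch of the product-limit formula: at each distinct event time, the running survival probability is multiplied by one minus the ratio of events to subjects still at risk. It is not how ehrapy fits the estimate; the function name `kaplan_meier` and the toy data are assumptions.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per subject.
    events: True if the event was observed, False if the subject was censored.
    """
    survival = 1.0
    curve = []
    event_times = sorted({d for d, e in zip(durations, events) if e})
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        observed = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1 - observed / at_risk
        curve.append((t, survival))
    return curve


# Toy cohort: five subjects, two of them censored (at times 3 and 8).
durations = [2, 3, 3, 5, 8]
events = [True, True, False, True, False]

curve = kaplan_meier(durations, events)
```

For this toy cohort the estimated survival probability drops to roughly 0.8, 0.6 and 0.3 at times 2, 3 and 5. Censored subjects lower the at-risk count for later times without causing a drop themselves.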
Causal Inference¶
Performs causal inference on an AnnData object using the specified causal model and returns a tuple containing the causal estimate and the results of any refutation tests. |
Cohort Tracking¶
Track cohort changes over multiple filtering or processing steps. |
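The idea of tracking a cohort across filtering steps can be sketched as a small log that records the cohort size after each step. This is an illustrative standard-library sketch only; the class name `CohortLog`, its methods and the patient records are hypothetical and do not reflect ehrapy's interface.

```python
class CohortLog:
    """Record the cohort size after each filtering or processing step."""

    def __init__(self, records):
        self.steps = [("initial", len(records))]

    def record(self, label, records):
        """Store the size of the filtered cohort and pass the records through."""
        self.steps.append((label, len(records)))
        return records


# Hypothetical patient records.
patients = [
    {"age": 34, "icu": True},
    {"age": 71, "icu": False},
    {"age": 80, "icu": True},
    {"age": 17, "icu": True},
]

log = CohortLog(patients)
adults = log.record("age >= 18", [p for p in patients if p["age"] >= 18])
icu_adults = log.record("ICU stay", [p for p in adults if p["icu"]])
```

After these two steps, `log.steps` holds one `(label, size)` pair per step, which is the information needed to report how many patients each criterion excluded.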
Plotting¶
The plotting module ehrapy.pl.* largely parallels the tl.* and a few of the pp.* functions.
For most tools and for some preprocessing functions, you will find a plotting function with the same name.
Generic¶
Scatter plot along observations or variables axes. |
|
Heatmap of the feature values. |
|
Makes a dot plot of the count values of var_names. |
|
Plots a filled line plot. |
|
Violin plot. |
|
Stacked violin plots. |
|
Creates a heatmap of the mean count per group of each var_names. |
|
Hierarchically-clustered heatmap. |
|
Plot rankings. |
|
Plots a dendrogram of the categories defined in groupby. |
|
Plot categorical data. |
Quality Control and missing values¶
A matrix visualization of the nullity of the given AnnData object. |
|
A bar chart visualization of the nullity of the given AnnData object. |
|
Presents a seaborn heatmap visualization of nullity correlation in the given AnnData object. |
|
Fits a scipy hierarchical clustering algorithm to the given AnnData object's var and visualizes the results as a scipy dendrogram. |
Classes¶
Please refer to Scanpy’s plotting classes documentation.
Tools¶
Methods that extract and visualize tool-specific annotation in an AnnData object. For any method in module tl, there is a method with the same name in pl.
Scatter plot in PCA coordinates. |
|
Rank features according to contributions to PCs. |
|
Plot the variance ratio. |
|
Plot PCA results. |
Embeddings¶
Scatter plot in tSNE basis. |
|
Scatter plot in UMAP basis. |
|
Scatter plot in Diffusion Map basis. |
|
Scatter plot in graph-drawing basis. |
|
Scatter plot for a user-specified embedding basis (e.g. umap, pca, etc.). |
|
Plot the density of observations in an embedding (per condition). |
Branching trajectories and pseudotime¶
Plot groups and pseudotime. |
|
Heatmap of pseudotime series. |
|
Plot the PAGA graph through thresholding low-connectivity edges. |
|
Feature changes along paths in the abstracted graph. |
|
Scatter and PAGA graph side-by-side. |
Feature Ranking¶
Plot ranking of features. |
|
Plot ranking of features for all tested comparisons as violin plots. |
|
Plot ranking of genes using stacked_violin plot. |
|
Plot ranking of genes using heatmap plot (see |
|
Plot ranking of genes using dotplot plot (see |
|
Plot ranking of genes using matrixplot plot (see |
|
Plot ranking of genes using tracksplot plot (see |
|
Plot features with greatest absolute importances as a barplot. |
Survival Analysis¶
Causal Inference¶
Plot the causal effect estimate. |
AnnData utilities¶
Infer feature types from AnnData object. |
|
Print an overview of the feature types and encoding modes in the AnnData object. |
|
Correct the feature types for a list of features inplace. |
|
Transform a given Pandas DataFrame into an AnnData object. |
|
Transform an AnnData object to a Pandas DataFrame. |
|
Move inplace or copy features from X to obs. |
|
Delete features from obs. |
|
Move features from obs to X inplace. |
|
Return values for observations in adata. |
|
Return values for observations in adata. |
Settings¶
A convenience object for setting some default matplotlib.rcParams and a high-resolution Jupyter display backend for use in notebooks.
An instance of the ScanpyConfig is available as ehrapy.settings and allows configuring ehrapy.
import ehrapy as ep
ep.settings.set_figure_params(dpi=150)
Please refer to the Scanpy settings documentation for configuration options. ehrapy will adapt these in the future and update the documentation.
Dependency Versions¶
ehrapy is complex software with many dependencies. To ensure a consistent runtime environment, you should save the versions of all packages used for an analysis. This helps when diagnosing issues and reproducing results.
Call the function via:
import ehrapy as ep
ep.print_versions()