ehrapy.preprocessing.highly_variable_features

ehrapy.preprocessing.highly_variable_features(adata, layer=None, top_features_percentage=0.2, span=0.3, n_bins=20, subset=False, inplace=True, check_values=True)

Annotate highly variable features.

Expects count data. A normalized variance for each feature is computed. First, the data are standardized (i.e., z-score normalization per feature) with a regularized standard deviation. Next, the normalized variance is computed as the variance of each feature after the transformation. Features are ranked by the normalized variance.
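The steps described above can be illustrated with a small NumPy sketch. It is not ehrapy's implementation: a polynomial fit of log10(variance) on log10(mean) is swapped in for the loess fit, and the function name normalized_variance, its degree parameter, and the simulated counts are all hypothetical.

```python
import numpy as np


def normalized_variance(counts, degree=2):
    """Variance of each feature after z-scoring with a trend-based std.

    Assumption: a polynomial fit of log10(variance) on log10(mean) stands in
    for the loess fit; `degree` is a hypothetical parameter of this sketch.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)

    # Fit the mean-variance trend on features that actually vary
    # (for non-negative counts, var > 0 implies mean > 0, so log10 is safe).
    varies = var > 0
    coeffs = np.polyfit(np.log10(mean[varies]), np.log10(var[varies]), deg=degree)

    # Regularized std: the std the trend predicts for a feature with this mean.
    reg_std = np.ones_like(mean)
    reg_std[varies] = np.sqrt(10 ** np.polyval(coeffs, np.log10(mean[varies])))

    # Standardize each feature, then take the variance of the result.
    z = (counts - mean) / reg_std
    return z.var(axis=0, ddof=1)


# Simulated data: 500 observations, 40 Poisson features with means from 1 to 50.
rng = np.random.default_rng(0)
lams = np.linspace(1, 50, 40)
counts = rng.poisson(lam=lams, size=(500, 40))
# Feature 20 is replaced by an overdispersed feature (mean about 20, variance about 220).
counts[:, 20] = rng.negative_binomial(n=2, p=2 / 22, size=500)

var_norm = normalized_variance(counts)
ranking = np.argsort(-var_norm)  # feature indices, most variable first
```

Features that follow the fitted mean-variance trend end up with a normalized variance near 1, while the overdispersed feature stands out and is ranked first.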

Parameters:
  • adata (AnnData) – The annotated data matrix of shape n_obs × n_vars.

  • layer (str | None, default: None) – If provided, use adata.layers[layer] for expression values instead of adata.X.

  • top_features_percentage (float, default: 0.2) – Fraction of features to flag as highly variable, given as a value between 0 and 1 (the default 0.2 keeps the top 20% of features).

  • span (float | None, default: 0.3) – The fraction of the data used when estimating the variance in the loess model fit.

  • n_bins (int, default: 20) – Number of bins for binning. Normalization is done with respect to each bin. If just a single observation falls into a bin, the normalized dispersion is artificially set to 1. You’ll be informed about this if you set settings.verbosity = 4.

  • subset (bool, default: False) – If True, subset adata in place to the highly variable features; otherwise, only flag the highly variable features.

  • inplace (bool, default: True) – Whether to place calculated metrics in .var or return them.

  • check_values (bool, default: True) – Check whether the counts in the selected layer are integers. If True, a warning is issued when non-integer values are found.

Return type:

DataFrame | None

Returns:

Depending on inplace, returns the calculated metrics (DataFrame) or updates .var with the following fields:

  • highly_variable – boolean indicator of highly-variable features

  • means – means per feature

  • variances – variance per feature

  • variances_norm – normalized variance per feature, averaged in the case of multiple batches

  • highly_variable_rank – rank of the feature according to normalized variance, median rank in the case of multiple batches
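To make the shape of the result concrete, the fields listed above can be mocked up as a pandas DataFrame. This is a sketch under stated assumptions, not the library's code: the helper annotate_top_features is hypothetical, the input values are made up, the fraction is assumed to be converted to a count by truncation, and every feature is assumed to receive a rank (0 being the most variable).

```python
import numpy as np
import pandas as pd


def annotate_top_features(means, variances, variances_norm, top_fraction=0.2):
    """Build a per-feature table and flag the top fraction by normalized variance."""
    variances_norm = np.asarray(variances_norm, dtype=float)
    n_vars = variances_norm.size
    # Assumption: the fraction is turned into a count by truncation.
    n_top = int(top_fraction * n_vars)

    # Rank 0 is the feature with the largest normalized variance.
    order = np.argsort(-variances_norm, kind="stable")
    rank = np.empty(n_vars, dtype=int)
    rank[order] = np.arange(n_vars)

    return pd.DataFrame(
        {
            "highly_variable": rank < n_top,
            "means": means,
            "variances": variances,
            "variances_norm": variances_norm,
            "highly_variable_rank": rank,
        }
    )


# Made-up toy values for ten features.
means = np.array([2.0, 5.0, 1.0, 8.0, 3.0, 4.0, 6.0, 7.0, 9.0, 10.0])
variances = np.array([1.9, 5.6, 3.1, 8.2, 2.5, 4.9, 31.0, 6.8, 9.4, 10.1])
variances_norm = np.array([0.9, 1.1, 3.2, 1.0, 0.8, 1.2, 5.4, 0.95, 1.05, 1.0])

metrics = annotate_top_features(means, variances, variances_norm, top_fraction=0.2)
flagged = list(metrics.index[metrics["highly_variable"]])
```

With ten features and a fraction of 0.2, two features are flagged. In this sketch the table is simply returned, which mirrors the inplace=False case; with inplace=True the documented behavior is to write the same fields to .var instead.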