ehrapy.preprocessing.highly_variable_features(adata, layer=None, top_features_percentage=0.2, span=0.3, n_bins=20, subset=False, inplace=True, check_values=True)

Annotate highly variable features.

Expects count data. A normalized variance for each feature is computed. First, the data are standardized (i.e., z-score normalization per feature) with a regularized standard deviation. Next, the normalized variance is computed as the variance of each feature after the transformation. Features are ranked by the normalized variance.
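The two steps described above can be illustrated with a minimal sketch in plain Python. It is not ehrapy's implementation: the regularized standard deviations are simply assumed to be given (hypothetical values in `reg_std`), whereas the real method estimates them with the loess fit controlled by span, and the function name `normalized_variance` and the toy data are made up for illustration.

```python
from statistics import fmean


def normalized_variance(values, reg_std):
    """Variance of one feature after z-scoring with a regularized standard deviation."""
    n = len(values)
    mean = fmean(values)
    # Standardize with the regularized (not the empirical) standard deviation.
    z = [(v - mean) / reg_std for v in values]
    # Sample variance of the standardized values; their mean is zero by construction.
    return sum(x * x for x in z) / (n - 1)


# Toy count data, one list of observations per feature (hypothetical values).
features = {
    "constant": [1, 1, 1, 1],
    "spiky": [0, 2, 0, 6],
    "mild": [3, 4, 3, 4],
}
# Hypothetical regularized standard deviations, assumed to be known here.
reg_std = {"constant": 1.0, "spiky": 2.0, "mild": 1.0}

norm_var = {name: normalized_variance(vals, reg_std[name]) for name, vals in features.items()}

# Rank features by decreasing normalized variance.
ranking = sorted(norm_var, key=norm_var.get, reverse=True)
print(ranking)
```

In this toy example the feature with the largest spread relative to its regularized standard deviation is ranked first, and the constant feature is ranked last.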

  • adata (AnnData) – The annotated data matrix of shape n_obs × n_vars.

  • layer (str | None) – If provided, use adata.layers[layer] for expression values instead of adata.X. Defaults to None.

  • top_features_percentage (float) – Share of features to keep as highly variable, given as a fraction between 0 and 1 (0.2 corresponds to 20%). Defaults to 0.2.

  • span (float | None) – The fraction of the data used when estimating the variance in the loess model fit. Defaults to 0.3.

  • n_bins (int) – Number of bins for binning. Normalization is done with respect to each bin. If only a single observation falls into a bin, the normalized dispersion is artificially set to 1. You will be informed about this if you set settings.verbosity = 4. Defaults to 20.

  • subset (bool) – If True, subset the data in place to the highly variable features; otherwise, only annotate which features are highly variable. Defaults to False.

  • inplace (bool) – Whether to place calculated metrics in .var or return them. Defaults to True.

  • check_values (bool) – Check whether the counts in the selected layer are integers. If True, a warning is issued when non-integer values are found. Defaults to True.
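Two of the parameters above lend themselves to a short illustration. The sketch below is an assumption about the behavior, not ehrapy's code: it supposes that top_features_percentage is multiplied by the number of features and truncated to an integer, and that the check_values test amounts to verifying that every value is a whole number. The helper names `n_top_features` and `has_integer_counts` are hypothetical.

```python
import warnings


def n_top_features(fraction, n_features):
    """Number of features kept for a given fraction (assumed: truncated product)."""
    return int(fraction * n_features)


def has_integer_counts(values):
    """Return True if every value is a whole number; emit a warning otherwise."""
    ok = all(float(v).is_integer() for v in values)
    if not ok:
        warnings.warn("expected integer counts, found non-integer values", UserWarning)
    return ok


print(n_top_features(0.2, 50))           # 20% of 50 features
print(has_integer_counts([0, 3.0, 7]))   # whole numbers stored as floats still pass
```

Under these assumptions, a data set with 50 features and the default fraction of 0.2 would keep 10 features.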

Return type:

DataFrame | None


Depending on inplace, returns the calculated metrics (DataFrame) or updates .var with the following fields:

  • boolean indicator of highly-variable features

  • means per feature

  • variance per feature

  • normalized variance per feature, averaged in the case of multiple batches

  • rank of the feature according to normalized variance, median rank in the case of multiple batches
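How the boolean indicator and the rank relate to the normalized variance can be sketched as follows. This is a simplified, single-batch illustration in plain Python, not the library's code: the function `annotate`, the dictionary keys, and the input values are hypothetical, and every feature receives a rank here, which may differ from what the library stores.

```python
def annotate(norm_var, n_top):
    """Derive a rank and a boolean indicator from per-feature normalized variances."""
    # Feature indices ordered by decreasing normalized variance.
    order = sorted(range(len(norm_var)), key=lambda i: norm_var[i], reverse=True)
    rank = [0] * len(norm_var)
    for position, feature_index in enumerate(order):
        rank[feature_index] = position
    # A feature is flagged if it is among the n_top best-ranked features.
    flagged = [r < n_top for r in rank]
    return {"rank": rank, "is_highly_variable": flagged}


# Hypothetical normalized variances for four features.
result = annotate([0.0, 2.0, 0.33, 1.5], n_top=2)
print(result["rank"])
print(result["is_highly_variable"])
```

With multiple batches, the text above states that the normalized variance is averaged and the median rank is used; the sketch does not cover that case.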