ehrapy.preprocessing.highly_variable_features
- ehrapy.preprocessing.highly_variable_features(adata, layer=None, top_features_percentage=0.2, span=0.3, n_bins=20, subset=False, inplace=True, check_values=True)
Annotate highly variable features.
Expects count data. A normalized variance for each feature is computed. First, the data are standardized (i.e., z-score normalization per feature) with a regularized standard deviation. Next, the normalized variance is computed as the variance of each feature after the transformation. Features are ranked by the normalized variance.
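The scoring described above can be sketched in plain Python. This is a minimal illustration, not ehrapy's implementation: the helper names `normalized_variance` and `rank_features` are hypothetical, and the regularized standard deviation is passed in by the caller rather than estimated from a loess fit, which is not reproduced here.

```python
from statistics import mean, variance


def normalized_variance(values, regularized_std):
    """Variance of one feature after z-scoring with a regularized standard deviation.

    `regularized_std` is assumed to be supplied by the caller; in the real
    method it would come from a fitted mean-variance trend.
    """
    mu = mean(values)
    standardized = [(v - mu) / regularized_std for v in values]
    return variance(standardized)  # sample variance (n - 1 denominator)


def rank_features(columns, regularized_stds):
    """Score every feature and return the names sorted by descending score."""
    scores = {
        name: normalized_variance(values, regularized_stds[name])
        for name, values in columns.items()
    }
    order = sorted(scores, key=scores.get, reverse=True)
    return order, scores


# Toy count data: one list of observations per feature (made-up values).
columns = {
    "a": [1, 2, 3, 4],
    "b": [0, 0, 10, 10],
    "c": [5, 5, 5, 6],
}
# Hypothetical regularized standard deviations, one per feature.
regularized_stds = {"a": 1.0, "b": 2.0, "c": 1.0}

order, scores = rank_features(columns, regularized_stds)
print(order)
```

A feature whose spread is large relative to its regularized standard deviation gets a high score, so it ranks ahead of features that vary about as much as expected.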
- Parameters:
adata (AnnData) – The annotated data matrix of shape n_obs × n_vars.
layer (str | None) – If provided, use adata.layers[layer] for expression values instead of adata.X. Defaults to None.
top_features_percentage (float) – Percentage of highly variable features to keep, given as a fraction (0.2 means 20%). Defaults to 0.2.
span (float | None) – The fraction of the data used when estimating the variance in the loess model fit. Defaults to 0.3.
n_bins (int) – Number of bins for binning. Normalization is done with respect to each bin. If only a single observation falls into a bin, the normalized dispersion is artificially set to 1. You will be informed about this if you set settings.verbosity = 4. Defaults to 20.
subset (bool) – If True, subset the data in place to the highly variable features; otherwise only flag them. Defaults to False.
inplace (bool) – Whether to place the calculated metrics in .var or return them. Defaults to True.
check_values (bool) – Check whether the counts in the selected layer are integers. If True, a warning is issued when they are not. Defaults to True.
- Return type:
DataFrame | None
- Returns:
Depending on inplace, returns the calculated metrics (DataFrame) or updates .var with the following fields:
- highly_variable : bool
boolean indicator of highly-variable features
- means
means per feature
- variances
variance per feature
- variances_norm
normalized variance per feature, averaged in the case of multiple batches
- highly_variable_rank : float
rank of the feature according to normalized variance, median rank in the case of multiple batches
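To show how the highly_variable and highly_variable_rank fields relate to the normalized variance, here is a simplified sketch in plain Python. It is an illustration under stated assumptions, not the library's code: the helper `annotate_top_features` and the feature names are hypothetical, the number of selected features is assumed to be `int(top_fraction * n_features)`, and the library's exact rank values and tie handling may differ.

```python
def annotate_top_features(variances_norm, top_fraction=0.2):
    """Rank features by normalized variance and flag the top fraction.

    Rank 0 is the most variable feature. The cutoff int(top_fraction * n)
    is an assumption made for this sketch.
    """
    names = sorted(variances_norm, key=variances_norm.get, reverse=True)
    n_top = int(top_fraction * len(names))
    return {
        name: {
            "highly_variable_rank": float(rank),
            "highly_variable": rank < n_top,
        }
        for rank, name in enumerate(names)
    }


# Made-up normalized variances for five hypothetical features.
scores = {"age": 0.4, "bmi": 2.1, "glucose": 3.5, "height": 0.9, "sodium": 1.2}

annotated = annotate_top_features(scores, top_fraction=0.4)
selected = [name for name, fields in annotated.items() if fields["highly_variable"]]
print(selected)
```

With five features and a fraction of 0.4, two features are flagged; the remaining ones keep a rank but are marked as not highly variable in this sketch.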