ehrapy.tools.rank_features_groups

ehrapy.tools.rank_features_groups(adata, groupby, groups='all', reference='rest', n_features=None, rankby_abs=False, pts=False, key_added='rank_features_groups', copy=False, num_cols_method=None, cat_cols_method='g-test', correction_method='benjamini-hochberg', tie_correct=False, layer=None, field_to_rank='layer', columns_to_rank='all', **kwds)

Rank features for characterizing groups.

Expects logarithmized data.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • groupby (str) – The key of the observations grouping to consider.

  • groups (Union[Literal['all'], Iterable[str]]) – Subset of groups, e.g. [‘g1’, ‘g2’, ‘g3’], to which comparison shall be restricted, or ‘all’ (default), for all groups.

  • reference (str) – If ‘rest’, compare each group to the union of the rest of the groups. If a group identifier, compare with respect to this group.

  • n_features (int | None) – The number of features that appear in the returned tables. Defaults to all features.

  • rankby_abs (bool) – Rank features by the absolute value of the score, not by the score. The returned scores are never the absolute values.

  • pts (bool) – Compute the fraction of observations containing the features.

  • key_added (str | None) – The key in adata.uns under which the results are saved.

  • copy (bool) – Whether to return a copy of the AnnData object.

  • num_cols_method (Optional[Literal['logreg', 't-test', 'wilcoxon', 't-test_overestim_var']]) – Statistical method to rank numerical features. The default method is ‘t-test’. ‘t-test_overestim_var’ overestimates the variance of each group, ‘wilcoxon’ uses the Wilcoxon rank-sum test, and ‘logreg’ uses logistic regression.

  • cat_cols_method (Literal['chi-square', 'g-test', 'freeman-tukey', 'mod-log-likelihood', 'neyman', 'cressie-read']) – Statistical method to calculate differences between categorical features. The default method is ‘g-test’. ‘chi-square’ is a goodness-of-fit test for categorical data, ‘freeman-tukey’ compares frequency distributions, ‘mod-log-likelihood’ uses maximum likelihood estimation, ‘neyman’ tests hypotheses using asymptotic theory, and ‘cressie-read’ is a generalized likelihood test.

  • correction_method (Literal['benjamini-hochberg', 'bonferroni']) – Method for p-value correction. Used only for statistical tests (e.g. it does not apply when num_cols_method is ‘logreg’).

  • tie_correct (bool) – Use tie correction for ‘wilcoxon’ scores. Used only for ‘wilcoxon’.

  • layer (str | None) – Key from adata.layers whose value will be used to perform tests on.

  • field_to_rank (Union[Literal['layer'], Literal['obs'], Literal['layer_and_obs']]) – Set to layer to rank variables in adata.X or adata.layers[layer] (default), obs to rank adata.obs, or layer_and_obs to rank both. Layer needs to be None if this is not ‘layer’.

  • columns_to_rank (Union[dict[str, Iterable[str]], Literal['all']]) – Subset of columns to rank. If ‘all’, all columns are used. If a dictionary, it must have keys ‘var_names’ and/or ‘obs_names’ and values must be iterables of strings such as {‘var_names’: [‘glucose’], ‘obs_names’: [‘age’, ‘height’]}.

  • **kwds – Passed to the test methods. Currently this affects only parameters that are passed to sklearn.linear_model.LogisticRegression. For instance, you can pass penalty=’l1’ to try to come up with a minimal set of features that are good predictors (a sparse solution, meaning few non-zero fitted coefficients).
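
To make the cat_cols_method and correction_method options above more concrete, here is a minimal pure-Python sketch of a G-test on 2x2 contingency tables followed by both p-value corrections. It is not ehrapy's implementation: the helper names, the counts, and the ‘feature_c’ entry are hypothetical, no continuity correction is applied, and the p-value shortcut only holds for one degree of freedom.

```python
import math


def g_statistic(table):
    """G statistic (log-likelihood ratio) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if observed > 0:  # a zero count contributes nothing to the sum
                g += observed * math.log(observed / expected)
    return 2.0 * g


def p_value_one_dof(statistic):
    """Upper tail of a chi-square distribution with 1 degree of freedom (2x2 tables only)."""
    return math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))


def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical counts: rows are (group, rest), columns are (flag set, flag not set).
tables = {
    "copd_flg": [[30, 70], [10, 90]],
    "renal_flg": [[20, 80], [20, 80]],
    "feature_c": [[45, 55], [30, 70]],
}
pvals = [p_value_one_dof(g_statistic(t)) for t in tables.values()]
for name, p_bh, p_bonf in zip(tables, benjamini_hochberg(pvals), bonferroni(pvals)):
    print(f"{name}: BH={p_bh:.4f}, Bonferroni={p_bonf:.4f}")
```

Bonferroni controls the family-wise error rate and is the more conservative of the two, while Benjamini-Hochberg controls the false discovery rate, which is usually the more practical choice when many features are ranked at once.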

Return type:

None

Returns:

names structured np.ndarray (.uns[‘rank_features_groups’])

Structured array to be indexed by group id storing the feature names. Ordered according to scores.

scores structured np.ndarray (.uns[‘rank_features_groups’])

Structured array to be indexed by group id storing the z-score underlying the computation of a p-value for each feature for each group. Ordered according to scores.

logfoldchanges structured np.ndarray (.uns[‘rank_features_groups’])

Structured array to be indexed by group id storing the log2 fold change for each feature for each group. Ordered according to scores. Only provided if the method is ‘t-test’-like. Note: this is an approximation calculated from mean-log values.

pvals structured np.ndarray (.uns[‘rank_features_groups’])

p-values.

pvals_adj structured np.ndarray (.uns[‘rank_features_groups’])

Corrected p-values.

pts pandas.DataFrame (.uns[‘rank_features_groups’])

Fraction of observations containing the features for each group.

pts_rest pandas.DataFrame (.uns[‘rank_features_groups’])

Only if reference is set to ‘rest’. Fraction of observations from the union of the rest of each group containing the features.
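
The note that logfoldchanges is an approximation calculated from mean-log values can be illustrated with a short sketch. It assumes the data were transformed with log1p and that the means are back-transformed with expm1 before taking the ratio, which is the scanpy-style convention this docstring appears to inherit; the function name and the small eps constant are assumptions, not a confirmed part of ehrapy.

```python
import math


def approx_log2_fold_change(mean_log_group, mean_log_reference, eps=1e-9):
    """Approximate log2 fold change from the means of log1p-transformed values."""
    # The back-transform is applied to the *means* of the logged values,
    # not to the individual values, hence the approximation.
    group = math.expm1(mean_log_group) + eps
    reference = math.expm1(mean_log_reference) + eps
    return math.log2(group / reference)


# Hypothetical raw values for one feature in a group and in the reference.
group_values = [1.0, 15.0]
reference_values = [2.0, 2.0]

mean_log_group = sum(math.log1p(v) for v in group_values) / len(group_values)
mean_log_reference = sum(math.log1p(v) for v in reference_values) / len(reference_values)

# Back-transforming a mean of logs does not recover the arithmetic mean.
back_transformed = math.expm1(mean_log_group)
arithmetic_mean = sum(group_values) / len(group_values)

print("approximate log2 fold change:", approx_log2_fold_change(mean_log_group, mean_log_reference))
print("back-transformed mean:", back_transformed, "arithmetic mean:", arithmetic_mean)
```

Because the mean of logged values is never larger than the log of the mean, the reported fold change can differ noticeably from one computed on raw means when values within a group are spread widely.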

Examples:
>>> import ehrapy as ep
>>> adata = ep.dt.mimic_2(encoded=False)
>>> # want to move some metadata to the obs field
>>> ep.anndata.move_to_obs(adata, to_obs=["service_unit", "service_num", "age", "mort_day_censored"])
>>> ep.tl.rank_features_groups(adata, "service_unit")
>>> ep.pl.rank_features_groups(adata)

>>> import ehrapy as ep
>>> adata = ep.dt.mimic_2(encoded=False)
>>> # want to move some metadata to the obs field
>>> ep.anndata.move_to_obs(adata, to_obs=["service_unit", "service_num", "age", "mort_day_censored"])
>>> ep.tl.rank_features_groups(
...     adata, "service_unit", field_to_rank="obs", columns_to_rank={"obs_names": ["age", "mort_day_censored"]}
... )
>>> ep.pl.rank_features_groups(adata)

>>> import ehrapy as ep
>>> adata = ep.dt.mimic_2(encoded=False)
>>> # want to move some metadata to the obs field
>>> ep.anndata.move_to_obs(adata, to_obs=["service_unit", "service_num", "age", "mort_day_censored"])
>>> ep.tl.rank_features_groups(
...     adata,
...     "service_unit",
...     field_to_rank="layer_and_obs",
...     columns_to_rank={"var_names": ["copd_flg", "renal_flg"], "obs_names": ["age", "mort_day_censored"]},
... )
>>> ep.pl.rank_features_groups(adata)