ehrapy.tools.rank_features_groups

ehrapy.tools.rank_features_groups(edata, groupby, *, groups='all', reference='rest', n_features=None, rankby_abs=False, pts=False, key_added='rank_features_groups', copy=False, num_cols_method=None, cat_cols_method='g-test', correction_method='benjamini-hochberg', tie_correct=False, layer=None, field_to_rank='layer', columns_to_rank='all', **kwds)

Rank features for characterizing groups.

Parameters:
  • edata (EHRData | AnnData) – Central data object.

  • groupby (str) – The key of the observations grouping to consider.

  • groups (Literal['all'] | Iterable[str], default: 'all') – Subset of groups, e.g. [‘g1’, ‘g2’, ‘g3’], to which comparison shall be restricted, or ‘all’ (default), for all groups.

  • reference (str, default: 'rest') – If ‘rest’, compare each group to the union of the remaining groups. If a group identifier, compare with respect to this group.

  • n_features (int | None, default: None) – The number of features that appear in the returned tables. Defaults to all features if None.

  • rankby_abs (bool, default: False) – Rank features by the absolute value of the score, not by the score. The returned scores are never the absolute values.

  • pts (bool, default: False) – Compute the fraction of observations containing the features.

  • key_added (str | None, default: 'rank_features_groups') – The key in edata.uns under which the results are saved.

  • copy (bool, default: False) – Whether to return a copy of the data object.

  • num_cols_method (Literal['logreg', 't-test', 'wilcoxon', 't-test_overestim_var'] | None, default: None) – Statistical method to rank numerical features. The default method is ‘t-test’; ‘t-test_overestim_var’ overestimates the variance of each group, ‘wilcoxon’ uses the Wilcoxon rank-sum test, and ‘logreg’ uses logistic regression.

  • cat_cols_method (Literal['chi-square', 'g-test', 'freeman-tukey', 'mod-log-likelihood', 'neyman', 'cressie-read'], default: 'g-test') – Statistical method to calculate differences between categorical features. The default method is ‘g-test’. ‘chi-square’ is a goodness-of-fit test for categorical data, ‘freeman-tukey’ compares frequency distributions, ‘mod-log-likelihood’ uses maximum likelihood estimation, ‘neyman’ tests hypotheses using asymptotic theory, and ‘cressie-read’ is a generalized likelihood test.

  • correction_method (Literal['benjamini-hochberg', 'bonferroni'], default: 'benjamini-hochberg') – p-value correction method. Used only for statistical tests (e.g. it is not applied when num_cols_method is ‘logreg’).

  • tie_correct (bool, default: False) – Use tie correction for ‘wilcoxon’ scores. Used only for ‘wilcoxon’.

  • layer (str | None, default: None) – Key from edata.layers whose value will be used to perform tests on.

  • field_to_rank (Literal['layer'] | Literal['obs'] | Literal['layer_and_obs'], default: 'layer') – Set to ‘layer’ to rank variables in edata.X or edata.layers[layer] (default), ‘obs’ to rank edata.obs, or ‘layer_and_obs’ to rank both. The layer parameter needs to be None if this is not ‘layer’.

  • columns_to_rank (dict[str, Iterable[str]] | Literal['all'], default: 'all') – Subset of columns to rank. If ‘all’, all columns are used. If a dictionary, it must have keys ‘var_names’ and/or ‘obs_names’ and values must be iterables of strings such as {‘var_names’: [‘glucose’], ‘obs_names’: [‘age’, ‘height’]}.

  • **kwds – Passed to the test methods. Currently, this affects only parameters that are passed to sklearn.linear_model.LogisticRegression. For instance, you can pass penalty='l1' to try to come up with a minimal set of features that are good predictors (a sparse solution, meaning few non-zero fitted coefficients).
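
To make the two correction_method options concrete, the following is a minimal pure-Python sketch of the textbook Bonferroni and Benjamini-Hochberg adjustments. It is not ehrapy's implementation, and the function names bonferroni and benjamini_hochberg are hypothetical helpers introduced only for this illustration.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]


def benjamini_hochberg(pvals):
    """Textbook Benjamini-Hochberg step-up adjustment (illustrative only).

    The p-value with ascending rank i (1-based) is scaled by n / i, and
    monotonicity is enforced by taking a running minimum from the largest
    p-value down to the smallest.
    """
    n = len(pvals)
    # Indices of the p-values in ascending order.
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # made-up p-values for four features
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; Benjamini-Hochberg controls the false discovery rate and typically retains more features when many are tested.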

Return type:

None

Returns:

None

The results are stored in edata.uns under key_added (by default, edata.uns['rank_features_groups']) and include:

  • names (numpy.ndarray): Structured array to be indexed by group id storing the feature names. Ordered according to scores.

  • scores (numpy.ndarray): Structured array to be indexed by group id storing the z-score underlying the computation of a p-value for each feature for each group. Ordered according to scores.

  • logfoldchanges (numpy.ndarray): Structured array to be indexed by group id storing the log2 fold change for each feature for each group. Ordered according to scores. Only provided if the method is t-test-like. Note: this is an approximation calculated from mean-log values.

  • pvals (numpy.ndarray): p-values.

  • pvals_adj (numpy.ndarray): Corrected p-values.

  • pts (pandas.DataFrame): Only if pts is True. Fraction of observations containing the features for each group.

  • pts_rest (pandas.DataFrame): Only if reference is set to ‘rest’. Fraction of observations from the union of the rest of each group containing the features.
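
As a rough illustration of what "structured array to be indexed by group id" means in practice, the sketch below builds a small stand-in for the stored result with NumPy and reads the top-ranked features for one group. The group ids, feature names, scores, and the helper top_features are all made up for this example; with real data, the dictionary would come from edata.uns after calling the function.

```python
import numpy as np

# Hypothetical group ids and values, mimicking the documented layout:
# one field per group, rows ordered by descending score within each group.
groups = ["FICU", "MICU", "SICU"]
names = np.array(
    [("age", "sofa", "age"), ("sofa", "age", "bmi"), ("bmi", "bmi", "sofa")],
    dtype=[(g, "U16") for g in groups],
)
scores = np.array(
    [(3.2, 2.5, 4.1), (1.1, -0.4, 0.9), (-2.0, -1.7, -3.3)],
    dtype=[(g, "f8") for g in groups],
)
# Stand-in for edata.uns["rank_features_groups"].
result = {"names": names, "scores": scores}


def top_features(result, group, n=2):
    """Return the first n (feature name, score) pairs for one group."""
    group_names = result["names"][group][:n].tolist()
    group_scores = result["scores"][group][:n].tolist()
    return list(zip(group_names, group_scores))


print(top_features(result, "MICU"))
```

Indexing a structured array with a field name returns a plain one-dimensional array for that group, so the usual slicing applies.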

Examples

>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.mimic_2()
>>> # Move some metadata to the obs field
>>> ep.anndata.move_to_obs(edata, to_obs=["service_unit", "service_num", "age", "mort_day_censored"])
>>> ep.tl.rank_features_groups(edata, "service_unit")
>>> ep.pl.rank_features_groups(edata)

>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.mimic_2()
>>> # Move some metadata to the obs field
>>> ep.anndata.move_to_obs(edata, to_obs=["service_unit", "service_num", "age", "mort_day_censored"])
>>> ep.tl.rank_features_groups(
...     edata, "service_unit", field_to_rank="obs", columns_to_rank={"obs_names": ["age", "mort_day_censored"]}
... )
>>> ep.pl.rank_features_groups(edata)

>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.mimic_2()
>>> # Move some metadata to the obs field
>>> ep.anndata.move_to_obs(edata, to_obs=["service_unit", "service_num", "age", "mort_day_censored"])
>>> ep.tl.rank_features_groups(
...     edata,
...     "service_unit",
...     field_to_rank="layer_and_obs",
...     columns_to_rank={"var_names": ["copd_flg", "renal_flg"], "obs_names": ["age", "mort_day_censored"]},
... )
>>> ep.pl.rank_features_groups(edata)