ehrapy.tools.rank_features_groups
- ehrapy.tools.rank_features_groups(adata, groupby, groups='all', reference='rest', n_features=None, rankby_abs=False, pts=False, key_added='rank_features_groups', copy=False, num_cols_method=None, cat_cols_method='g-test', correction_method='benjamini-hochberg', tie_correct=False, layer=None, field_to_rank='layer', columns_to_rank='all', **kwds)
Rank features for characterizing groups.
- Parameters:
adata (AnnData) – Annotated data matrix.
groupby (str) – The key of the observations grouping to consider.
groups (Union[Literal['all'], Iterable[str]], default: 'all') – Subset of groups, e.g. [‘g1’, ‘g2’, ‘g3’], to which the comparison is restricted, or ‘all’ (default) for all groups.
reference (str, default: 'rest') – If ‘rest’, compare each group to the union of the remaining groups. If a group identifier, compare with respect to this group.
n_features (int | None, default: None) – The number of features that appear in the returned tables. Defaults to all features if None.
rankby_abs (bool, default: False) – Rank features by the absolute value of the score, not by the score. The returned scores are never the absolute values.
pts (bool, default: False) – Compute the fraction of observations containing the features.
key_added (str | None, default: 'rank_features_groups') – The key in adata.uns under which the results are saved.
copy (bool, default: False) – Whether to return a copy of the AnnData object.
num_cols_method (Optional[Literal['logreg', 't-test', 'wilcoxon', 't-test_overestim_var']], default: None) – Statistical method to rank numerical features. The default method is ‘t-test’; ‘t-test_overestim_var’ overestimates the variance of each group, ‘wilcoxon’ uses the Wilcoxon rank-sum test, and ‘logreg’ uses logistic regression.
cat_cols_method (Literal['chi-square', 'g-test', 'freeman-tukey', 'mod-log-likelihood', 'neyman', 'cressie-read'], default: 'g-test') – Statistical method to calculate differences between categorical features. The default method is ‘g-test’. ‘chi-square’ is a goodness-of-fit test for categorical data, ‘freeman-tukey’ compares frequency distributions, ‘mod-log-likelihood’ is based on maximum likelihood estimation, ‘neyman’ tests hypotheses using asymptotic theory, and ‘cressie-read’ is a generalized likelihood test.
correction_method (Literal['benjamini-hochberg', 'bonferroni'], default: 'benjamini-hochberg') – p-value correction method. Used only for statistical tests (e.g. it is not applied when num_cols_method is ‘logreg’).
tie_correct (bool, default: False) – Use tie correction for ‘wilcoxon’ scores. Used only for ‘wilcoxon’.
layer (str | None, default: None) – Key from adata.layers whose value will be used to perform tests on.
field_to_rank (Union[Literal['layer'], Literal['obs'], Literal['layer_and_obs']], default: 'layer') – Set to ‘layer’ to rank variables in adata.X or adata.layers[layer] (default), ‘obs’ to rank adata.obs, or ‘layer_and_obs’ to rank both. The layer argument needs to be None if this is not ‘layer’.
columns_to_rank (Union[dict[str, Iterable[str]], Literal['all']], default: 'all') – Subset of columns to rank. If ‘all’, all columns are used. If a dictionary, it must have keys ‘var_names’ and/or ‘obs_names’, and its values must be iterables of strings, such as {‘var_names’: [‘glucose’], ‘obs_names’: [‘age’, ‘height’]}.
**kwds – Passed to the test methods. Currently, this affects only parameters that are passed to sklearn.linear_model.LogisticRegression. For instance, you can pass penalty='l1' to try to come up with a minimal set of features that are good predictors (a sparse solution, meaning few non-zero fitted coefficients).
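To make the two correction_method options concrete, the following is a minimal sketch of the textbook Bonferroni and Benjamini-Hochberg adjustments in plain Python. It is not ehrapy's internal code; the function names bonferroni and benjamini_hochberg and the example p-values are assumptions chosen for illustration.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(p * m, 1.0) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-values get smaller.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four features.
pvals = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; Benjamini-Hochberg controls the false discovery rate, which is usually the more practical choice when many features are tested at once.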
- Return type:
- Returns:
- names structured np.ndarray (.uns[‘rank_features_groups’])
Structured array to be indexed by group id storing the feature names. Ordered according to scores.
- scores structured np.ndarray (.uns[‘rank_features_groups’])
Structured array to be indexed by group id storing the z-score underlying the computation of a p-value for each feature for each group. Ordered according to scores.
- logfoldchanges structured np.ndarray (.uns[‘rank_features_groups’])
Structured array to be indexed by group id storing the log2 fold change for each feature for each group. Ordered according to scores. Only provided if the method is t-test-like. Note: this is an approximation calculated from mean-log values.
- pvals structured np.ndarray (.uns[‘rank_features_groups’])
p-values.
- pvals_adj structured np.ndarray (.uns[‘rank_features_groups’])
Corrected p-values.
- pts pandas.DataFrame (.uns[‘rank_features_groups’])
Fraction of observations containing the features for each group.
- pts_rest pandas.DataFrame (.uns[‘rank_features_groups’])
Only if reference is set to ‘rest’. Fraction of observations from the union of the rest of each group containing the features.
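The interaction between the score ordering of the returned arrays and the rankby_abs and n_features parameters can be sketched in plain Python. This is only an illustration of the documented behaviour for a single group, not ehrapy's implementation; the helper rank_features and the score values are hypothetical.

```python
def rank_features(scores, rankby_abs=False, n_features=None):
    """Return (name, score) pairs ordered from highest to lowest rank.

    `scores` maps a feature name to its signed test score for one group.
    """
    # Sort by |score| when rankby_abs is set, otherwise by the signed score.
    key = (lambda item: abs(item[1])) if rankby_abs else (lambda item: item[1])
    ranked = sorted(scores.items(), key=key, reverse=True)
    # n_features=None keeps every feature, mirroring the documented default.
    return ranked if n_features is None else ranked[:n_features]


# Hypothetical signed scores for one group.
scores = {"age": 2.1, "glucose": -3.4, "height": 0.3}

print(rank_features(scores))
# [('age', 2.1), ('height', 0.3), ('glucose', -3.4)]

print(rank_features(scores, rankby_abs=True, n_features=2))
# [('glucose', -3.4), ('age', 2.1)]
```

Note that with rankby_abs=True only the ordering changes: the stored score for glucose stays negative, in line with the statement that the returned scores are never the absolute values.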
- Examples:
>>> import ehrapy as ep
>>> adata = ep.dt.mimic_2(encoded=False)
>>> # want to move some metadata to the obs field
>>> ep.anndata.move_to_obs(adata, to_obs=["service_unit", "service_num", "age", "mort_day_censored"])
>>> ep.tl.rank_features_groups(adata, "service_unit")
>>> ep.pl.rank_features_groups(adata)

>>> import ehrapy as ep
>>> adata = ep.dt.mimic_2(encoded=False)
>>> # want to move some metadata to the obs field
>>> ep.anndata.move_to_obs(adata, to_obs=["service_unit", "service_num", "age", "mort_day_censored"])
>>> ep.tl.rank_features_groups(
...     adata, "service_unit", field_to_rank="obs", columns_to_rank={"obs_names": ["age", "mort_day_censored"]}
... )
>>> ep.pl.rank_features_groups(adata)

>>> import ehrapy as ep
>>> adata = ep.dt.mimic_2(encoded=False)
>>> # want to move some metadata to the obs field
>>> ep.anndata.move_to_obs(adata, to_obs=["service_unit", "service_num", "age", "mort_day_censored"])
>>> ep.tl.rank_features_groups(
...     adata,
...     "service_unit",
...     field_to_rank="layer_and_obs",
...     columns_to_rank={"var_names": ["copd_flg", "renal_flg"], "obs_names": ["age", "mort_day_censored"]},
... )
>>> ep.pl.rank_features_groups(adata)
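The columns_to_rank dictionaries used in these examples follow the shape described under Parameters: keys ‘var_names’ and/or ‘obs_names’, each mapping to an iterable of strings. A small pre-flight check along those lines might look like the sketch below. The helper check_columns_to_rank is hypothetical and not part of ehrapy, and rejecting a bare string as a value is a design choice of this sketch rather than documented behaviour.

```python
ALLOWED_KEYS = {"var_names", "obs_names"}


def check_columns_to_rank(columns_to_rank):
    """Raise ValueError if the argument does not match the documented shape."""
    if columns_to_rank == "all":
        return
    if not isinstance(columns_to_rank, dict) or not columns_to_rank:
        raise ValueError("columns_to_rank must be 'all' or a non-empty dict")
    unknown = set(columns_to_rank) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    for key, names in columns_to_rank.items():
        # A bare string is technically iterable, but passing one is
        # almost certainly a mistake, so this sketch rejects it.
        if isinstance(names, str) or not all(isinstance(n, str) for n in names):
            raise ValueError(f"{key} must be an iterable of strings")


# Both documented forms pass silently.
check_columns_to_rank("all")
check_columns_to_rank({"var_names": ["copd_flg", "renal_flg"], "obs_names": ["age"]})
```

Running such a check before calling the ranking function gives an early, readable error for a misspelled key such as "obs_name".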