ehrapy.tools.rank_features_groups
- ehrapy.tools.rank_features_groups(edata, groupby, *, groups='all', reference='rest', n_features=None, rankby_abs=False, pts=False, key_added='rank_features_groups', copy=False, num_cols_method=None, cat_cols_method='g-test', correction_method='benjamini-hochberg', tie_correct=False, layer=None, field_to_rank='layer', columns_to_rank='all', **kwds)
Rank features for characterizing groups.
- Parameters:
groupby (str) – The key of the observations grouping to consider.
groups (Literal['all'] | Iterable[str], default: 'all') – Subset of groups, e.g. ['g1', 'g2', 'g3'], to which the comparison shall be restricted, or 'all' (default) for all groups.
reference (str, default: 'rest') – If 'rest', compare each group to the union of the remaining groups. If a group identifier, compare with respect to this group.
n_features (int | None, default: None) – The number of features that appear in the returned tables. Defaults to all features if None.
rankby_abs (bool, default: False) – Rank features by the absolute value of the score, not by the score. The returned scores are never the absolute values.
pts (bool, default: False) – Compute the fraction of observations containing the features.
key_added (str | None, default: 'rank_features_groups') – The key in edata.uns under which the results are saved.
copy (bool, default: False) – Whether to return a copy of the data object.
num_cols_method (Literal['logreg', 't-test', 'wilcoxon', 't-test_overestim_var'] | None, default: None) – Statistical method to rank numerical features. The default method is 't-test'. 't-test_overestim_var' overestimates the variance of each group, 'wilcoxon' uses the Wilcoxon rank-sum test, and 'logreg' uses logistic regression.
cat_cols_method (Literal['chi-square', 'g-test', 'freeman-tukey', 'mod-log-likelihood', 'neyman', 'cressie-read'], default: 'g-test') – Statistical method to calculate differences between categorical features. The default method is 'g-test'. 'chi-square' is a goodness-of-fit test for categorical data, 'freeman-tukey' compares frequency distributions, 'mod-log-likelihood' uses maximum likelihood estimation, 'neyman' tests hypotheses using asymptotic theory, and 'cressie-read' is a generalized likelihood test.
correction_method (Literal['benjamini-hochberg', 'bonferroni'], default: 'benjamini-hochberg') – p-value correction method. Used only for statistical tests (e.g. it has no effect when num_cols_method is 'logreg').
tie_correct (bool, default: False) – Use tie correction for 'wilcoxon' scores. Used only for 'wilcoxon'.
layer (str | None, default: None) – Key from edata.layers whose value will be used to perform tests on.
field_to_rank (Literal['layer'] | Literal['obs'] | Literal['layer_and_obs'], default: 'layer') – Set to 'layer' to rank variables in edata.X or edata.layers[layer] (default), 'obs' to rank edata.obs, or 'layer_and_obs' to rank both. layer needs to be None if this is not 'layer'.
columns_to_rank (dict[str, Iterable[str]] | Literal['all'], default: 'all') – Subset of columns to rank. If 'all', all columns are used. If a dictionary, it must have the keys 'var_names' and/or 'obs_names', and the values must be iterables of strings, such as {'var_names': ['glucose'], 'obs_names': ['age', 'height']}.
**kwds – Passed to the test methods. Currently, this affects only parameters that are passed to sklearn.linear_model.LogisticRegression. For instance, you can pass penalty='l1' to try to come up with a minimal set of features that are good predictors (a sparse solution, meaning few non-zero fitted coefficients).
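The cat_cols_method option names match the members of the Cressie-Read power-divergence family. The sketch below is an illustration under that assumption, not ehrapy's implementation: the LAMBDAS mapping and the power_divergence_stat helper are hypothetical names introduced here, and the lambda values are the standard ones for this family. It shows how one formula yields each statistic for a single row of observed and expected counts.

```python
import math

# Lambda values of the Cressie-Read power-divergence family. The keys mirror
# the cat_cols_method option names; the mapping is an assumption based on the
# standard family, not taken from ehrapy's source.
LAMBDAS = {
    "chi-square": 1.0,
    "g-test": 0.0,
    "freeman-tukey": -0.5,
    "mod-log-likelihood": -1.0,
    "neyman": -2.0,
    "cressie-read": 2.0 / 3.0,
}


def power_divergence_stat(observed, expected, method="g-test"):
    """Power-divergence statistic for strictly positive counts (hypothetical helper)."""
    lam = LAMBDAS[method]
    pairs = list(zip(observed, expected))
    if lam == 0.0:
        # Limit lambda -> 0: the G-test (log-likelihood ratio) statistic.
        return 2.0 * sum(o * math.log(o / e) for o, e in pairs)
    if lam == -1.0:
        # Limit lambda -> -1: the modified log-likelihood statistic.
        return 2.0 * sum(e * math.log(e / o) for o, e in pairs)
    total = sum(o * ((o / e) ** lam - 1.0) for o, e in pairs)
    return 2.0 * total / (lam * (lam + 1.0))


# Toy counts with equal totals, as the family assumes.
observed = [10, 20, 30]
expected = [20, 20, 20]
stats = {name: power_divergence_stat(observed, expected, name) for name in LAMBDAS}
```

With lambda = 1 the formula reduces to Pearson's chi-square, sum((O - E)^2 / E), and with lambda = -2 to Neyman's variant, sum((O - E)^2 / O), so the options differ only in how strongly they weight deviations in sparse cells.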
- Return type:
- Returns:
None
The results are stored in edata.uns['rank_features_groups'] and include:
names (numpy.ndarray): Structured array to be indexed by group id storing the feature names. Ordered according to scores.
scores (numpy.ndarray): Structured array to be indexed by group id storing the z-score underlying the computation of a p-value for each feature for each group. Ordered according to scores.
logfoldchanges (numpy.ndarray): Structured array to be indexed by group id storing the log2 fold change for each feature for each group. Ordered according to scores. Only provided if the method is 't-test'-like. Note: this is an approximation calculated from mean-log values.
pvals (numpy.ndarray): p-values.
pvals_adj (numpy.ndarray): Corrected p-values.
pts (pandas.DataFrame): Fraction of observations containing the features for each group.
pts_rest (pandas.DataFrame): Only if reference is set to 'rest'. Fraction of observations from the union of the rest of each group containing the features.
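To make the relationship between pvals and pvals_adj concrete, here is a minimal pure-Python sketch of the two correction_method options. It uses the textbook definitions of the Bonferroni and Benjamini-Hochberg adjustments; the function names are hypothetical, and ehrapy's internal implementation may differ in detail.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1 (textbook form)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (textbook step-up form)."""
    m = len(pvals)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.20]
bh = benjamini_hochberg(pvals)  # controls the false discovery rate
bf = bonferroni(pvals)          # controls the family-wise error rate
```

Bonferroni is the more conservative of the two: every adjusted value it returns is at least as large as the corresponding Benjamini-Hochberg value.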
Examples
>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.mimic_2()
>>> # want to move some metadata to the obs field
>>> ed.move_to_obs(edata, ["service_unit", "service_num", "age", "mort_day_censored"])
>>> ep.tl.rank_features_groups(edata, "service_unit")
>>> ep.pl.rank_features_groups(edata)

>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.mimic_2()
>>> # want to move some metadata to the obs field
>>> ed.move_to_obs(edata, ["service_unit", "service_num", "age", "mort_day_censored"])
>>> ep.tl.rank_features_groups(
...     edata, "service_unit", field_to_rank="obs", columns_to_rank={"obs_names": ["age", "mort_day_censored"]}
... )
>>> ep.pl.rank_features_groups(edata)

>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.mimic_2()
>>> # want to move some metadata to the obs field
>>> ed.move_to_obs(edata, ["service_unit", "service_num", "age", "mort_day_censored"])
>>> ep.tl.rank_features_groups(
...     edata,
...     "service_unit",
...     field_to_rank="layer_and_obs",
...     columns_to_rank={"var_names": ["copd_flg", "renal_flg"], "obs_names": ["age", "mort_day_censored"]},
... )
>>> ep.pl.rank_features_groups(edata)