ehrapy.preprocessing.balanced_sample

ehrapy.preprocessing.balanced_sample(adata, *, key, random_state=0, method='RandomUnderSampler', sampler_kwargs=None, copy=False)

Balancing groups in the dataset.

Balances the groups in the dataset according to the group membership stored in .obs[key], using the imbalanced-learn package. Currently supports RandomUnderSampler and RandomOverSampler.

Note that RandomOverSampler only replicates observations of the minority groups, which distorts several downstream analyses, most prominently neighborhood calculations and any analysis that depends on them. By default, RandomUnderSampler undersamples the majority groups without replacement, so it does not cause this issue of replicated observations.
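The difference between the two strategies can be illustrated with a minimal, standard-library-only sketch. It is not the imbalanced-learn or ehrapy implementation: the helper `balance_indices` and the toy `labels` list are hypothetical names introduced here for illustration, and the sketch only assumes the default behaviour described above (undersampling without replacement, oversampling by repeating existing rows).

```python
import random
from collections import Counter, defaultdict


def balance_indices(labels, method="under", seed=0):
    """Return row indices that give every group the same number of rows.

    Hypothetical helper for illustration only; not part of ehrapy.
    """
    rng = random.Random(seed)

    # Collect the row indices belonging to each group.
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    sizes = [len(members) for members in groups.values()]
    picked = []
    for members in groups.values():
        if method == "under":
            # Shrink every group to the smallest group size, without replacement.
            picked.extend(rng.sample(members, min(sizes)))
        else:
            # Keep all rows and pad smaller groups with randomly repeated rows.
            extra = max(sizes) - len(members)
            picked.extend(members + rng.choices(members, k=extra))
    return sorted(picked)


# Toy group labels: 6 rows in "a", 3 in "b", 2 in "c".
labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
under = balance_indices(labels, method="under")
over = balance_indices(labels, method="over")

print(Counter(labels[i] for i in under))  # every group has 2 rows
print(Counter(labels[i] for i in over))   # every group has 6 rows
print(len(set(under)) == len(under))      # True: no row appears twice
print(len(set(over)) < len(over))         # True: some rows are repeated
```

The repeated rows in the oversampled result are exact copies of existing observations, so any distance-based step would see them as zero-distance neighbors of their originals. This is the distortion the note above warns about.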

Parameters:
  • adata (AnnData) – The annotated data matrix of shape n_obs × n_vars.

  • key (str) – The key in adata.obs that contains the group information.

  • random_state (int) – Random seed. Defaults to 0.

  • method (Literal['RandomUnderSampler', 'RandomOverSampler']) – The method to use for balancing. Defaults to “RandomUnderSampler”.

  • sampler_kwargs (dict) – Keyword arguments for the sampler, see the imbalanced-learn documentation for options. Defaults to None.

  • copy (bool) – If True, return a copy of the balanced data. Defaults to False.

Return type:

AnnData

Returns:

A new AnnData object, with the balanced groups.

Examples

>>> import ehrapy as ep
>>> adata = ep.data.diabetes_130_fairlearn(columns_obs_only=["age"])
>>> adata.obs.age.value_counts()
age
'Over 60 years'          68541
'30-60 years'            30716
'30 years or younger'     2509
>>> adata_balanced = ep.pp.balanced_sample(adata, key="age")
>>> adata_balanced.obs.age.value_counts()
age
'30 years or younger'    2509
'30-60 years'            2509
'Over 60 years'          2509