ehrapy.preprocessing.sample

ehrapy.preprocessing.sample(data, fraction=None, *, n_obs=None, rng=None, balanced=False, balanced_method='RandomUnderSampler', balanced_key=None, copy=False, replace=False, axis='obs', p=None)

Sample a fraction or a number of observations / variables with or without replacement.

Parameters:
  • data (EHRData | AnnData | ndarray | csr_matrix | csc_matrix) – Central data object.

  • fraction (float | None, default: None) – Sample to this fraction of the number of observations.

  • n_obs (int | None, default: None) – Sample to this number of observations.

  • rng (Generator | BitGenerator | int | integer | Sequence[int] | SeedSequence | None, default: None) – Random seed.

  • copy (bool, default: False) – If an EHRData or AnnData is passed, determines whether a copy is returned.

  • balanced (bool, default: False) – If True, balance the groups in data.obs[balanced_key] by under- or over-sampling. Requires balanced_key to be set. If False, simple random sampling is performed.

  • balanced_method (Literal['RandomUnderSampler', 'RandomOverSampler'], default: 'RandomUnderSampler') – The sampling method, either “RandomUnderSampler” for under-sampling or “RandomOverSampler” for over-sampling. Only relevant if balanced=True.

  • balanced_key (str | None, default: None) – Key in data.obs to use for balancing the groups. Only relevant if balanced=True.

  • replace (bool, default: False) – If True, samples are drawn with replacement. Only relevant if balanced=False.

  • axis (Literal['obs', 0, 'var', 1], default: 'obs') – Axis to sample on. Either obs / 0 (observations, default) or var / 1 (variables).

  • p (str | ndarray[tuple[Any, ...], dtype[bool]] | ndarray[tuple[Any, ...], dtype[floating]] | None, default: None) – Drawing probabilities (floats) or mask (bools). Either an axis-sized array or the name of a column. If p is an array of probabilities, it must sum to 1.
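Because an array of probabilities passed as p must sum to 1, raw per-observation weights need to be normalized first. The following is a minimal sketch of that step using plain Python lists; normalize_weights is a hypothetical helper, not part of ehrapy, and the result would still need to be converted to an ndarray before being passed as p.

```python
import math


def normalize_weights(weights):
    """Scale non-negative weights so that they sum to 1.

    Hypothetical helper for illustration only; not part of ehrapy.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = math.fsum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]


# Four observations with unequal raw weights.
raw_weights = [3.0, 1.0, 1.0, 5.0]
probs = normalize_weights(raw_weights)

# The normalized values form a valid probability vector.
assert math.isclose(math.fsum(probs), 1.0)
```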

Return type:

EHRData | AnnData | None | tuple[ndarray | csr_matrix | csc_matrix, ndarray]

Returns:

If data is array-like, returns the tuple (X[obs_indices], obs_indices). Otherwise, subsamples the passed data object in place (copy=False) or returns a subsampled copy of it (copy=True).
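The array-like return shape, the sampled rows together with the indices that were drawn, can be illustrated with a small conceptual sketch. It uses only the standard library and a list of rows in place of an ndarray; sample_rows is a hypothetical stand-in and does not reproduce ehrapy's actual implementation or its random number handling.

```python
import random


def sample_rows(X, n_obs, seed=0, replace=False):
    """Return (sampled_rows, indices), mirroring the (X[obs_indices], obs_indices) shape.

    Hypothetical illustration only; not ehrapy's implementation.
    """
    rng = random.Random(seed)
    n = len(X)
    if replace:
        # With replacement: the same row index may be drawn more than once.
        indices = [rng.randrange(n) for _ in range(n_obs)]
    else:
        # Without replacement: every drawn index is unique.
        indices = rng.sample(range(n), n_obs)
    return [X[i] for i in indices], indices


# Ten observations with two features each.
X = [[i, i * i] for i in range(10)]
rows, idx = sample_rows(X, n_obs=4, seed=42)

# The returned rows are exactly the rows selected by the returned indices.
assert rows == [X[i] for i in idx]
```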

Examples

>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.diabetes_130_fairlearn(columns_obs_only=["age"])
>>> edata.obs.age.value_counts()
age
'Over 60 years'          68541
'30-60 years'            30716
'30 years or younger'     2509
>>> edata_balanced = ep.pp.sample(
...     edata, balanced=True, balanced_method="RandomUnderSampler", balanced_key="age", copy=True
... )
>>> edata_balanced.obs.age.value_counts()
age
'30 years or younger'    2509
'30-60 years'            2509
'Over 60 years'          2509
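The balanced counts above follow from what random under-sampling does: every group is reduced to the size of the smallest group. The sketch below illustrates that idea with the standard library on a toy label list; undersample_indices is a hypothetical helper and is not how ehrapy implements balancing internally.

```python
import random
from collections import Counter, defaultdict


def undersample_indices(labels, seed=0):
    """Keep as many randomly chosen indices per group as the smallest group has.

    Hypothetical illustration of random under-sampling; not ehrapy's implementation.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)
    # The smallest group determines how many observations each group keeps.
    target = min(len(indices) for indices in groups.values())
    kept = []
    for indices in groups.values():
        kept.extend(rng.sample(indices, target))
    return sorted(kept)


# Toy labels with the same imbalance pattern as the example, at a much smaller scale.
labels = ["Over 60 years"] * 6 + ["30-60 years"] * 3 + ["30 years or younger"] * 2
kept = undersample_indices(labels)
balanced_counts = Counter(labels[i] for i in kept)

# Every group now has the size of the smallest original group (2).
assert set(balanced_counts.values()) == {2}
```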