ehrapy.preprocessing.sample
- ehrapy.preprocessing.sample(data, fraction=None, *, n_obs=None, rng=None, balanced=False, balanced_method='RandomUnderSampler', balanced_key=None, copy=False, replace=False, axis='obs', p=None)
Sample a fraction or a number of observations / variables with or without replacement.
- Parameters:
data (EHRData | ndarray | csr_array | csc_array) – Central data object.
fraction (float | None, default: None) – Sample to this fraction of the number of observations.
n_obs (int | None, default: None) – Sample to this number of observations.
rng (Generator | BitGenerator | int | integer | Sequence[int] | SeedSequence | None, default: None) – Random seed.
copy (bool, default: False) – If an EHRData is passed, determines whether a copy is returned.
balanced (bool, default: False) – If True, balance the groups in data.obs[balanced_key] by under- or over-sampling. Requires balanced_key to be set. If False, simple random sampling is performed.
balanced_method (Literal['RandomUnderSampler', 'RandomOverSampler'], default: 'RandomUnderSampler') – The sampling method, either "RandomUnderSampler" for under-sampling or "RandomOverSampler" for over-sampling. Only relevant if balanced=True.
balanced_key (str | None, default: None) – Key in data.obs to use for balancing the groups. Only relevant if balanced=True.
replace (bool, default: False) – If True, samples are drawn with replacement. Only relevant if balanced=False.
axis (Literal['obs', 0, 'var', 1], default: 'obs') – Axis to sample on. Either obs / 0 (observations, default) or var / 1 (variables).
p (str | ndarray[tuple[Any, ...], dtype[bool]] | ndarray[tuple[Any, ...], dtype[floating]] | None, default: None) – Drawing probabilities (floats) or mask (bools). Either an axis-sized array, or the name of a column. If p is an array of probabilities, it must sum to 1.
- Return type:
EHRData | None | tuple[ndarray | csr_array | csc_array, ndarray]
- Returns:
Returns X[obs_indices], obs_indices if data is array-like; otherwise subsamples the passed central data object (copy == False) or returns a subsampled copy of it (copy == True).
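The array-like return shape described above (sampled rows plus the indices that produced them) can be illustrated with a small standard-library sketch. This is not ehrapy's implementation: `sample_rows` is a hypothetical helper, the requirement that exactly one of `fraction` or `n_obs` is given is an assumption, and the rounding of `fraction * total` via `int()` is likewise assumed.

```python
import random


def sample_rows(rows, fraction=None, n_obs=None, seed=None, replace=False):
    """Hypothetical helper: return (sampled_rows, indices) for a list of rows."""
    # Assumption: exactly one of fraction / n_obs must be provided.
    if (fraction is None) == (n_obs is None):
        raise ValueError("Provide exactly one of fraction or n_obs")
    total = len(rows)
    # Assumption: a fraction is converted to a count by truncation.
    k = n_obs if n_obs is not None else int(fraction * total)
    rng = random.Random(seed)
    if replace:
        # With replacement: the same index may be drawn more than once.
        indices = [rng.randrange(total) for _ in range(k)]
    else:
        # Without replacement: every index appears at most once.
        indices = rng.sample(range(total), k)
    return [rows[i] for i in indices], indices


rows = [[i, i * 10] for i in range(10)]
subset, idx = sample_rows(rows, fraction=0.5, seed=0)
```

Returning the indices alongside the sampled rows lets a caller apply the same selection to other aligned arrays, such as labels.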
Examples
>>> import ehrapy as ep
>>> import ehrdata as ed
>>> edata = ed.dt.diabetes_130_fairlearn(columns_obs_only=["age"])
>>> edata.obs.age.value_counts()
age
'Over 60 years'          68541
'30-60 years'            30716
'30 years or younger'     2509
>>> edata_balanced = ep.pp.sample(
...     edata, balanced=True, balanced_method="RandomUnderSampler", balanced_key="age", copy=True
... )
>>> edata_balanced.obs.age.value_counts()
age
'30 years or younger'    2509
'30-60 years'            2509
'Over 60 years'          2509
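The balanced result in the example, where every group shrinks to the size of the smallest one, reflects the general idea of random under-sampling. The following is a conceptual sketch of that idea using only the standard library, on toy labels. It is not how ehrapy implements `balanced=True`; `undersample_indices` is a hypothetical helper introduced for illustration.

```python
import random
from collections import Counter, defaultdict


def undersample_indices(labels, seed=None):
    """Hypothetical helper: keep an equal number of indices per group."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for i, label in enumerate(labels):
        by_group[label].append(i)
    # Every group is reduced to the size of the smallest group.
    smallest = min(len(members) for members in by_group.values())
    kept = []
    for members in by_group.values():
        kept.extend(rng.sample(members, smallest))
    return sorted(kept)


# Toy labels, not the real dataset.
labels = ["Over 60 years"] * 6 + ["30-60 years"] * 3 + ["30 years or younger"] * 2
kept = undersample_indices(labels, seed=0)
counts = Counter(labels[i] for i in kept)
```

Over-sampling works the other way around: smaller groups are resampled with replacement until they match the largest group, so no observations are discarded but duplicates are introduced.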