ehrapy.preprocessing.simple_impute


ehrapy.preprocessing.simple_impute(edata, var_names=None, *, strategy='mean', warning_threshold=70, layer=None, copy=False)

Impute missing values in numerical data using mean/median/most frequent imputation.

The mean and median strategies require numerical data only, so any non-numerical variables must be properly encoded before imputing with either of them.

Parameters:
  • edata (EHRData | AnnData) – Central data object.

  • var_names (Iterable[str] | None, default: None) – A list of column names to apply imputation on (if None, impute all columns).

  • strategy (Literal['mean', 'median', 'most_frequent'], default: 'mean') – Imputation strategy to use. One of {‘mean’, ‘median’, ‘most_frequent’}. If data is a dask.array.Array, only ‘mean’ is supported.

  • warning_threshold (int, default: 70) – Display a warning message if the percentage of missing values exceeds this threshold.

  • layer (str | None, default: None) – The layer to impute.

  • copy (bool, default: False) – Whether to return a copy of edata or modify it in place.
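
To make the warning_threshold parameter concrete, the following is a minimal plain-Python sketch of a missing-value percentage check. It is not ehrapy's implementation: the helper names missing_percentage and warn_if_sparse are hypothetical, missing values are represented as None, and the check is assumed to run per column, which this page does not state.

```python
import warnings


def missing_percentage(values):
    """Return the percentage of entries in `values` that are None."""
    if not values:
        return 0.0
    n_missing = sum(v is None for v in values)
    return 100.0 * n_missing / len(values)


def warn_if_sparse(name, values, warning_threshold=70):
    """Warn and return True if the missing percentage exceeds the threshold.

    Hypothetical helper: mirrors the documented default of 70, but the
    per-column granularity is an assumption of this sketch.
    """
    pct = missing_percentage(values)
    if pct > warning_threshold:
        warnings.warn(
            f"{name}: {pct:.1f}% of values are missing "
            f"(threshold: {warning_threshold}%)"
        )
        return True
    return False


# Three of four values are missing, which is above the default threshold.
lactate = [None, None, None, 1.2]
warn_if_sparse("lactate", lactate)
```

A column with that many gaps is filled almost entirely with a single statistic, so a warning of this kind is a prompt to consider dropping the variable or using a different imputation method.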

Return type:

EHRData | AnnData | None

Returns:

If copy is True, a modified copy of the original data object with imputed X. If copy is False, the original data object is modified in place, and None is returned.

Examples

>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.mimic_2()
>>> ep.pp.simple_impute(edata, strategy="median")
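
For intuition about what the three strategies compute, here is a small standard-library sketch that fills gaps in a single column. It is an illustration only, not how ehrapy implements simple_impute: impute_column is a hypothetical helper, missing values are represented as None rather than NaN, and tie-breaking for most_frequent may differ from the library's behavior.

```python
from statistics import mean, median, mode


def impute_column(values, strategy="mean"):
    """Replace None entries with a statistic of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")

    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "most_frequent":
        # statistics.mode returns the first mode encountered on ties.
        fill = mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Observed values are kept; only the gaps receive the fill value.
    return [fill if v is None else v for v in values]


heart_rate = [60, None, 80, 60, 100]
for strategy in ("mean", "median", "most_frequent"):
    print(strategy, impute_column(heart_rate, strategy))
```

With the observed values 60, 80, 60, and 100, the gap is filled with 75 for the mean, 70 for the median, and 60 for the most frequent value.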