ehrapy.preprocessing.explicit_impute


ehrapy.preprocessing.explicit_impute(edata, replacement, *, layer=None, impute_empty_strings=True, warning_threshold=70, copy=False)

Replaces missing values, either in all columns or in a user-specified subset of columns, with the passed replacement value(s).

There are three scenarios to cover:

1. Replace all missing values with the specified value.
2. Replace all missing values in a subset of columns with a specified value per column.
3. Replace all missing values with a different value per timepoint.
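The three scenarios can be illustrated with a minimal pure-Python sketch. This is not ehrapy's implementation: the helpers `is_missing`, `pick_replacement` and `explicit_impute_sketch` and the `_SKIP` sentinel are hypothetical names. The sketch assumes the data is laid out as `data[obs][var][time]`, and that columns not listed in a dictionary are left untouched.

```python
import math
from collections.abc import Mapping, Sequence

_SKIP = object()  # hypothetical sentinel: "no replacement for this column"


def is_missing(value, impute_empty_strings=True):
    """Treat None, NaN and (optionally) empty strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return impute_empty_strings and value == ""


def pick_replacement(replacement, var_name, time_idx):
    """Choose the fill value for one cell from the three replacement forms."""
    if isinstance(replacement, Mapping):
        # Scenario 2: per-column values; unlisted columns are left untouched (assumption).
        return replacement.get(var_name, _SKIP)
    if isinstance(replacement, Sequence) and not isinstance(replacement, str):
        # Scenario 3: one value per timepoint.
        return replacement[time_idx]
    # Scenario 1: one scalar for every missing cell.
    return replacement


def explicit_impute_sketch(data, var_names, replacement, impute_empty_strings=True):
    """Return an imputed copy of data[obs][var][time]; the input is not modified."""
    imputed = []
    for obs in data:
        new_obs = []
        for var_name, series in zip(var_names, obs):
            new_series = []
            for time_idx, value in enumerate(series):
                fill = pick_replacement(replacement, var_name, time_idx)
                if fill is not _SKIP and is_missing(value, impute_empty_strings):
                    new_series.append(fill)
                else:
                    new_series.append(value)
            new_obs.append(new_series)
        imputed.append(new_obs)
    return imputed


nan = float("nan")
data = [[[nan, 1.0], [nan, nan]]]  # 1 observation, 2 variables, 2 timepoints
var_names = ["a", "b"]

all_zero = explicit_impute_sketch(data, var_names, 0)           # scenario 1
only_b = explicit_impute_sketch(data, var_names, {"b": 9})      # scenario 2
per_time = explicit_impute_sketch(data, var_names, [1, 2])      # scenario 3
```

In the sketch, the type of `replacement` alone selects the scenario, which mirrors how the single `replacement` parameter accepts a scalar, a mapping, or a sequence.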

Parameters:
  • edata (EHRData | AnnData) – Central data object.

  • replacement (str | int | float | Mapping[str, str | int | float] | Sequence[str | int | float]) – The value to replace missing values with. If a dictionary is provided, the keys represent column names and the values represent the replacement values for those columns. If a list with one entry per timepoint is provided, each index of the list represents a timepoint and the value at that index is the replacement value for that timepoint.

  • layer (str | None, default: None) – The layer to impute.

  • impute_empty_strings (bool, default: True) – If True, empty strings are also replaced.

  • warning_threshold (int, default: 70) – Percentage threshold of missing values that triggers a warning.

  • copy (bool, default: False) – If True, returns a modified copy of the original data object. If False, modifies the object in place.
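To make the role of warning_threshold concrete, the following is a minimal sketch of a percentage-based missingness check. It is an illustration only: the helper `warn_if_mostly_missing` is hypothetical, and it assumes the percentage is computed per column and that a warning is raised when the percentage is strictly greater than the threshold. Neither detail is confirmed by this page.

```python
import math
import warnings


def warn_if_mostly_missing(columns, warning_threshold=70):
    """Warn for each column whose percentage of missing values exceeds the threshold.

    `columns` maps a column name to a list of values (hypothetical input format).
    Returns a dict of flagged column names and their missing percentages.
    """
    flagged = {}
    for name, values in columns.items():
        if not values:
            continue  # nothing to measure in an empty column
        n_missing = sum(
            1 for v in values
            if v is None or (isinstance(v, float) and math.isnan(v))
        )
        pct = 100 * n_missing / len(values)
        if pct > warning_threshold:  # assumption: strictly greater than
            flagged[name] = pct
            warnings.warn(
                f"Column '{name}' has {pct:.1f}% missing values "
                f"(threshold: {warning_threshold}%)."
            )
    return flagged


nan = float("nan")
example_columns = {
    "a": [nan, nan, nan, 1.0],  # 75% missing
    "b": [1.0, nan, 2.0, 3.0],  # 25% missing
}
```

Calling `warn_if_mostly_missing(example_columns)` flags only column `a`, since 75% exceeds the default threshold of 70 while 25% does not.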

Return type:

EHRData | AnnData | None

Returns:

If copy is True, a modified copy of the original data object with the missing values imputed. If copy is False, the original data object is modified in place and None is returned.

Examples

Replace all missing values in edata with the value 0:

>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.mimic_2()
>>> ep.pp.explicit_impute(edata, replacement=0)

Replace all missing values in the first timepoint with 1 and all missing values in the second timepoint with 2:

>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.ehrdata_blobs(n_variables=10, n_observations=10, base_timepoints=2, missing_values=0.5)
>>> ep.pp.explicit_impute(edata, replacement=[1, 2], layer="tem_data")

Example Output:

>>> edata.layers["tem_data"][0, :, 0]
[ 1.        ,  1.        ,  1.        ,  1.        ,  1.        ,
0.021176  , -5.25906637,  1.        ,  1.        ,  1.        ]
>>> edata.layers["tem_data"][0, :, 1]
[ 2.        , 10.30041167, -3.6883699 ,  2.        ,  2.        ,
0.09374899,  2.        , -3.77042107,  2.        ,  2.45151241]