ehrapy.preprocessing.miss_forest_impute
- ehrapy.preprocessing.miss_forest_impute(edata, var_names=None, *, num_initial_strategy='mean', max_iter=3, n_estimators=100, random_state=0, warning_threshold=70, layer=None, copy=False)
Impute data using the MissForest strategy.
This function uses the MissForest strategy to impute missing values in the data matrix of a data object. The strategy fits a random forest model for each feature that contains missing values, then uses the trained model to predict those missing values.
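The per-feature fitting described above can be sketched roughly as follows. This is a minimal illustration, not ehrapy's implementation: it assumes scikit-learn's RandomForestRegressor, a purely numerical 2D array, mean initialization, and a fixed number of passes in place of the stopping criterion from the MissForest paper. The name miss_forest_sketch is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def miss_forest_sketch(X, max_iter=3, n_estimators=10, random_state=0):
    """Illustrative MissForest-style loop (hypothetical helper, not ehrapy code)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: replace every missing entry with its column mean.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    # Visit the columns with the fewest missing values first.
    order = np.argsort(missing.sum(axis=0))
    for _ in range(max_iter):
        for j in order:
            rows = missing[:, j]
            if not rows.any() or rows.all():
                continue  # nothing to impute, or nothing to train on
            others = np.delete(filled, j, axis=1)
            forest = RandomForestRegressor(
                n_estimators=n_estimators, random_state=random_state
            )
            # Train on rows where column j was observed ...
            forest.fit(others[~rows], filled[~rows, j])
            # ... and overwrite the current guesses for the missing rows.
            filled[rows, j] = forest.predict(others[rows])
    return filled


# Small synthetic example: column 2 depends on columns 0 and 1.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(30, 3))
X_full[:, 2] = X_full[:, 0] + X_full[:, 1]
X_missing = X_full.copy()
X_missing[[1, 5, 9], 0] = np.nan
X_missing[[2, 7], 2] = np.nan

X_imputed = miss_forest_sketch(X_missing)
```

Observed entries are left untouched; only the positions that were NaN in the input receive predicted values.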
For 2D data, if layer is None, edata.X is used directly. For 3D data, the layer is flattened along axis 0 before imputation and reshaped back to 3D afterwards.
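One way to picture the flatten-and-reshape round trip is the NumPy sketch below. It assumes a layer of shape (n_obs, n_vars, n_timepoints) and that every (observation, timepoint) pair becomes one row of the 2D matrix; both the helper names and this exact row ordering are assumptions, and ehrapy's internals may differ.

```python
import numpy as np


def flatten_3d(layer):
    """Turn (n_obs, n_vars, n_time) into (n_obs * n_time, n_vars). Hypothetical helper."""
    n_obs, n_vars, n_time = layer.shape
    # Move the variable axis last so each row holds one variable vector.
    return np.moveaxis(layer, 1, 2).reshape(n_obs * n_time, n_vars)


def restore_3d(flat, n_obs, n_time):
    """Inverse of flatten_3d: back to (n_obs, n_vars, n_time). Hypothetical helper."""
    n_vars = flat.shape[1]
    return np.moveaxis(flat.reshape(n_obs, n_time, n_vars), 1, 2)


layer = np.arange(3 * 3 * 2, dtype=float).reshape(3, 3, 2)
flat = flatten_3d(layer)          # a 2D imputer would operate on this matrix
restored = restore_3d(flat, n_obs=3, n_time=2)
```

Because the reshape is a pure rearrangement, restoring the flattened matrix without modification reproduces the original 3D array exactly.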
See https://academic.oup.com/bioinformatics/article/28/1/112/219101.
This imputation requires numerical data only, so any non-numerical variables need to be properly encoded beforehand.
- Parameters:
edata (EHRData) – Central data object.
var_names (Iterable[str] | None, default: None) – Iterable of columns to impute.
num_initial_strategy (Literal['mean', 'median', 'most_frequent', 'constant'], default: 'mean') – The initial strategy to replace all missing numerical values with.
max_iter (int, default: 3) – The maximum number of iterations if the stop criterion has not been met yet.
n_estimators (int, default: 100) – The number of trees to fit for every missing variable. Has a big effect on the run time. Decrease for faster computations.
random_state (int, default: 0) – The random seed for the initialization.
warning_threshold (int, default: 70) – Threshold of percentage of missing values to display a warning for.
layer (str | None, default: None) – The layer to impute. Required when input data is 3D.
copy (bool, default: False) – Whether to return a copy or act in place.
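To make the warning_threshold parameter concrete, the sketch below computes the percentage of missing values per variable and warns when it exceeds the threshold. The helper name check_missing, the per-variable granularity, and the strict "greater than" comparison are assumptions for illustration; the library's own check may be implemented differently.

```python
import warnings

import numpy as np


def check_missing(X, var_names, warning_threshold=70):
    """Warn for variables whose missing percentage exceeds the threshold (hypothetical)."""
    pct_missing = np.isnan(X).mean(axis=0) * 100
    flagged = [
        name for name, pct in zip(var_names, pct_missing) if pct > warning_threshold
    ]
    for name in flagged:
        warnings.warn(f"Variable '{name}' exceeds {warning_threshold}% missing values.")
    return flagged


X = np.array(
    [
        [np.nan, 1.0],
        [np.nan, 2.0],
        [np.nan, np.nan],
        [4.0, 3.0],
    ]
)

# Record warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = check_missing(X, ["a", "b"], warning_threshold=70)
```

Here variable "a" has 75% missing values and is flagged, while "b" has 25% and is not.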
- Return type:
EHRData | None
- Returns:
If copy is True, a modified copy of the original data object with imputed X. If copy is False, the original data object is modified in place, and None is returned.
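The copy versus in-place behaviour can be illustrated with a toy stand-in. ToyData and toy_impute are hypothetical names, and the mean fill is only a placeholder for the real imputation; the point is the return convention described above.

```python
import copy as copy_module

import numpy as np


class ToyData:
    """Hypothetical stand-in for a data object holding a matrix X."""

    def __init__(self, X):
        self.X = np.asarray(X, dtype=float)


def toy_impute(data, *, copy=False):
    """Fill NaNs with column means; mirrors the copy / in-place return convention."""
    target = copy_module.deepcopy(data) if copy else data
    col_means = np.nanmean(target.X, axis=0)
    target.X = np.where(np.isnan(target.X), col_means, target.X)
    # Return the modified copy, or None when the input was modified in place.
    return target if copy else None


original = ToyData([[1.0, np.nan], [3.0, 4.0]])

imputed = toy_impute(original, copy=True)
nan_after_copy = bool(np.isnan(original.X).any())  # original is still untouched

result = toy_impute(original)  # modifies original in place and returns None
```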
Examples
>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.ehrdata_blobs(n_variables=3, n_observations=3, base_timepoints=2, missing_values=0.3)
>>> edata_imputed = ep.pp.miss_forest_impute(edata, layer="tem_data", copy=True)
Example Output:
>>> edata.layers["tem_data"][0, :, :]
[[-12.12732884, -18.37304373],
 [         nan,  -0.91339411],
 [         nan,  -7.88514984]]
>>> edata_imputed.layers["tem_data"][0, :, :]
[[-12.12732884, -18.37304373],
 [ -0.3278448 ,  -0.91339411],
 [ -4.39722201,  -7.88514984]]