ehrapy.preprocessing.mice_forest_impute

ehrapy.preprocessing.mice_forest_impute(edata, var_names=None, *, warning_threshold=70, save_all_iterations_data=True, random_state=None, iterations=5, variable_parameters=None, verbose=False, layer=None, copy=False)

Impute data using the miceforest method.

See AnotherSamWilson/miceforest: fast, memory-efficient Multiple Imputation by Chained Equations (MICE) with lightgbm.

This imputation requires numerical data only, so any non-numerical variables must be properly encoded beforehand.

For 2D data, if layer is None, edata.X is used directly. For 3D data, the layer is flattened along axis 0 before imputation and reshaped back to 3D afterwards.
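The flatten-and-restore step can be pictured with plain NumPy. The following is a minimal sketch of one plausible reading, not ehrapy's internal implementation: it assumes a layer shaped (n_obs, n_vars, n_timepoints), as in the example on this page, and turns every (observation, time point) pair into one row of a 2D matrix. The array names and the random data are made up for illustration.

```python
import numpy as np

# Assumed layout, matching the example on this page: (n_obs, n_vars, n_timepoints).
n_obs, n_vars, n_time = 20, 3, 2
rng = np.random.default_rng(0)
layer = rng.normal(size=(n_obs, n_vars, n_time))
layer[rng.random(layer.shape) < 0.3] = np.nan  # hypothetical missing values

# Flatten to 2D: one row per (observation, time point), one column per variable.
flat = layer.transpose(0, 2, 1).reshape(n_obs * n_time, n_vars)

# ... a 2D imputer would fill the NaNs in `flat` at this point ...

# Undo the reshape and the transpose to restore the original 3D layout.
restored = flat.reshape(n_obs, n_time, n_vars).transpose(0, 2, 1)

assert flat.shape == (n_obs * n_time, n_vars)
assert np.array_equal(restored, layer, equal_nan=True)
```

Whatever the exact flattening order, the point is that the 2D imputer only ever sees a samples-by-variables matrix, and the inverse reshape puts each value back at its original position.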

Warning

This function is not supported on macOS.

Parameters:
  • edata (EHRData) – Central data object.

  • var_names (Iterable[str] | None, default: None) – A list of variable names to impute. If None, impute all variables.

  • warning_threshold (int, default: 70) – Missing-value percentage threshold that triggers a warning.

  • save_all_iterations_data (bool, default: True) – Whether to save all imputed values from all iterations or just the latest. Saving all iterations allows for additional plotting, but may take more memory.

  • random_state (int | None, default: None) – Seed for the random state. Set it to make the imputation reproducible.

  • iterations (int, default: 5) – The number of iterations to run.

  • variable_parameters (dict | None, default: None) – Model parameters specified per variable. Keys should be variable names or indices, and values should be a dict of parameters that apply to that variable only.

  • verbose (bool, default: False) – Whether to print information about the imputation process.

  • layer (str | None, default: None) – The layer to impute. Required when input data is 3D.

  • copy (bool, default: False) – Whether to return a copy of the data object or modify it in-place.
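To make the warning_threshold parameter concrete, the sketch below computes the percentage of missing values per variable and flags those above the threshold. It is an illustration under assumptions: the data matrix X is hypothetical, and whether ehrapy compares with a strict or non-strict inequality is not stated here, so the strict comparison is a guess.

```python
import numpy as np

warning_threshold = 70  # percent, the documented default

# Hypothetical 2D data: 10 observations, 3 variables.
X = np.arange(30, dtype=float).reshape(10, 3)
X[:8, 1] = np.nan  # 80% of variable 1 is missing
X[:2, 2] = np.nan  # 20% of variable 2 is missing

# Percentage of missing values per variable (column).
missing_pct = np.isnan(X).mean(axis=0) * 100

# Assumption: a variable is flagged when its percentage exceeds the threshold.
flagged = [i for i, pct in enumerate(missing_pct) if pct > warning_threshold]
print(flagged)
```

Only variable 1 crosses the default threshold in this made-up data. A variable with that much missingness gives the imputer little observed signal to learn from, which is presumably why a warning is raised.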

Return type:

EHRData | None

Returns:

If copy is True, a modified copy of the original data object with imputed values. If copy is False, the original data object is modified in place and None is returned.

Examples

>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.ehrdata_blobs(n_variables=3, n_observations=20, base_timepoints=2, missing_values=0.3)
>>> edata_imputed = ep.pp.mice_forest_impute(edata, layer="tem_data", copy=True)

Example Output:

>>> edata.layers["tem_data"][0, :, :]
[[-11.3735387 , -17.00612946],
[         nan,  -3.13348925],
[         nan, -10.87061402]]
>>> edata_imputed.layers["tem_data"][0, :, :]
[[-11.3735387 , -17.00612946],
[ -2.29990557,  -3.13348925],
[ -6.72812888, -10.87061402]]
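A quick sanity check on the output above: imputation should fill only the missing entries and leave observed values untouched. The sketch below copies the two printed arrays by hand into NumPy (the names before and after are illustrative) and verifies both properties without calling ehrapy.

```python
import numpy as np

nan = np.nan
# Values copied from the example output above.
before = np.array([
    [-11.3735387, -17.00612946],
    [nan, -3.13348925],
    [nan, -10.87061402],
])
after = np.array([
    [-11.3735387, -17.00612946],
    [-2.29990557, -3.13348925],
    [-6.72812888, -10.87061402],
])

observed = ~np.isnan(before)

# Observed entries are unchanged by the imputation.
assert np.array_equal(before[observed], after[observed])
# No missing values remain afterwards.
assert not np.isnan(after).any()

n_imputed = int((~observed).sum())  # number of entries that were filled in
```

The same comparison can be run on a full layer by replacing the two hand-copied arrays with the layers of the original and the imputed data object.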