ehrapy.preprocessing.miss_forest_impute

ehrapy.preprocessing.miss_forest_impute(adata, var_names=None, *, num_initial_strategy='mean', max_iter=3, n_estimators=100, random_state=0, warning_threshold=70, copy=False)

Impute data using the MissForest strategy.

This function uses the MissForest strategy to impute missing values in the data matrix of an AnnData object. The strategy works by fitting a random forest model on each feature containing missing values, and using the trained model to predict the missing values.
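The column-by-column loop described above can be sketched in plain Python. This is only an illustration under stated assumptions, not ehrapy's implementation: the names `iterative_impute`, `nearest_neighbour_predict`, `other_columns` and `is_missing` are hypothetical, a nearest-neighbour lookup stands in for the random forest, every column is assumed to have at least one observed value, and the stopping criterion is left out, so the loop always runs `max_iter` times.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def other_columns(row, j):
    """Return the row without column j; these values are the predictors."""
    return [v for k, v in enumerate(row) if k != j]


def nearest_neighbour_predict(train_x, train_y, query):
    """Stand-in predictor: return the target of the closest training row.

    MissForest fits a random forest at this step instead.
    """
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]


def iterative_impute(rows, max_iter=3):
    """Impute missing numerical values column by column.

    rows is a list of equal-length lists of floats; missing entries are
    NaN or None. The input is not modified.
    """
    n_cols = len(rows[0])
    missing = [[is_missing(v) for v in row] for row in rows]
    data = [list(row) for row in rows]

    # Step 1: initial guess, here the mean of each column's observed values.
    for j in range(n_cols):
        observed = [data[i][j] for i in range(len(data)) if not missing[i][j]]
        mean = sum(observed) / len(observed)
        for i in range(len(data)):
            if missing[i][j]:
                data[i][j] = mean

    # Step 2: repeatedly re-predict each incomplete column from the others.
    for _ in range(max_iter):
        for j in range(n_cols):
            todo = [i for i in range(len(data)) if missing[i][j]]
            if not todo:
                continue
            known = [i for i in range(len(data)) if not missing[i][j]]
            train_x = [other_columns(data[i], j) for i in known]
            train_y = [data[i][j] for i in known]
            for i in todo:
                query = other_columns(data[i], j)
                data[i][j] = nearest_neighbour_predict(train_x, train_y, query)
    return data


nan = float("nan")
imputed = iterative_impute([[1.0, 10.0], [2.0, 20.0], [3.0, nan], [nan, 20.0]])
```

The mean fill corresponds to `num_initial_strategy='mean'`, and the outer loop is bounded by `max_iter`, as in the signature above.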

See https://academic.oup.com/bioinformatics/article/28/1/112/219101. The function first has to determine which columns in X contain only numerical values (including NaNs) and which contain non-numerical data.
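One way to picture that split is the sketch below. It is an assumption about the general idea, not the check ehrapy performs internally; `split_columns` and the column names `age` and `service_unit` are hypothetical. The result uses the same two keys that `var_names` accepts.

```python
def split_columns(columns):
    """Split a {name: values} mapping into numerical and non-numerical names.

    A column counts as numerical when every entry is an int or float.
    NaN is a float, so it is allowed; bool is excluded on purpose.
    """
    groups = {"numerical": [], "non_numerical": []}
    for name, values in columns.items():
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        groups["numerical" if numeric else "non_numerical"].append(name)
    return groups


# Hypothetical columns: one purely numerical, one with strings and a gap.
example = {
    "age": [63.0, float("nan"), 71.0],
    "service_unit": ["MICU", None, "SICU"],
}
groups = split_columns(example)
```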

Parameters:
  • adata (AnnData) – The AnnData object to apply MissForest imputation to.

  • var_names (dict[str, list[str]] | list[str] | None) – List of columns to impute, or a dict with two keys (‘numerical’ and ‘non_numerical’) listing the variables that contain only numerical data and the variables that contain mixed data, respectively.

  • num_initial_strategy (Literal['mean', 'median', 'most_frequent', 'constant']) – The initial strategy to replace all missing numerical values with. Defaults to ‘mean’.

  • max_iter (int) – The maximum number of iterations to run if the stopping criterion has not been met. Defaults to 3.

  • n_estimators (int) – The number of trees to fit for every variable with missing values. This strongly affects run time; decrease it for faster computation. Defaults to 100.

  • random_state (int) – The random seed for the initialization. Defaults to 0.

  • warning_threshold (int) – Percentage of missing values above which a warning is displayed. Defaults to 70.

  • copy (bool) – Whether to return a copy or act in place. Defaults to False.
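To make `num_initial_strategy` and `warning_threshold` concrete, the sketch below computes the initial fill value for one numerical column under each strategy, and the column's percentage of missing values. It is a standard-library illustration, not ehrapy code: the helper names, the `fill_value` argument for the ‘constant’ strategy, and the strict "greater than" comparison against the threshold are assumptions.

```python
import math
import statistics


def observed_values(values):
    """Drop NaN entries from a list of floats."""
    return [v for v in values if not math.isnan(v)]


def initial_fill_value(values, strategy="mean", fill_value=0.0):
    """Return the value used to replace missing entries before iterating.

    fill_value is a hypothetical argument; it is only used by 'constant'.
    """
    observed = observed_values(values)
    if strategy == "mean":
        return statistics.fmean(observed)
    if strategy == "median":
        return statistics.median(observed)
    if strategy == "most_frequent":
        return statistics.mode(observed)
    if strategy == "constant":
        return fill_value
    raise ValueError(f"unknown strategy: {strategy!r}")


def missing_percentage(values):
    """Return the share of NaN entries in the column, in percent."""
    return 100.0 * (len(values) - len(observed_values(values))) / len(values)


column = [1.0, 2.0, 2.0, float("nan")]
fill = initial_fill_value(column, strategy="median")
# Assumed comparison: warn only when the percentage exceeds the threshold.
needs_warning = missing_percentage(column) > 70
```

With one of four entries missing, this column sits well below the default threshold of 70, so no warning would be expected for it.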

Return type:

AnnData

Returns:

The imputed (but unencoded) AnnData object.

Examples

>>> import ehrapy as ep
>>> adata = ep.dt.mimic_2(encoded=True)
>>> ep.pp.miss_forest_impute(adata)