ehrapy.preprocessing.miss_forest_impute

ehrapy.preprocessing.miss_forest_impute(adata, var_names=None, *, num_initial_strategy='mean', max_iter=3, n_estimators=100, random_state=0, warning_threshold=70, copy=False)

Impute data using the MissForest strategy.

This function uses the MissForest strategy to impute missing values in the data matrix of an AnnData object. The strategy works by fitting a random forest model on each feature containing missing values, and using the trained model to predict the missing values.

See https://academic.oup.com/bioinformatics/article/28/1/112/219101.

This imputation requires numerical data only, so any non-numerical features must be properly encoded before calling this function.
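The idea described above can be sketched outside of ehrapy with scikit-learn. This is a minimal illustration under stated assumptions, not necessarily how ehrapy implements the function internally: it combines scikit-learn's `IterativeImputer` with a `RandomForestRegressor`, uses a small synthetic matrix, and picks a reduced number of trees so that it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to import IterativeImputer)
from sklearn.impute import IterativeImputer

# Synthetic numerical data (an assumption for illustration): the third
# column depends on the first two, so a forest can learn to predict it.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(60, 3))
X_full[:, 2] = X_full[:, 0] + X_full[:, 1]

# Knock out roughly 10% of the entries at random.
X_missing = X_full.copy()
mask = rng.random(X_missing.shape) < 0.1
X_missing[mask] = np.nan

# Missing entries are first filled with the column mean, then each feature
# with missing values is repeatedly re-predicted by a random forest trained
# on the other features, for at most max_iter rounds.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    initial_strategy="mean",
    max_iter=3,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)
```

Observed values are left untouched; only the entries that were missing are replaced by predictions. The arguments used here mirror `num_initial_strategy`, `max_iter`, `n_estimators`, and `random_state` from the signature above, but the mapping is an assumption made for this sketch.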

Parameters:
  • adata (AnnData) – The AnnData object to impute.

  • var_names (Iterable[str] | None, default: None) – Iterable of variable names (columns) to impute.

  • num_initial_strategy (Literal['mean', 'median', 'most_frequent', 'constant'], default: 'mean') – The initial strategy to replace all missing numerical values with.

  • max_iter (int, default: 3) – The maximum number of imputation iterations to run if the stopping criterion is not met earlier.

  • n_estimators (int, default: 100) – The number of trees to fit for each variable with missing values. This strongly affects run time; decrease it for faster computation.

  • random_state (int, default: 0) – The random seed for the initialization.

  • warning_threshold (int, default: 70) – Percentage of missing values in a feature above which a warning is displayed.

  • copy (bool, default: False) – Whether to return a copy or act in place.
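To make the `warning_threshold` parameter concrete, the following sketch shows how such a check might work. The helper name `columns_above_missing_threshold` is hypothetical and not part of ehrapy, and the strict "greater than" comparison is an assumption; ehrapy's actual check may differ.

```python
import math
import warnings


def columns_above_missing_threshold(rows, names, threshold=70):
    """Return the names of columns whose percentage of NaN values exceeds threshold.

    Hypothetical helper for illustration only; not part of the ehrapy API.
    """
    flagged = []
    n_rows = len(rows)
    for j, name in enumerate(names):
        n_missing = sum(1 for row in rows if math.isnan(row[j]))
        pct = 100.0 * n_missing / n_rows
        if pct > threshold:  # assumption: strictly greater than the threshold
            warnings.warn(f"Feature '{name}' has {pct:.1f}% missing values")
            flagged.append(name)
    return flagged


# Toy data: "age" is 25% missing, "lactate" is 75% missing.
rows = [
    [1.0, math.nan],
    [2.0, math.nan],
    [math.nan, math.nan],
    [4.0, 0.5],
]
flagged = columns_above_missing_threshold(rows, ["age", "lactate"])
```

With the default threshold of 70, only the second column is flagged. Imputed values for such sparsely observed features rest on very little real data, which is presumably why a warning is useful.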

Return type:

AnnData | None

Returns:

If copy is True, a modified copy of the original AnnData object with imputed X. If copy is False, the original AnnData object is modified in place, and None is returned.

Examples

>>> import ehrapy as ep
>>> adata = ep.dt.mimic_2(encoded=True)
>>> ep.pp.miss_forest_impute(adata)