ehrapy.preprocessing.miss_forest_impute
- ehrapy.preprocessing.miss_forest_impute(adata, var_names=None, *, num_initial_strategy='mean', max_iter=3, n_estimators=100, random_state=0, warning_threshold=70, copy=False)
Impute data using the MissForest strategy.
This function imputes missing values in the data matrix of an AnnData object using the MissForest strategy. The strategy fits a random forest model for each feature that contains missing values and uses the trained model to predict those values.
See https://academic.oup.com/bioinformatics/article/28/1/112/219101. The function has to determine which columns in X contain only numerical values (including NaNs) and which contain non-numerical data.
- Parameters:
adata (AnnData) – The AnnData object to use MissForest imputation on.
var_names (dict[str, list[str]] | list[str] | None, default: None) – List of columns to impute, or a dict with two keys ('numerical' and 'non_numerical') indicating which variables contain mixed data and which contain numerical data only.
num_initial_strategy (Literal['mean', 'median', 'most_frequent', 'constant'], default: 'mean') – The initial strategy to replace all missing numerical values with.
max_iter (int, default: 3) – The maximum number of iterations if the stop criterion has not been met yet.
n_estimators (default: 100) – The number of trees to fit for every variable with missing values. Strongly affects run time; decrease for faster computation.
random_state (int, default: 0) – The random seed for the initialization.
warning_threshold (int, default: 70) – Threshold for the percentage of missing values that triggers a warning.
copy (bool, default: False) – Whether to return a copy or act in place.
- Return type:
- Returns:
The imputed (but unencoded) AnnData object.
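The four values accepted by `num_initial_strategy` can be illustrated on a single numerical column. This is a sketch under stated assumptions, not the library's code: the function `initial_fill` and its `fill_value` argument are hypothetical (the ehrapy signature above has no `fill_value`), and missing values are assumed to be encoded as NaN. In MissForest-style imputation, such a simple fill typically serves as the starting guess that later iterations refine.

```python
import math
import statistics


def initial_fill(values, strategy="mean", fill_value=0.0):
    """Replace NaNs in a list of floats with a simple initial guess.

    `fill_value` is only used for strategy='constant' and is a
    hypothetical parameter introduced for this sketch.
    """
    observed = [v for v in values if not math.isnan(v)]
    if strategy == "mean":
        replacement = statistics.fmean(observed)
    elif strategy == "median":
        replacement = statistics.median(observed)
    elif strategy == "most_frequent":
        # statistics.mode returns the first mode encountered on ties.
        replacement = statistics.mode(observed)
    elif strategy == "constant":
        replacement = fill_value
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return [replacement if math.isnan(v) else v for v in values]


nan = float("nan")
filled_mean = initial_fill([1.0, nan, 3.0], strategy="mean")
filled_median = initial_fill([1.0, nan, 2.0, 9.0], strategy="median")
filled_mode = initial_fill([4.0, 4.0, nan, 5.0], strategy="most_frequent")
filled_const = initial_fill([1.0, nan], strategy="constant", fill_value=0.0)
```

The mean is sensitive to outliers, so 'median' may be a more robust starting point for skewed variables.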
Examples
>>> import ehrapy as ep
>>> adata = ep.dt.mimic_2(encoded=True)
>>> ep.pp.miss_forest_impute(adata)
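Before imputing, it can help to check which columns would exceed `warning_threshold`. The sketch below computes the percentage of missing values per column with the standard library only. The function `columns_over_threshold` and the example data are hypothetical, and whether ehrapy compares with a strict or non-strict inequality is an assumption here (a strict `>` is used).

```python
import math


def columns_over_threshold(columns, warning_threshold=70):
    """Return {column name: percent missing} for columns above the threshold.

    `columns` maps a column name to a list of cell values; None and NaN
    count as missing. The strict '>' comparison is an assumption.
    """
    flagged = {}
    for name, values in columns.items():
        if not values:
            continue  # skip empty columns to avoid dividing by zero
        n_missing = sum(
            1
            for v in values
            if v is None or (isinstance(v, float) and math.isnan(v))
        )
        percent_missing = 100 * n_missing / len(values)
        if percent_missing > warning_threshold:
            flagged[name] = percent_missing
    return flagged


# Hypothetical example data.
nan = float("nan")
flagged = columns_over_threshold(
    {
        "lactate": [nan, nan, nan, 1.2],  # 75% missing
        "age": [63, 41, nan, 58],         # 25% missing
    }
)
```

Columns flagged this way have few observed values to learn from, so their imputed values deserve extra scrutiny.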