ehrapy.preprocessing.knn_impute
- ehrapy.preprocessing.knn_impute(edata, var_names=None, *, n_neighbors=5, layer=None, copy=False, backend='faiss', warning_threshold=70, backend_kwargs=None, **kwargs)
Imputes missing values in the input data object using K-nearest neighbor imputation.
This imputation requires numerical data only, so any non-numerical variables must be encoded beforehand. For 2D data, if layer is None, edata.X is used directly. For 3D data, the layer is flattened along axis 0 before imputation and reshaped back to 3D afterwards.
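The flatten, impute, reshape round trip for 3D data can be sketched with NumPy as below. This is only an illustration under stated assumptions, not ehrapy's implementation: the axis order (observations, variables, timepoints) is inferred from the example output on this page, the exact 2D layout ehrapy uses internally is not specified here, and `flatten_time`, `restore_time`, and `fill_column_means` are hypothetical helpers. A column-mean fill stands in for the KNN step to keep the sketch short.

```python
import numpy as np


def flatten_time(arr_3d):
    """Reshape (n_obs, n_vars, n_time) into (n_obs * n_time, n_vars).

    Assumption: each 2D row is one (observation, timepoint) pair.
    """
    n_obs, n_vars, n_time = arr_3d.shape
    # Move variables to the last axis, then merge observations and timepoints.
    return arr_3d.transpose(0, 2, 1).reshape(n_obs * n_time, n_vars)


def restore_time(arr_2d, shape_3d):
    """Invert flatten_time, returning an array of shape shape_3d."""
    n_obs, n_vars, n_time = shape_3d
    return arr_2d.reshape(n_obs, n_time, n_vars).transpose(0, 2, 1)


def fill_column_means(arr_2d):
    """Placeholder 2D imputer (column means), standing in for the KNN step."""
    col_means = np.nanmean(arr_2d, axis=0)
    return np.where(np.isnan(arr_2d), col_means, arr_2d)


rng = np.random.default_rng(0)
data = rng.normal(size=(3, 3, 2))  # 3 observations, 3 variables, 2 timepoints
data[0, 1, 0] = np.nan
data[2, 0, 1] = np.nan

flat = flatten_time(data)  # 2D view that a 2D imputer can consume
imputed = restore_time(fill_column_means(flat), data.shape)
```

Whatever layout is chosen, the property that matters is that the reshape back is the exact inverse of the flatten, so observed values return to their original positions and only missing entries change.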
Warning
Currently, both n_neighbours and n_neighbors are accepted as parameters for the number of neighbors. However, in future versions, only n_neighbors will be supported. Please update your code accordingly.
- Parameters:
var_names (Iterable[str] | None, default: None) – A list of variable names indicating which columns to impute. If None, all columns are imputed.
n_neighbors (int, default: 5) – Number of neighbors to use when performing the imputation.
layer (str | None, default: None) – The layer to impute. Required when the input data is 3D.
copy (bool, default: False) – Whether to perform the imputation on a copy of the original data object. If True, the original object remains unmodified.
backend (Literal['scikit-learn', 'faiss'], default: 'faiss') – The implementation to use for the KNN imputation. 'scikit-learn' is very slow but uses an exact KNN algorithm, whereas 'faiss' is drastically faster but uses an approximation for the KNN graph. In practice, the 'faiss' results are close to the 'scikit-learn' results.
warning_threshold (int, default: 70) – Percentage of missing values above which a warning is issued.
backend_kwargs (dict | None, default: None) – Passed to the backend. Pass "mean", "median", or "weighted" for 'strategy' to set the imputation strategy for faiss. See sklearn.impute.KNNImputer for more information on the 'scikit-learn' backend. See fknni.faiss.FaissImputer for more information on the 'faiss' backend.
kwargs – Collects keyword arguments from earlier ehrapy versions for backwards compatibility. Prefer the current arguments listed here.
- Return type:
- Returns:
If copy is True, a modified copy of the original data object with imputed X. If copy is False, the original data object is modified in place, and None is returned.
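To make the role of n_neighbors and of an exact KNN search concrete, here is a minimal NumPy sketch of KNN imputation. It is an independent illustration of the general technique, not the code of either backend: `knn_impute_exact` is a hypothetical helper, distances use only the features observed in both rows and are rescaled by the share of usable features (the NaN-aware Euclidean distance described in the scikit-learn documentation), and each missing value becomes the unweighted mean of its nearest donors.

```python
import numpy as np


def knn_impute_exact(X, n_neighbors=2):
    """Fill NaNs with the mean of the n_neighbors nearest rows (brute force)."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n_rows, n_cols = X.shape
    for i in range(n_rows):
        missing = np.isnan(X[i])
        if not missing.any():
            continue
        # Distance from row i to every other row over features present in both.
        dists = np.full(n_rows, np.inf)
        for j in range(n_rows):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if shared.any():
                sq = np.sum((X[i, shared] - X[j, shared]) ** 2)
                # Rescale so rows with fewer shared features are comparable.
                dists[j] = np.sqrt(sq * n_cols / shared.sum())
        for col in np.where(missing)[0]:
            # Donors must have the value observed and a finite distance.
            donors = np.where(~np.isnan(X[:, col]) & np.isfinite(dists))[0]
            if donors.size == 0:
                continue  # nothing to borrow from; leave the value missing
            nearest = donors[np.argsort(dists[donors])[:n_neighbors]]
            out[i, col] = X[nearest, col].mean()
    return out


X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
result = knn_impute_exact(X, n_neighbors=2)
# result:
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```

The nested loops compare every pair of rows, which is why an exact search becomes slow on large datasets and why an approximate neighbor search can be drastically faster.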
Examples
>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata_3d = ed.dt.ehrdata_blobs(n_variables=3, n_observations=3, base_timepoints=2, missing_values=0.3)
>>> edata_imputed = ep.pp.knn_impute(edata_3d, layer="tem_data", copy=True)
Example Output:
>>> edata_3d.layers["tem_data"][0, :, :]
[[-12.12732884, -18.37304373],
 [         nan,  -0.91339411],
 [         nan,  -7.88514984]]
>>> edata_imputed.layers["tem_data"][0, :, :]
[[-12.12732884, -18.37304373],
 [ -0.07689509,  -0.91339411],
 [ -2.75584421,  -7.88514984]]