ehrapy.preprocessing.locf_impute

ehrapy.preprocessing.locf_impute(edata, var_names=None, *, layer=None, fallback_method='mean', copy=False)

Impute missing values by carrying forward the last observed value along the time axis.

Implements Last Observation Carried Forward (LOCF) for longitudinal (3D) data. For each patient and feature, missing values are replaced with the most recent non-missing value. Missing values that occur before any observation for a given patient are filled using a fallback method.
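The forward-fill step described above can be illustrated with plain NumPy. This is a minimal sketch, not ehrapy's implementation: the helper name `locf_forward_fill` is hypothetical, and it assumes a float array whose last axis is time. Values before a patient's first observation stay NaN here, because the fallback method handles them separately.

```python
import numpy as np


def locf_forward_fill(arr):
    """Forward-fill NaNs along the last (time) axis.

    Hypothetical helper for illustration only; not part of ehrapy.
    """
    arr = np.asarray(arr, dtype=float)
    n_time = arr.shape[-1]
    observed = ~np.isnan(arr)
    # Keep the time index where a value is observed, use 0 elsewhere.
    idx = np.where(observed, np.arange(n_time), 0)
    # The running maximum is the index of the most recent observation so far.
    last_idx = np.maximum.accumulate(idx, axis=-1)
    # Gather the value at that index; leading gaps point at index 0, which is NaN.
    return np.take_along_axis(arr, last_idx, axis=-1)
```

For a single series `[1.0, nan, 3.0, nan]` this yields `[1.0, 1.0, 3.0, 3.0]`, while `[nan, 2.0, nan, 4.0]` keeps its leading NaN and becomes `[nan, 2.0, 2.0, 4.0]`.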

Parameters:
  • edata (EHRData) – Central data object.

  • var_names (Iterable[str] | None, default: None) – Names of the variables (columns) to impute. If None, all variables are imputed.

  • layer (str | None, default: None) – The layer to impute. Must contain 3D data of shape (n_obs, n_vars, n_time).

  • fallback_method (Literal['mean', 'median', 'most_frequent', 'bfill'] | None, default: 'mean') – Method for filling values that precede a patient’s first observation. 'mean', 'median', and 'most_frequent' fill with the per-feature mean, median, or most frequent value, each computed from the original data before forward filling. 'bfill' fills with the patient’s first observed value (backward fill). None leaves the remaining NaN values untouched.

  • copy (bool, default: False) – Whether to return a copy of edata or modify it in place.
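As a rough illustration of the fallback_method options, the sketch below fills the NaN values that remain after forward filling. It is an assumption-laden example rather than the library's code: the helper name `fill_leading_gaps` is hypothetical, the arrays are assumed to have shape (n_obs, n_vars, n_time), and 'most_frequent' is omitted for brevity.

```python
import numpy as np


def fill_leading_gaps(original, forward_filled, fallback_method="mean"):
    """Fill NaNs left after forward filling (hypothetical helper, not ehrapy API)."""
    original = np.asarray(original, dtype=float)
    out = np.array(forward_filled, dtype=float, copy=True)
    if fallback_method is None:
        # Leave the remaining NaN values untouched.
        return out
    if fallback_method == "mean":
        # Per-feature mean over all patients and time steps of the original data.
        stat = np.nanmean(original, axis=(0, 2), keepdims=True)
    elif fallback_method == "median":
        # Per-feature median over all patients and time steps of the original data.
        stat = np.nanmedian(original, axis=(0, 2), keepdims=True)
    elif fallback_method == "bfill":
        # First observed value for each patient and feature.
        first_idx = (~np.isnan(original)).argmax(axis=-1)[..., None]
        stat = np.take_along_axis(original, first_idx, axis=-1)
    else:
        raise ValueError(f"Unsupported fallback_method: {fallback_method!r}")
    # Replace only the positions that are still missing.
    return np.where(np.isnan(out), stat, out)
```

Computing the statistic from the original array, rather than from the forward-filled one, keeps carried-forward values from being counted more than once.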

Return type:

EHRData | None

Returns:

If copy is True, a modified copy of the original data object with imputed data. If copy is False, the original data object is modified in place, and None is returned.

Raises:

ValueError – If the data is not 3D or if an unsupported fallback_method is specified.

Examples

>>> import numpy as np
>>> import ehrdata as ed
>>> import ehrapy as ep
>>> data = np.array(
...     [
...         [[1.0, np.nan, 3.0, np.nan], [np.nan, 2.0, np.nan, 4.0], [5.0, 6.0, 7.0, 8.0]],
...         [[np.nan, np.nan, 3.0, np.nan], [1.0, np.nan, np.nan, np.nan], [np.nan, 2.0, np.nan, 4.0]],
...     ]
... )
>>> edata = ed.EHRData(shape=(2, 3), layers={"tem_data": data})
>>> ep.pp.locf_impute(edata, layer="tem_data")
>>> edata.layers["tem_data"]
array([[[1.        , 1.        , 3.        , 3.        ],
        [2.33, 2.        , 2.        , 4.        ],
        [5.        , 6.        , 7.        , 8.        ]],

       [[2.33, 2.33, 3.        , 3.        ],
        [1.        , 1.        , 1.        , 1.        ],
        [5.33, 2.        , 2.        , 4.        ]]])