ehrapy.preprocessing.filter_features

ehrapy.preprocessing.filter_features(edata, *, layer=None, min_obs=None, max_obs=None, time_mode='all', prop=None, copy=False)

Filter features based on missing data thresholds.

Keep only features which have at least min_obs observations and/or have at most max_obs observations. An observation is considered non-missing if it contains a valid (non-NaN / non-null) value.

When a longitudinal EHRData is passed, filtering is applied across time points according to the specified time_mode.

Provide at least one of min_obs and max_obs.

Parameters:
  • edata (EHRData) – Central data object.

  • layer (str | None, default: None) – Layer to use for filtering. If None (default), filtering is done on .X.

  • min_obs (int | None, default: None) – Minimum number of observations required for a feature to pass filtering.

  • max_obs (int | None, default: None) – Maximum number of observations allowed for a feature to pass filtering.

  • time_mode (Literal['all', 'any', 'proportion'], default: 'all') –

    How to combine filtering criteria across the time axis. Use it only with 3-dimensional EHRData objects. Options are:

    • 'all' (default): The feature must pass the filtering criteria in all time points.

    • 'any': The feature must pass the filtering criteria in at least one time point.

    • 'proportion': The feature must pass the filtering criteria in at least a proportion prop of time points.

      For example, with prop=0.3, the feature must pass the filtering criteria in at least 30% of the time points.

  • prop (float | None, default: None) – Proportion of time points in which the feature must pass the filtering criteria. Only relevant if time_mode='proportion'.

  • copy (bool, default: False) – Determines whether a copy is returned.

Return type:

EHRData | None

Returns:

Depending on copy, either subsets and annotates the passed data object in place and returns None, or returns a filtered copy of the data object.

Examples

>>> import ehrapy as ep
>>> import ehrdata as ed
>>> edata = ed.dt.ehrdata_blobs(
...     n_variables=45, n_observations=500, base_timepoints=15, missing_values=0.6, layer="tem_data"
... )
>>> edata.layers["tem_data"].shape
(500, 45, 15)
>>> ep.pp.filter_features(edata, min_obs=185, time_mode="all", layer="tem_data")
>>> edata.layers["tem_data"].shape
(500, 18, 15)
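
The way min_obs, max_obs, and time_mode combine can be sketched with plain NumPy. This is an illustrative sketch only, not ehrapy's implementation: the helper feature_mask is hypothetical, and it assumes a float array of shape (n_obs, n_vars, n_timepoints) in which NaN marks a missing value.

```python
import numpy as np


def feature_mask(data, min_obs=None, max_obs=None, time_mode="all", prop=None):
    """Hypothetical helper: boolean mask of the features that would be kept.

    Assumes `data` has shape (n_obs, n_vars, n_timepoints) and NaN means missing.
    """
    # Count non-missing observations per feature and time point -> (n_vars, n_timepoints)
    counts = (~np.isnan(data)).sum(axis=0)

    # A feature passes at a time point if its count lies within the given bounds
    passes = np.ones(counts.shape, dtype=bool)
    if min_obs is not None:
        passes &= counts >= min_obs
    if max_obs is not None:
        passes &= counts <= max_obs

    # Combine the per-time-point results along the time axis
    if time_mode == "all":
        return passes.all(axis=1)
    if time_mode == "any":
        return passes.any(axis=1)
    if time_mode == "proportion":
        if prop is None:
            raise ValueError("prop is required when time_mode='proportion'")
        return passes.mean(axis=1) >= prop
    raise ValueError(f"unknown time_mode: {time_mode!r}")


# Toy data: 4 observations, 3 features, 2 time points
data = np.ones((4, 3, 2))
data[:3, 1, 0] = np.nan  # feature 1 has a single value at time point 0
data[:, 2, :] = np.nan   # feature 2 is missing everywhere

print(feature_mask(data, min_obs=2, time_mode="all").tolist())
# [True, False, False]
print(feature_mask(data, min_obs=2, time_mode="any").tolist())
# [True, True, False]
print(feature_mask(data, min_obs=2, time_mode="proportion", prop=0.5).tolist())
# [True, True, False]
```

In this toy case, feature 1 reaches min_obs=2 at only one of its two time points, so it is dropped under 'all' but kept under 'any' and under 'proportion' with prop=0.5.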