ehrapy.preprocessing.filter_observations

ehrapy.preprocessing.filter_observations(edata, *, layer=None, min_vars=None, max_vars=None, time_mode='all', prop=None, copy=False)

Filter observations based on missing data thresholds (features/measurements).

Keep only observations that have at least min_vars and/or at most max_vars non-missing variables. A value is considered non-missing if it is valid (non-NaN / non-null). When a longitudinal EHRData is passed, the criteria are evaluated per time point and combined across time points according to time_mode.

Provide at least one of min_vars and max_vars.

Parameters:
  • edata (EHRData) – Central data object.

  • layer (str | None, default: None) – Layer to use for filtering. If None (default), filtering is done on .X.

  • min_vars (int | None, default: None) – Minimum number of variables required for an observation to pass filtering.

  • max_vars (int | None, default: None) – Maximum number of variables allowed for an observation to pass filtering.

  • time_mode (Literal['all', 'any', 'proportion'], default: 'all') –

    How to combine filtering criteria across the time axis. Only relevant if a longitudinal EHRData is passed. Options are:

    • 'all' (default): The observation must pass the filtering criteria in all time points.

    • 'any': The observation must pass the filtering criteria in at least one time point.

    • 'proportion': The observation must pass the filtering criteria in at least a proportion prop of time points.

      For example, with prop=0.3, the observation must pass the filtering criteria in at least 30% of the time points.

  • prop (float | None, default: None) – Proportion of time points in which the observation must pass the filtering criteria. Only relevant if time_mode='proportion'.

  • copy (bool, default: False) – Determines whether a copy is returned.

Return type:

EHRData | None

Returns:

If copy=True, returns a filtered and annotated copy of the data object. Otherwise, subsets and annotates the passed data object in place and returns None.

Examples

>>> import ehrapy as ep
>>> import ehrdata as ed
>>> edata = ed.dt.ehrdata_blobs(
...     n_variables=45, n_observations=500, base_timepoints=15, missing_values=0.6, layer="tem_data"
... )
>>> edata.layers["tem_data"].shape
(500, 45, 15)
>>> ep.pp.filter_observations(edata, min_vars=10, time_mode="all", layer="tem_data")
>>> edata.layers["tem_data"].shape
(477, 45, 15)
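
The way time_mode combines the per-time-point criteria can be illustrated with a small dependency-free sketch. This is not ehrapy's implementation: passes_filter is a hypothetical helper written for illustration, it handles a single observation stored as obs[variable][time_point], and it assumes that min_vars and max_vars, when both are given, must both hold at a time point.

```python
import math


def passes_filter(obs, min_vars=None, max_vars=None, time_mode="all", prop=None):
    """Hypothetical helper (not part of ehrapy): decide whether one observation is kept.

    obs is a list of variables, each a list of values over time points,
    i.e. obs[variable][time_point]; NaN marks a missing value.
    """
    n_timepoints = len(obs[0])
    passed = []
    for t in range(n_timepoints):
        # Count the variables that have a valid (non-NaN) value at time point t.
        n_valid = sum(not math.isnan(var[t]) for var in obs)
        # Assumption: both thresholds must hold when both are provided.
        ok = (min_vars is None or n_valid >= min_vars) and (
            max_vars is None or n_valid <= max_vars
        )
        passed.append(ok)

    if time_mode == "all":
        return all(passed)
    if time_mode == "any":
        return any(passed)
    if time_mode == "proportion":
        return sum(passed) / n_timepoints >= prop
    raise ValueError(f"unknown time_mode: {time_mode!r}")


nan = float("nan")
# One observation with 3 variables and 4 time points.
# Valid variables per time point: 2, 2, 2, 0.
obs = [
    [1.0, nan, 2.0, nan],
    [0.5, 0.1, nan, nan],
    [nan, 0.3, 0.7, nan],
]

print(passes_filter(obs, min_vars=2, time_mode="all"))                    # False
print(passes_filter(obs, min_vars=2, time_mode="any"))                    # True
print(passes_filter(obs, min_vars=2, time_mode="proportion", prop=0.75))  # True
```

With min_vars=2, the observation passes at three of its four time points, so it is dropped under 'all', kept under 'any', and kept under 'proportion' as long as prop is at most 0.75.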