ehrapy.preprocessing.filter_observations
- ehrapy.preprocessing.filter_observations(edata, *, layer=None, min_vars=None, max_vars=None, time_mode='all', prop=None, copy=False)
Filter observations based on missing data thresholds (features/measurements).
Keep only observations that have at least min_vars and/or at most max_vars non-missing variables. A value counts as non-missing if it is valid (non-NaN / non-null). When a longitudinal EHRData is passed, filtering can be done across time points.
Provide at least one of min_vars and max_vars.
- Parameters:
edata (EHRData) – Central data object.
layer (str | None, default: None) – Layer to use for filtering. If None (default), filtering is done on .X.
min_vars (int | None, default: None) – Minimum number of variables required for an observation to pass filtering.
max_vars (int | None, default: None) – Maximum number of variables allowed for an observation to pass filtering.
time_mode (Literal['all', 'any', 'proportion'], default: 'all') – How to combine filtering criteria across the time axis. Only relevant if a longitudinal EHRData is passed. Options are:
- 'all' (default): The observation must pass the filtering criteria in all time points.
- 'any': The observation must pass the filtering criteria in at least one time point.
- 'proportion': The observation must pass the filtering criteria in at least a proportion prop of time points. For example, with prop=0.3, the observation must pass the filtering criteria in at least 30% of the time points.
prop (float | None, default: None) – Proportion of time points in which the observation must pass the filtering criteria. Only relevant if time_mode='proportion'.
copy (bool, default: False) – Determines whether a copy is returned.
- Return type:
- Returns:
Depending on copy, either returns a filtered and annotated copy of the data object, or subsets and annotates the passed data object in place.
Examples
>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.ehrdata_blobs(
...     n_variables=45, n_observations=500, base_timepoints=15, missing_values=0.6, layer="tem_data"
... )
>>> edata.layers["tem_data"].shape
(500, 45, 15)
>>> ep.pp.filter_observations(edata, min_vars=10, time_mode="all", layer="tem_data")
>>> edata.layers["tem_data"].shape
(477, 45, 15)
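The three time_mode options can be illustrated with a minimal NumPy sketch. This is not ehrapy's implementation: keep_mask is a hypothetical helper written only for this illustration, and it assumes the data is a float array of shape (n_observations, n_variables, n_timepoints), like the layer shape shown above, with NaN marking missing values.

```python
import numpy as np


def keep_mask(values, min_vars=None, max_vars=None, time_mode="all", prop=None):
    """Hypothetical helper (not part of ehrapy): which observations to keep.

    values: float array of shape (n_obs, n_vars, n_timepoints), NaN = missing.
    Returns a boolean array of shape (n_obs,).
    """
    # Number of non-missing variables per observation and time point.
    counts = (~np.isnan(values)).sum(axis=1)  # shape (n_obs, n_timepoints)

    # Apply the thresholds separately at every time point.
    passed = np.ones(counts.shape, dtype=bool)
    if min_vars is not None:
        passed &= counts >= min_vars
    if max_vars is not None:
        passed &= counts <= max_vars

    # Combine the per-time-point results along the time axis.
    if time_mode == "all":
        return passed.all(axis=1)
    if time_mode == "any":
        return passed.any(axis=1)
    if time_mode == "proportion":
        if prop is None:
            raise ValueError("prop is required when time_mode='proportion'")
        return passed.mean(axis=1) >= prop
    raise ValueError(f"unknown time_mode: {time_mode!r}")


# Four observations, two variables, four time points.
values = np.ones((4, 2, 4))
values[1, 0, 2:] = np.nan  # complete at 2 of 4 time points
values[2, 0, 1:] = np.nan  # complete at 1 of 4 time points
values[3] = np.nan         # never complete

mask_all = keep_mask(values, min_vars=2, time_mode="all")
mask_any = keep_mask(values, min_vars=2, time_mode="any")
mask_prop = keep_mask(values, min_vars=2, time_mode="proportion", prop=0.3)
```

With min_vars=2, the observations pass at 100%, 50%, 25%, and 0% of time points respectively, so 'all' keeps only the first, 'any' keeps the first three, and 'proportion' with prop=0.3 keeps the first two.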