ehrapy.preprocessing.filter_features
- ehrapy.preprocessing.filter_features(edata, *, layer=None, min_obs=None, max_obs=None, time_mode='all', prop=None, copy=False)
Filter features based on missing data thresholds.
Keep only features which have at least min_obs observations and/or have at most max_obs observations. An observation is considered non-missing if it contains a valid (non-NaN / non-null) value.
When a longitudinal EHRData is passed, filtering can be done across time points according to the specified time_mode.
Only provide one of min_obs and/or max_obs.
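The counting rule described above can be illustrated with plain NumPy. This is a minimal sketch on a toy 2D array, not ehrapy's implementation; the array, the threshold values, and the variable names are assumptions made for illustration only.

```python
import numpy as np

# Toy matrix: 4 observations (rows) x 3 features (columns); NaN marks a missing value.
X = np.array(
    [
        [1.0, np.nan, 3.0],
        [2.0, np.nan, 6.0],
        [3.0, 5.0, np.nan],
        [4.0, np.nan, np.nan],
    ]
)

# Count the non-missing observations of each feature (column).
n_obs_per_feature = (~np.isnan(X)).sum(axis=0)

# Illustrative threshold: keep features with at least 2 non-missing observations.
min_obs = 2
keep = n_obs_per_feature >= min_obs

# Subset the columns that pass the threshold.
X_filtered = X[:, keep]
print(n_obs_per_feature, keep, X_filtered.shape)
```

A max_obs threshold would work the same way with the comparison reversed (`n_obs_per_feature <= max_obs`).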
- Parameters:
layer (str|None, default:None) – Layer to use for filtering. If None (default), filtering is done on .X.
min_obs (int|None, default:None) – Minimum number of observations required for a feature to pass filtering.
max_obs (int|None, default:None) – Maximum number of observations allowed for a feature to pass filtering.
time_mode (Literal['all','any','proportion'], default:'all') – How to combine filtering criteria across the time axis. Use it only with 3-dimensional EHRData objects. Options are:
- 'all' (default): The feature must pass the filtering criteria in all time points.
- 'any': The feature must pass the filtering criteria in at least one time point.
- 'proportion': The feature must pass the filtering criteria in at least a proportion prop of time points. For example, with prop=0.3, the feature must pass the filtering criteria in at least 30% of the time points.
prop (float|None, default:None) – Proportion of time points in which the feature must pass the filtering criteria. Only relevant if time_mode='proportion'.
copy (bool, default:False) – Determines whether a copy is returned.
- Return type:
- Returns:
Depending on copy, either subsets and annotates the passed data object in place (copy=False) or returns a filtered copy of the data object (copy=True).
Examples
>>> import ehrapy as ep
>>> import ehrdata as ed
>>> edata = ed.dt.ehrdata_blobs(
...     n_variables=45, n_observations=500, base_timepoints=15, missing_values=0.6, layer="tem_data"
... )
>>> edata.layers["tem_data"].shape
(500, 45, 15)
>>> ep.pp.filter_features(edata, min_obs=185, time_mode="all", layer="tem_data")
>>> edata.layers["tem_data"].shape
(500, 18, 15)
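To make the three time_mode options concrete, the following sketch shows one way per-time-point results could be combined on an array of shape (observations, features, time points), the layout used in the example above. It is an illustration in plain NumPy, not ehrapy's actual implementation; the helper name `combine_over_time` and the toy data are hypothetical, and it assumes the min_obs threshold is checked separately at each time point.

```python
import numpy as np


def combine_over_time(X, min_obs, time_mode="all", prop=None):
    """Hypothetical helper: return a boolean keep-mask with one entry per feature."""
    # Non-missing observations per feature and time point: shape (n_features, n_time).
    counts = (~np.isnan(X)).sum(axis=0)
    # Whether each feature passes the threshold at each time point.
    passes = counts >= min_obs
    if time_mode == "all":
        return passes.all(axis=1)
    if time_mode == "any":
        return passes.any(axis=1)
    if time_mode == "proportion":
        if prop is None:
            raise ValueError("prop is required when time_mode='proportion'")
        # Fraction of time points at which the feature passes.
        return passes.mean(axis=1) >= prop
    raise ValueError(f"unknown time_mode: {time_mode!r}")


# Toy data: 3 observations, 3 features, 4 time points.
X = np.ones((3, 3, 4))
X[:, 1, 2:] = np.nan  # feature 1 is observed only at the first two time points
X[:, 2, :] = np.nan   # feature 2 is never observed

keep_all = combine_over_time(X, min_obs=2, time_mode="all")
keep_any = combine_over_time(X, min_obs=2, time_mode="any")
keep_prop = combine_over_time(X, min_obs=2, time_mode="proportion", prop=0.3)
print(keep_all, keep_any, keep_prop)
```

In this toy case feature 1 passes at half of the time points, so it is dropped under 'all' but kept under 'any' and under 'proportion' with prop=0.3.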