ehrapy.preprocessing.qc_metrics
- ehrapy.preprocessing.qc_metrics(edata, qc_vars=(), *, layer=None)
Calculates various quality control metrics.
Uses the original values, not the encoded ones, to calculate the metrics. See the Returns section for a more detailed description of the default and extended metrics. If infer_feature_types() is run first, extended metrics that require feature type information are calculated in addition to the default metrics.
- Parameters:
- Return type:
- Returns:
Two Pandas DataFrames of all calculated QC metrics for obs and var respectively.
Default observation level metrics include:
missing_values_abs: Absolute amount of missing values.
missing_values_pct: Relative amount of missing values in percent.
entropy_of_missingness: Entropy of the missingness pattern for each observation. Higher values indicate a more heterogeneous (less structured) missingness pattern.
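The three default observation level metrics can be illustrated for a single observation with the following minimal sketch. It assumes that entropy_of_missingness is the binary Shannon entropy of the missing fraction, which is one plausible reading of the description above; the formula ehrapy actually uses may differ. The helper name missingness_metrics is hypothetical and not part of ehrapy.

```python
import math


def missingness_metrics(row):
    """Return (missing_values_abs, missing_values_pct, entropy) for one non-empty observation.

    Hypothetical helper for illustration only; not part of ehrapy.
    """
    # Treat None and float NaN as missing.
    n_missing = sum(
        1 for v in row if v is None or (isinstance(v, float) and math.isnan(v))
    )
    p = n_missing / len(row)  # fraction of missing values
    # Assumed formula: binary Shannon entropy of the missing fraction.
    # It is 0 when nothing or everything is missing and peaks at 50% missing.
    if p in (0.0, 1.0):
        entropy = 0.0
    else:
        entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return n_missing, 100.0 * p, entropy


abs_missing, pct_missing, entropy = missingness_metrics([1.0, None, 3.0, float("nan")])
```

Under this assumption, an observation with half of its values missing has the highest entropy, and a fully observed or fully missing observation has an entropy of zero.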
Extended observation level metrics include (only computed if infer_feature_types() is run first):
unique_values_abs: Absolute amount of unique values. Returned as NaN for numeric features.
unique_values_ratio: Relative amount of unique values in percent. Returned as NaN for numeric features.
Default feature level metrics include:
missing_values_abs: Absolute amount of missing values.
missing_values_pct: Relative amount of missing values in percent.
entropy_of_missingness: Entropy of the missingness pattern for each feature. Higher values indicate a more heterogeneous (less structured) missingness pattern.
mean: Mean value of the features.
median: Median value of the features.
std: Standard deviation of the features.
min: Minimum value of the features.
max: Maximum value of the features.
iqr_outliers: Whether the feature contains outliers based on the interquartile range (IQR) method.
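The iqr_outliers flag can be illustrated with the sketch below. It assumes the conventional Tukey fences, where a value counts as an outlier if it lies more than 1.5 times the IQR below the first quartile or above the third quartile. The multiplier, the quartile interpolation method, and the helper name has_iqr_outliers are assumptions made for illustration and are not taken from ehrapy.

```python
import statistics


def has_iqr_outliers(values, k=1.5):
    """Return True if any non-missing value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration only; k=1.5 is the conventional
    Tukey multiplier and is an assumption here.
    """
    data = sorted(v for v in values if v is not None)
    if len(data) < 2:
        # Quartiles are not meaningful for fewer than two data points.
        return False
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return any(v < lower or v > upper for v in data)


flag_with_outlier = has_iqr_outliers([1.0, 2.0, 3.0, 4.0, 100.0])
flag_without_outlier = has_iqr_outliers([1.0, 2.0, 3.0, 4.0, 5.0])
```

A different quartile definition shifts the fences slightly, so borderline values may be flagged differently by another implementation.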
Extended feature level metrics include (only computed if infer_feature_types() is run first):
unique_values_abs: Absolute amount of unique values. Returned as NaN for numeric features.
unique_values_ratio: Relative amount of unique values in percent. Returned as NaN for numeric features.
coefficient_of_variation: Coefficient of variation of the features.
is_constant: Whether the feature is constant (with near zero variance).
constant_variable_ratio: Relative amount of constant features in percent.
range_ratio: Relative dispersion of feature values with respect to their mean.
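The dispersion-related extended metrics can be sketched for a single numeric feature as follows. The formulas are assumptions based on the textbook definitions: coefficient of variation as standard deviation divided by mean, range ratio as (max - min) divided by mean, and a feature counted as constant when its variance falls below a small tolerance. Whether ehrapy uses the population or sample standard deviation, scales the results to percent, or applies a different tolerance is not stated above. The helper name dispersion_metrics and the parameter tol are hypothetical.

```python
import math
import statistics


def dispersion_metrics(values, tol=1e-8):
    """Return assumed dispersion metrics for one numeric feature.

    Hypothetical helper for illustration only; formulas and tolerance are
    assumptions, not ehrapy's implementation.
    """
    data = [v for v in values if v is not None]  # drop missing values
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)  # population standard deviation (assumed)
    is_constant = std**2 <= tol  # near-zero variance
    if mean == 0:
        # Both ratios are undefined when the mean is zero.
        cv = range_ratio = math.nan
    else:
        cv = std / mean
        range_ratio = (max(data) - min(data)) / mean
    return {
        "coefficient_of_variation": cv,
        "is_constant": is_constant,
        "range_ratio": range_ratio,
    }


metrics = dispersion_metrics([2.0, 4.0, 6.0])
```

Both ratios are unitless here; multiplying by 100 would express them in percent.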
Examples
>>> import ehrapy as ep
>>> import ehrdata as ed  # assumed source of the ed alias used below
>>> edata = ed.dt.mimic_2()
>>> obs_qc, var_qc = ep.pp.qc_metrics(edata)
>>> obs_qc.head()
>>> var_qc.head()