ehrapy.preprocessing.qc_metrics

ehrapy.preprocessing.qc_metrics(edata, qc_vars=(), *, layer=None)

Calculates various quality control metrics.

Metrics are computed from the original values, not the encoded ones. See the Returns section for a more in-depth description of the default and extended metrics. If infer_feature_types() is run first, extended metrics that require feature type information are calculated in addition to the default metrics.

Parameters:
  • edata (EHRData) – Central data object.

  • qc_vars (Collection[str], default: ()) – Optional collection of variables to calculate additional metrics for.

  • layer (str | None, default: None) – Layer to use to calculate the metrics.

Return type:

tuple[DataFrame, DataFrame]

Returns:

Two pandas DataFrames containing all calculated QC metrics, for obs and var respectively.

Default observation level metrics include:

  • missing_values_abs: Absolute amount of missing values.

  • missing_values_pct: Relative amount of missing values in percent.

  • entropy_of_missingness: Entropy of the missingness pattern for each observation. Higher values indicate a more heterogeneous (less structured) missingness pattern.
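The three default metrics above can be illustrated on a toy table. This is a minimal sketch using plain pandas and NumPy, not ehrapy's implementation: the column names and values are made up, and the entropy is computed as the binary Shannon entropy of the missing/present indicator, which is an assumption about the exact formula.

```python
import numpy as np
import pandas as pd

# Toy table: 3 observations (rows) x 4 features (columns); NaN marks a missing value.
# Column names and values are hypothetical.
df = pd.DataFrame(
    {
        "age": [63.0, np.nan, 71.0],
        "bmi": [np.nan, np.nan, 24.5],
        "sbp": [120.0, np.nan, 135.0],
        "hr": [80.0, 72.0, 66.0],
    }
)

missing = df.isna()

# Per-observation count and percentage of missing values (axis=1 aggregates across features).
missing_values_abs = missing.sum(axis=1)
missing_values_pct = missing.mean(axis=1) * 100


def binary_entropy(p: pd.Series) -> pd.Series:
    """Shannon entropy in bits of a missing/present indicator with P(missing) = p.

    Assumed definition for illustration; ehrapy may compute it differently.
    """
    q = p.clip(1e-10, 1 - 1e-10)  # avoid log2(0) when nothing or everything is missing
    return -(q * np.log2(q) + (1 - q) * np.log2(1 - q))


entropy_of_missingness = binary_entropy(missing.mean(axis=1))

obs_sketch = pd.DataFrame(
    {
        "missing_values_abs": missing_values_abs,
        "missing_values_pct": missing_values_pct,
        "entropy_of_missingness": entropy_of_missingness,
    }
)
print(obs_sketch)
```

Under this assumed definition, an observation with no missing values or only missing values has entropy close to 0, and the entropy peaks at 1 bit when half of the values are missing. Swapping `axis=1` for `axis=0` gives the corresponding feature-level quantities.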

Extended observation level metrics include (only computed if infer_feature_types() is run first):

  • unique_values_abs: Absolute amount of unique values. Returned as NaN for numeric features.

  • unique_values_ratio: Relative amount of unique values in percent. Returned as NaN for numeric features.

Default feature level metrics include:

  • missing_values_abs: Absolute amount of missing values.

  • missing_values_pct: Relative amount of missing values in percent.

  • entropy_of_missingness: Entropy of the missingness pattern for each feature. Higher values indicate a more heterogeneous (less structured) missingness pattern.

  • mean: Mean value of the features.

  • median: Median value of the features.

  • std: Standard deviation of the features.

  • min: Minimum value of the features.

  • max: Maximum value of the features.

  • iqr_outliers: Whether the feature contains outliers based on the interquartile range (IQR) method.
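The iqr_outliers flag can be sketched with the conventional Tukey fences. The helper `has_iqr_outliers` and the multiplier `k=1.5` are assumptions for illustration; the text above does not state which multiplier or quantile interpolation ehrapy uses.

```python
import numpy as np


def has_iqr_outliers(values, k=1.5):
    """Return True if any non-missing value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper; k=1.5 is the conventional choice, assumed here.
    """
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # ignore missing values
    if x.size == 0:
        return False
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return bool(np.any((x < lower) | (x > upper)))


# Made-up feature values: one heart-rate reading is far from the rest.
heart_rate = [62, 65, 70, 71, 74, 78, 180]
age = [34, 41, 47, 52, 58, 63, np.nan]

print(has_iqr_outliers(heart_rate))  # True: 180 exceeds the upper fence
print(has_iqr_outliers(age))         # False: all values fall within the fences
```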

Extended feature level metrics include (only computed if infer_feature_types() is run first):

  • unique_values_abs: Absolute amount of unique values. Returned as NaN for numeric features.

  • unique_values_ratio: Relative amount of unique values in percent. Returned as NaN for numeric features.

  • coefficient_of_variation: Coefficient of variation of the features.

  • is_constant: Whether the feature is constant (has near-zero variance).

  • constant_variable_ratio: Relative amount of constant features in percent.

  • range_ratio: Relative dispersion of feature values with respect to their mean.
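The dispersion-related metrics in this list can be sketched with the standard library. The formulas below are assumptions based on the usual textbook definitions (standard deviation over mean, and value range over mean, both in percent); the helper `dispersion_metrics`, the tolerance `tol`, and the use of the sample standard deviation are hypothetical choices, not confirmed ehrapy behavior.

```python
import math
import statistics


def dispersion_metrics(values, tol=1e-8):
    """Compute illustrative dispersion metrics for one numeric feature.

    Assumed definitions (not taken from ehrapy's source):
      coefficient_of_variation = sample std / mean * 100
      range_ratio              = (max - min) / mean * 100
      is_constant              = population variance <= tol
    """
    x = [float(v) for v in values if not math.isnan(v)]  # drop missing values
    mean = statistics.fmean(x)
    std = statistics.stdev(x)  # sample standard deviation; needs at least 2 values
    if mean == 0:
        cv = range_ratio = math.nan  # ratios are undefined for a zero mean
    else:
        cv = std / mean * 100
        range_ratio = (max(x) - min(x)) / mean * 100
    return {
        "coefficient_of_variation": cv,
        "range_ratio": range_ratio,
        "is_constant": statistics.pvariance(x) <= tol,
    }


varying = dispersion_metrics([10, 20, 30])
constant = dispersion_metrics([5, 5, 5])
print(varying)
print(constant)
```

Both ratios divide by the mean, so they are only meaningful for features whose mean is not close to zero; the sketch returns NaN in the exact zero-mean case.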

Examples

>>> import ehrapy as ep
>>> import ehrdata as ed
>>> edata = ed.dt.mimic_2()
>>> obs_qc, var_qc = ep.pp.qc_metrics(edata)
>>> obs_qc.head()
>>> var_qc.head()
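Because both return values are ordinary pandas DataFrames, they can be filtered like any other table. The sketch below uses a stand-in for the feature-level DataFrame with made-up feature names and values; only the column names come from the metric list above, and the 50 percent threshold is an arbitrary illustrative choice.

```python
import pandas as pd

# Stand-in for the feature-level DataFrame returned by qc_metrics.
# Feature names and numbers are hypothetical.
var_qc = pd.DataFrame(
    {
        "missing_values_abs": [0, 12, 95],
        "missing_values_pct": [0.0, 12.0, 95.0],
    },
    index=["age", "bmi", "lactate"],
)

# Select features whose share of missing values exceeds an arbitrary threshold.
max_missing_pct = 50.0
too_sparse = var_qc.index[var_qc["missing_values_pct"] > max_missing_pct].tolist()
print(too_sparse)  # ['lactate']
```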