ehrapy.preprocessing.detect_bias
- ehrapy.preprocessing.detect_bias(edata, sensitive_features, *, run_feature_importances=None, corr_threshold=0.5, smd_threshold=0.5, categorical_factor_threshold=2, feature_importance_threshold=0.1, prediction_confidence_threshold=0.5, corr_method='spearman', layer=None, copy=False)
Detects biases in the data using feature correlations, standardized mean differences, and feature importances.
Detects biases with respect to sensitive features, which can be either a specified subset of features or all features in .var. The method detects biases by computing:
pairwise correlations between features
standardized mean differences for numeric features between groups of sensitive features
value counts of categorical features between groups of sensitive features
feature importances for predicting one feature with another
Results of the computations are stored in .var, .varp, and .uns of the edata object. Values that exceed the specified thresholds are considered of interest and are returned in the results dictionary. Note that the results depend on how the data is encoded. For example, with one-hot encoding each category of a categorical feature is treated as a separate feature, which can increase the number of detected biases. Take this into account when interpreting the results.
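To make the standardized mean difference check concrete, the following is a minimal sketch using one common definition (difference of group means divided by the pooled standard deviation). It is an illustration only: the exact formula used inside detect_bias is not stated here, and the helper name and the age values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_mean_difference(group_a, group_b):
    """Difference of means divided by the pooled standard deviation.

    Assumption: this is a textbook SMD definition, not necessarily the
    formula that detect_bias applies internally.
    """
    pooled_sd = sqrt((stdev(group_a) ** 2 + stdev(group_b) ** 2) / 2)
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Hypothetical ages for the two groups of a binary sensitive feature.
age_group_0 = [50.0, 55.0, 60.0, 65.0, 70.0]
age_group_1 = [60.0, 65.0, 70.0, 75.0, 80.0]

smd = standardized_mean_difference(age_group_0, age_group_1)

# Compare the magnitude against the documented default smd_threshold of 0.5.
smd_threshold = 0.5
flagged = abs(smd) > smd_threshold
```

Here the group means differ by more than one pooled standard deviation, so the feature would count as "of interest" under the default threshold.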
- Parameters:
edata (EHRData | AnnData) – Central data object. Encoded features are required for bias detection.
sensitive_features (Iterable[str] | Literal['all']) – Sensitive features to consider for bias detection. If set to "all", all features in .var will be considered.
run_feature_importances (bool | None, default: None) – Whether to run feature importances for detecting bias. If set to None, the function will run feature importances if sensitive_features is not set to "all", as this can be computationally expensive.
corr_threshold (float, default: 0.5) – The threshold for the correlation coefficient between two features to be considered of interest.
smd_threshold (float, default: 0.5) – The threshold for the standardized mean difference between two features to be considered of interest.
categorical_factor_threshold (float, default: 2) – The threshold for the factor between the value counts (as percentages) of a feature compared between two groups of a sensitive feature.
feature_importance_threshold (float, default: 0.1) – The threshold for the feature importance of a sensitive feature for predicting another feature to be considered of interest.
prediction_confidence_threshold (float, default: 0.5) – The threshold for the prediction confidence (R2 or accuracy) of a sensitive feature for predicting another feature to be considered of interest.
corr_method (Literal['pearson', 'spearman'], default: 'spearman') – The correlation method to use.
layer (str | None, default: None) – The layer in .layers to use for computation. If None, .X will be used.
copy (bool, default: False) – If set to False, edata is updated in place. If set to True, the edata is copied and the results are stored in the copied edata, which is then returned.
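The categorical_factor_threshold parameter compares value counts, expressed as percentages, between two groups of a sensitive feature. The sketch below shows one plausible reading of that comparison in plain Python. The helper names, the insurance categories, the strict ">" comparison, and the choice to skip categories missing from one group are all assumptions made for illustration, not ehrapy's implementation.

```python
from collections import Counter


def value_percentages(values):
    """Share of each category among the given values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def categorical_factors(group_a, group_b):
    """Ratio of the larger to the smaller share, per category.

    Assumption: categories that appear in only one group are skipped.
    """
    pct_a = value_percentages(group_a)
    pct_b = value_percentages(group_b)
    factors = {}
    for category in pct_a.keys() & pct_b.keys():
        larger = max(pct_a[category], pct_b[category])
        smaller = min(pct_a[category], pct_b[category])
        factors[category] = larger / smaller
    return factors


# Hypothetical insurance types for the two groups of a sensitive feature.
insurance_group_0 = ["private"] * 6 + ["public"] * 4
insurance_group_1 = ["private"] * 2 + ["public"] * 5 + ["self_pay"] * 3

factors = categorical_factors(insurance_group_0, insurance_group_1)

# Keep categories whose factor exceeds the documented default of 2.
categorical_factor_threshold = 2
flagged = {c: f for c, f in factors.items() if f > categorical_factor_threshold}
```

In this toy data, "private" makes up 60% of one group and 20% of the other, a factor of about 3, so it is the only category flagged.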
- Return type:
dict[str, DataFrame] | tuple[dict[str, DataFrame], EHRData | AnnData]
- Returns:
A dictionary containing the results of the bias detection. The keys are:
"feature_correlations": Pairwise correlations between features that exceed the correlation threshold.
"standardized_mean_differences": Standardized mean differences between groups of sensitive features that exceed the SMD threshold.
"categorical_value_counts": Value counts of categorical features between groups of sensitive features that exceed the categorical factor threshold.
"feature_importances": Feature importances for predicting one feature with another that exceed the feature importance and prediction confidence thresholds.
If copy is set to True, the function returns a tuple with the results dictionary and the updated edata.
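Because the return shape depends on copy, calling code that accepts either form may want to normalise it first. The helper below is a hypothetical convenience function, not part of ehrapy. It relies only on the documented fact that the result is either a dictionary or a (dictionary, edata) tuple, and it is exercised here with stand-in objects rather than a real detect_bias call.

```python
def split_detect_bias_result(result):
    """Return (results_dict, edata_or_None) for either documented return shape.

    Hypothetical helper: with copy=False the function returns only the results
    dictionary, with copy=True it returns (results_dict, copied_edata).
    """
    if isinstance(result, tuple):
        results_dict, edata_copy = result
        return results_dict, edata_copy
    return result, None


# Stand-ins for the two return shapes (real values would be DataFrames and an edata object).
fake_results = {"feature_correlations": "placeholder DataFrame"}
fake_edata = object()

results_in_place, edata_in_place = split_detect_bias_result(fake_results)
results_copied, edata_copied = split_detect_bias_result((fake_results, fake_edata))
```

With copy=False the second element is None, which signals that the original edata was updated in place.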
Examples
>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.mimic_2()
>>> ed.infer_feature_types(edata)
>>> edata = ep.pp.encode(edata, autodetect=True, encodings="label")
>>> results_dict = ep.pp.detect_bias(edata, "all")

>>> # Example with specified sensitive features
>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.diabetes_130_fairlearn()
>>> ed.infer_feature_types(edata)
>>> edata = ep.pp.encode(edata, autodetect=True, encodings="label")
>>> results_dict = ep.pp.detect_bias(edata, sensitive_features=["race", "gender"])