ehrapy.preprocessing.detect_bias

ehrapy.preprocessing.detect_bias(adata, sensitive_features, *, run_feature_importances=None, corr_threshold=0.5, smd_threshold=0.5, categorical_factor_threshold=2, feature_importance_threshold=0.1, prediction_confidence_threshold=0.5, corr_method='spearman', layer=None, copy=False)[source]

Detects biases in the data using feature correlations, standardized mean differences, and feature importances.

Biases are detected with respect to sensitive features, which can be either a specified subset of features or all features in adata.var. The method computes:

  • pairwise correlations between features

  • standardized mean differences for numeric features between groups of sensitive features

  • value counts of categorical features between groups of sensitive features

  • feature importances for predicting one feature with another
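As a sketch of the second item: a standardized mean difference for a numeric feature between two groups is commonly defined as the difference of group means divided by the pooled standard deviation. The helper below is illustrative only, not ehrapy's internal implementation, which may use a variant of this formula.

```python
import numpy as np

def standardized_mean_difference(x_group1, x_group2):
    # One common SMD definition: difference of group means divided by
    # the pooled standard deviation of the two groups (illustrative;
    # ehrapy may use a variant).
    m1, m2 = np.mean(x_group1), np.mean(x_group2)
    pooled_sd = np.sqrt((np.var(x_group1, ddof=1) + np.var(x_group2, ddof=1)) / 2)
    return (m1 - m2) / pooled_sd

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([3.0, 4.0, 5.0, 6.0])
smd = standardized_mean_difference(a, b)
# |smd| well above the default smd_threshold of 0.5, so this feature
# would be considered of interest
```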

Results of the computations are stored in var, varp, and uns of the adata object. Values that exceed the specified thresholds are considered of interest and returned in the results dictionary. Note that the results depend on the encoding of the data: with one-hot encoding, for example, each category of a categorical feature is treated as a separate feature, which can lead to an increased number of detected biases. Take this into consideration when interpreting the results.
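The pairwise-correlation check, for instance, can be sketched with pandas alone. The helper below (a hypothetical `correlated_pairs`, not part of ehrapy) computes a Spearman correlation matrix and keeps the pairs whose absolute coefficient exceeds the threshold, mirroring the role of `corr_threshold`:

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.5, method="spearman"):
    # Pairwise correlation matrix over all numeric columns.
    corr = df.corr(method=method)
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:  # mirrors corr_threshold
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({
    "a": x,
    "b": x + rng.normal(scale=0.1, size=100),  # strongly related to "a"
    "c": rng.normal(size=100),                 # independent noise
})
# only the ("a", "b") pair exceeds the 0.5 threshold
```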

Parameters:
  • adata (AnnData) – An annotated data matrix containing EHR data. Encoded features are required for bias detection.

  • sensitive_features (Union[Iterable[str], Literal['all']]) – Sensitive features to consider for bias detection. If set to “all”, all features in adata.var will be considered.

  • run_feature_importances (bool | None) – Whether to run feature importances for detecting bias. If set to None, the function will run feature importances if sensitive_features is not set to “all”, as this can be computationally expensive. Defaults to None.

  • corr_threshold (float) – The threshold for the correlation coefficient between two features to be considered of interest. Defaults to 0.5.

  • smd_threshold (float) – The threshold for the standardized mean difference between two features to be considered of interest. Defaults to 0.5.

  • categorical_factor_threshold (float) – The threshold for the factor between the value counts (as percentages) of a feature compared between two groups of a sensitive feature. Defaults to 2.

  • feature_importance_threshold (float) – The threshold for the feature importance of a sensitive feature for predicting another feature to be considered of interest. Defaults to 0.1.

  • prediction_confidence_threshold (float) – The threshold for the prediction confidence (R2 or accuracy) of a sensitive feature for predicting another feature to be considered of interest. Defaults to 0.5.

  • corr_method (Literal['pearson', 'spearman']) – The correlation method to use. Defaults to “spearman”.

  • layer (str | None) – The layer in adata.layers to use for computation. If None, adata.X will be used. Defaults to None.

  • copy (bool) – If set to False, adata is updated in place. If set to True, the adata is copied and the results are stored in the copied adata, which is then returned. Defaults to False.
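To illustrate how categorical_factor_threshold is meant to work: for each category of a feature, the value-count percentages within two groups of a sensitive feature are compared, and a category is flagged when the larger percentage exceeds the smaller one by more than the given factor. The sketch below is a hypothetical helper, not ehrapy's implementation:

```python
import pandas as pd

def flagged_category_ratios(values, groups, factor_threshold=2.0):
    # Percentage of each category within each sensitive-feature group;
    # flag categories whose percentages differ by more than the factor
    # (illustrative only; the exact ehrapy computation may differ).
    df = pd.DataFrame({"value": values, "group": groups})
    pct = df.groupby("group")["value"].value_counts(normalize=True).unstack(fill_value=0) * 100
    g1, g2 = list(pct.index)[:2]
    flagged = []
    for cat in pct.columns:
        p1, p2 = pct.loc[g1, cat], pct.loc[g2, cat]
        if p1 > 0 and p2 > 0 and max(p1, p2) / min(p1, p2) > factor_threshold:
            flagged.append(cat)
    return flagged

# group A: 80% "high" / 20% "low"; group B: 30% "high" / 70% "low"
values = ["high"] * 8 + ["low"] * 2 + ["high"] * 3 + ["low"] * 7
groups = ["A"] * 10 + ["B"] * 10
# both categories differ by more than a factor of 2 between the groups
```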

Return type:

dict[str, DataFrame] | tuple[dict[str, DataFrame], AnnData]

Returns:

A dictionary containing the results of the bias detection. The keys are:

  • "feature_correlations": Pairwise correlations between features that exceed the correlation threshold.

  • "standardized_mean_differences": Standardized mean differences between groups of sensitive features that exceed the SMD threshold.

  • "categorical_value_counts": Value counts of categorical features between groups of sensitive features that exceed the categorical factor threshold.

  • "feature_importances": Feature importances for predicting one feature with another that exceed the feature importance and prediction confidence thresholds.

If copy is set to True, the function returns a tuple with the results dictionary and the updated adata.
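The interplay of feature_importance_threshold and prediction_confidence_threshold can be illustrated for a numeric target: a finding is only of interest when the sensitive feature matters to the model and the model predicts the target well (e.g. R² above the confidence threshold). A minimal sketch using a single-predictor least-squares fit (illustrative only; ehrapy fits its own models internally):

```python
import numpy as np

def r2_from_single_predictor(x, y):
    # Least-squares fit y ~ a*x + b; R^2 plays the role of the
    # "prediction confidence" for a numeric target.
    a, b = np.polyfit(x, y, 1)
    residuals = y - (a * x + b)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(42)
sensitive = rng.normal(size=200)  # stand-in for an encoded sensitive feature
target = 2 * sensitive + rng.normal(scale=0.5, size=200)
r2 = r2_from_single_predictor(sensitive, target)
# r2 is well above the default prediction_confidence_threshold of 0.5,
# so a strong feature importance here would be reported
```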

Examples

>>> import ehrapy as ep
>>> adata = ep.dt.mimic_2(encoded=False)
>>> ep.ad.infer_feature_types(adata)
>>> adata = ep.pp.encode(adata, autodetect=True, encodings="label")
>>> results_dict = ep.pp.detect_bias(adata, "all")
>>> # Example with specified sensitive features
>>> import ehrapy as ep
>>> adata = ep.dt.diabetes_130_fairlearn()
>>> ep.ad.infer_feature_types(adata)
>>> adata = ep.pp.encode(adata, autodetect=True, encodings="label")
>>> results_dict = ep.pp.detect_bias(adata, sensitive_features=["race", "gender"])