ehrapy.preprocessing.variable_correlations

ehrapy.preprocessing.variable_correlations(edata, *, layer=None, var_names=None, method='pearson', agg='mean', correction_method='bonferroni', alpha=0.05)

Compute correlation matrix with statistical testing and multiple testing correction.

This function computes pairwise correlations between variables in the given EHRData object, automatically handling missing values through pairwise deletion. For 3D time-series data, values are aggregated across time before computing correlations.
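The two preprocessing steps described above, aggregation across time and pairwise deletion, can be sketched with plain NumPy. This is an illustration only, not ehrapy's implementation: the helper names `aggregate_time` and `pairwise_pearson` are hypothetical, the array layout (observations, variables, time points) is an assumption, and how ehrapy treats missing values inside "first" and "last" is not stated in this page.

```python
import warnings

import numpy as np


def aggregate_time(arr, agg="mean"):
    """Collapse an (n_obs, n_vars, n_timepoints) array to (n_obs, n_vars).

    Hypothetical helper; the axis order is an assumption.
    """
    if agg == "mean":
        # nanmean ignores missing time points; a variable with no observed
        # time point for an observation stays NaN (warning silenced here).
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=RuntimeWarning)
            return np.nanmean(arr, axis=2)
    if agg == "first":
        return arr[:, :, 0]
    if agg == "last":
        return arr[:, :, -1]
    raise ValueError(f"unknown agg: {agg!r}")


def pairwise_pearson(x, y):
    """Pearson r over the observations where both x and y are present."""
    keep = ~np.isnan(x) & ~np.isnan(y)
    r = np.corrcoef(x[keep], y[keep])[0, 1]
    return float(r), int(keep.sum())


# Toy data: 5 observations, 2 variables, 2 time points.
arr = np.array([
    [[0.0, 2.0], [2.0, 2.0]],
    [[2.0, 2.0], [4.0, np.nan]],       # one time point missing
    [[3.0, 3.0], [np.nan, np.nan]],    # variable 2 entirely missing
    [[4.0, 4.0], [8.0, 8.0]],
    [[5.0, 5.0], [10.0, 10.0]],
])

agg = aggregate_time(arr, agg="mean")          # shape (5, 2)
r, n_used = pairwise_pearson(agg[:, 0], agg[:, 1])
print(agg.shape, n_used, round(r, 6))          # (5, 2) 4 1.0
```

Pairwise deletion means each variable pair is computed on its own set of complete observations, so different entries of the correlation matrix may rest on different sample sizes.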

Parameters:
  • edata (EHRData) – Central data object.

  • layer (str | None, default: None) – Layer to extract data from. If None, .X will be used.

  • var_names (Sequence[str] | None, default: None) – Names of the variables to correlate. If None, all numeric variables are used.

  • method (Literal['spearman', 'pearson', 'kendall'], default: 'pearson') – Correlation method, “spearman”, “kendall” or “pearson”.

  • agg (Literal['mean', 'last', 'first'], default: 'mean') – How to aggregate time dimension: “mean”, “last” or “first”.

  • correction_method (Literal['bonferroni', 'fdr_bh', 'fdr_tsbh', 'holm', 'none'], default: 'bonferroni') – Multiple testing correction method:

      • ‘bonferroni’: conservative Bonferroni correction.
      • ‘fdr_bh’: Benjamini-Hochberg false discovery rate (FDR) control.
      • ‘fdr_tsbh’: two-stage Benjamini-Hochberg, better calibrated when many variables are truly correlated.
      • ‘holm’: Holm-Bonferroni correction.
      • ‘none’: no multiple-testing correction.

  • alpha (float, default: 0.05) – Significance threshold after correction.
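To make the difference between the correction methods concrete, here is a minimal pure-Python sketch of three of them applied to a flat list of p-values. It is written from the textbook definitions and is not taken from ehrapy's source; the function names are hypothetical. The option names happen to match those accepted by statsmodels' `multipletests`, but whether ehrapy delegates to it is an assumption.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject p_i when p_i <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals, alpha=0.05):
    """Step-down Holm-Bonferroni: compare sorted p-values to alpha / (m - rank)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank is 0-based
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject


def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up BH: reject the k smallest p-values, k = max{k : p_(k) <= k/m * alpha}."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):  # rank is 1-based
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


pvals = [0.001, 0.012, 0.02, 0.045, 0.30]
print(bonferroni(pvals))          # [True, False, False, False, False]
print(holm(pvals))                # [True, True, False, False, False]
print(benjamini_hochberg(pvals))  # [True, True, True, False, False]
```

On the same p-values, Bonferroni rejects the fewest hypotheses and Benjamini-Hochberg the most, which is the usual trade-off between controlling the family-wise error rate and controlling the false discovery rate.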

Return type:

tuple[DataFrame, DataFrame, DataFrame]

Returns:

A tuple of three matrices covering every variable pair: the correlation coefficients, the raw (uncorrected) p-values, and a boolean matrix marking which pairs are significant after correction.
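A common follow-up is to list the significant pairs once each. The sketch below uses small stand-in DataFrames rather than real output, and it assumes that the returned matrices are square with variable names as both index and columns; that layout is not confirmed by this page. The helper `significant_pairs` and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in data with made-up values, shaped like the assumed return layout.
names = ["age", "bmi", "glucose"]  # hypothetical variable names
corr = pd.DataFrame(
    [[1.0, 0.42, 0.05], [0.42, 1.0, -0.61], [0.05, -0.61, 1.0]],
    index=names, columns=names,
)
sig = pd.DataFrame(
    [[False, True, False], [True, False, True], [False, True, False]],
    index=names, columns=names,
)


def significant_pairs(corr, sig):
    """Return (var_a, var_b, r) for each significant pair, once per pair."""
    # The strict upper triangle skips the diagonal and mirrored duplicates.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    rows, cols = np.nonzero(sig.to_numpy(dtype=bool) & upper)
    return [
        (corr.index[i], corr.columns[j], float(corr.iat[i, j]))
        for i, j in zip(rows, cols)
    ]


pairs = significant_pairs(corr, sig)
print(pairs)  # [('age', 'bmi', 0.42), ('bmi', 'glucose', -0.61)]
```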

Examples

>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.ehrdata_blobs(n_variables=10, n_centers=5, n_observations=200, base_timepoints=3)
>>> corr, pval, sig = ep.pp.variable_correlations(
...     edata, layer="tem_data", method="pearson", agg="mean", correction_method="fdr_bh", alpha=0.02
... )