ehrapy.tools.ncp


ehrapy.tools.ncp(edata, *, layer, rank=4, n_iter_max=300, sigmoid_transform=False, key_added='ncp', random_state=0, copy=False)

Non-negative CP (PARAFAC) decomposition of a 3D temporal EHR layer.

CP (CANDECOMP/PARAFAC) decomposition factorises a 3-way tensor \(X \in \mathbb{R}^{I \times J \times K}\) into a sum of \(R\) rank-one outer products, where \(R\) is the rank parameter:

\[X \approx \sum_{r=1}^{R} a_r \otimes b_r \otimes c_r\]

where each triplet \((a_r, b_r, c_r)\) is a component:

  • \(a_r \in \mathbb{R}^I\) – patient factor: how strongly each observation expresses component r.

  • \(b_r \in \mathbb{R}^J\) – variable factor: which clinical variables are characteristic of component r.

  • \(c_r \in \mathbb{R}^K\) – temporal factor: how the pattern of component r evolves over the time axis.

The Non-negative variant (NCP) constrains all factors to be \(\geq 0\), which is natural for count-like or probability data and yields parts-based, interpretable components (analogous to NMF for matrices).
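As a rough illustration of the model above, the following NumPy sketch builds a small tensor from three non-negative factor matrices. The sizes and the names A, B, C are made up for the example and are not part of ehrapy; column r of each matrix plays the role of \(a_r\), \(b_r\), \(c_r\).

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 5, 4, 3, 2  # patients, variables, time points, components (illustrative sizes)

# Non-negative factor matrices; column r holds a_r, b_r and c_r respectively.
A = rng.random((I, R))
B = rng.random((J, R))
C = rng.random((K, R))

# Sum over r of the outer products a_r ⊗ b_r ⊗ c_r, written as one einsum.
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)

# The same tensor, built one component at a time to mirror the formula.
X_loop = sum(
    np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
    for r in range(R)
)
```

Because every factor entry is non-negative, each component can only add to the reconstruction, which is what makes the components read as additive "parts".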

Factors are estimated by Multiplicative Updates (Lee & Seung, 2001). Each factor matrix is updated in closed form while the others are held fixed, cycling through the three modes until convergence:

\[F_{\text{mode}} \leftarrow F_{\text{mode}} \odot \frac{\mathcal{X}_{(\text{mode})} \, \mathrm{KR}(F_{-\text{mode}})} {F_{\text{mode}} \, \mathrm{KR}(F_{-\text{mode}})^\top \mathrm{KR}(F_{-\text{mode}}) + \varepsilon}\]

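where \(\mathcal{X}_{(\text{mode})}\) is the matricisation (unfolding) of the tensor along the mode being updated, \(F_{-\text{mode}}\) denotes the two other factor matrices, and \(\mathrm{KR}\) is their Khatri–Rao product. The product \(\odot\) and the fraction are element-wise, and \(\varepsilon\) is a small positive constant that guards against division by zero.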
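The update rule can be sketched in plain NumPy as follows. This is a minimal illustration, not ehrapy's implementation: the helper names khatri_rao, unfold and ncp_mu are hypothetical, the factors are initialised with uniform random values, and the loop runs a fixed number of sweeps instead of checking convergence.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product of U (m, R) and V (n, R), shape (m * n, R)."""
    return np.einsum("ir,jr->ijr", U, V).reshape(-1, U.shape[1])

def unfold(X, mode):
    """Matricise X along `mode`: that axis becomes the rows."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def ncp_mu(X, rank, n_iter=200, eps=1e-12, seed=0):
    """Non-negative CP of a 3-way tensor via multiplicative updates (sketch)."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            # Khatri-Rao product of the two factors that are held fixed,
            # ordered to match the column layout produced by `unfold`.
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            numerator = unfold(X, mode) @ kr
            denominator = factors[mode] @ (kr.T @ kr) + eps
            # Element-wise update; non-negativity is preserved automatically.
            factors[mode] = factors[mode] * numerator / denominator
    return factors

# Synthetic non-negative tensor with exact rank 2.
rng = np.random.default_rng(1)
A_true, B_true, C_true = rng.random((6, 2)), rng.random((5, 2)), rng.random((4, 2))
X = np.einsum("ir,jr,kr->ijk", A_true, B_true, C_true)

A, B, C = ncp_mu(X, rank=2)
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

Since each update multiplies the current factor by a non-negative ratio, a non-negative initialisation stays non-negative without any explicit projection step.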

Parameters:
  • edata (EHRData) – Central data object.

  • layer (str) – Key of the 3D layer to decompose (shape n_obs × n_vars × n_time). All values must be non-negative: set sigmoid_transform=True for layers of logits, or apply np.abs or clipping beforehand.

  • rank (int, default: 4) – Number of components (rank of the decomposition). Each component describes one co-occurring patient sub-group, variable signature, and temporal trajectory.

  • n_iter_max (int, default: 300) – Maximum number of multiplicative-update iterations. 300 is sufficient for most datasets; increase if the error has not converged (check edata.uns[key_added]["params"]).

  • sigmoid_transform (bool, default: False) – If True, apply a sigmoid transformation to the layer before decomposition. Useful when the layer contains raw logits.

  • key_added (str, default: 'ncp') –

    Key prefix for storing results. Results are stored as:

    • edata.obsm["X_{key_added}"] — patient factors, shape (n_obs, rank).

    • edata.varm["{key_added}_loadings"] — variable factors, shape (n_vars, rank).

    • edata.uns["{key_added}"]["temporal_factors"] — temporal factors, shape (n_time, rank).

  • random_state (int, default: 0) – Random seed for the factor initialisation.

  • copy (bool, default: False) – Whether to return a copy rather than modifying in place.
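The non-negativity requirement on the layer can be met in several ways, as the layer and sigmoid_transform entries note. The sketch below compares the three options on a made-up array of logits; the array and the sigmoid helper are illustrative assumptions and may differ from how ehrapy applies the transform internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical (n_obs, n_vars, n_time) layer of raw logits, with negative entries.
logits = rng.normal(size=(4, 3, 5))

def sigmoid(x):
    """Logistic function, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

probs = sigmoid(logits)             # keeps ordering, values in (0, 1)
clipped = np.clip(logits, 0, None)  # sets negative entries to 0
absolute = np.abs(logits)           # discards the sign
```

The sigmoid keeps the ordering of the values, whereas clipping and np.abs discard information about negative entries, so the choice depends on what the layer represents.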

Return type:

EHRData | None

Returns:

None if copy=False, else a modified copy of edata.

Examples

>>> import ehrdata as ed, ehrapy as ep
>>> edata = ed.dt.ehrdata_blobs(n_variables=8, n_centers=3, n_observations=30, base_timepoints=12)
>>> ep.tl.ncp(edata, layer="tem_data", rank=3, sigmoid_transform=True)
>>> edata.obsm["X_ncp"].shape
(30, 3)
>>> edata.varm["ncp_loadings"].shape
(8, 3)
>>> edata.uns["ncp"]["temporal_factors"].shape
(12, 3)