ehrapy.tools.ncp
- ehrapy.tools.ncp(edata, *, layer, rank=4, n_iter_max=300, sigmoid_transform=False, key_added='ncp', random_state=0, copy=False)
Non-negative CP (PARAFAC) decomposition of a 3D temporal EHR layer.
CP (CANDECOMP/PARAFAC) decomposition factorises a 3-way tensor \(X \in \mathbb{R}^{I \times J \times K}\) into a sum of \(R\) outer products, where \(R\) is the rank parameter:
\[X \approx \sum_{r=1}^{R} a_r \otimes b_r \otimes c_r\]
where each triplet \((a_r, b_r, c_r)\) is a component:
\(a_r \in \mathbb{R}^I\) — patient factor: how strongly each observation expresses component r.
\(b_r \in \mathbb{R}^J\) — variable factor: which clinical variables are characteristic of component r.
\(c_r \in \mathbb{R}^K\) — temporal factor: how the pattern of component r evolves over the time axis.
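To make the sum of outer products concrete, the following is a minimal NumPy sketch that rebuilds a tensor from three factor matrices. The matrices `A`, `B`, `C` and the sizes are illustrative assumptions, not objects produced by ehrapy; their columns play the roles of \(a_r\), \(b_r\) and \(c_r\).

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 5, 4, 3, 2  # patients, variables, time points, components (made-up sizes)

A = rng.random((I, R))  # patient factors, column r is a_r
B = rng.random((J, R))  # variable factors, column r is b_r
C = rng.random((K, R))  # temporal factors, column r is c_r

# Sum over r of the outer products a_r (x) b_r (x) c_r, written as one einsum.
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)

# The same reconstruction as an explicit loop over components.
X_loop = sum(
    np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
    for r in range(R)
)
```

Both expressions give an array of shape `(I, J, K)`; the einsum form is simply the compact way of writing the loop.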
The Non-negative variant (NCP) constrains all factors to be \(\geq 0\), which is natural for count-like or probability data and yields parts-based, interpretable components (analogous to NMF for matrices).
Factors are estimated by Multiplicative Updates (Lee & Seung, 2001). Each factor matrix is updated in closed form while the others are held fixed, cycling through the three modes until convergence:
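\[F_{\text{mode}} \leftarrow F_{\text{mode}} \odot \frac{\mathcal{X}_{(\text{mode})} \, \mathrm{KR}(F_{-\text{mode}})} {F_{\text{mode}} \, \mathrm{KR}(F_{-\text{mode}})^\top \mathrm{KR}(F_{-\text{mode}}) + \varepsilon}\]
where \(\mathcal{X}_{(\text{mode})}\) is the matricisation (unfolding) of the tensor along the mode being updated, \(F_{-\text{mode}}\) denotes the other two factor matrices, \(\mathrm{KR}\) is their Khatri–Rao product, and the multiplication \(\odot\) and the division are element-wise. The small constant \(\varepsilon\) guards against division by zero.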
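The update rule above can be sketched in plain NumPy. This is an illustrative sketch under stated assumptions, not ehrapy's implementation: the helper name `ncp_multiplicative_updates`, the uniform random initialisation, the fixed iteration count and the value of `eps` are all hypothetical. The numerator is computed with `einsum` instead of forming the unfolding explicitly, and the denominator uses the identity that \(\mathrm{KR}^\top \mathrm{KR}\) equals the element-wise product of the two Gram matrices.

```python
import numpy as np

def ncp_multiplicative_updates(X, rank, n_iter=200, eps=1e-12, seed=0):
    """Hypothetical helper: non-negative CP of a 3-way array via multiplicative updates."""
    rng = np.random.default_rng(seed)
    # Non-negative random initialisation, one (dim, rank) matrix per mode.
    factors = [rng.random((dim, rank)) for dim in X.shape]
    # einsum patterns that evaluate X_(mode) @ KR(other factors) for each mode.
    specs = ["ijk,jr,kr->ir", "ijk,ir,kr->jr", "ijk,ir,jr->kr"]
    for _ in range(n_iter):
        for mode in range(3):
            others = [f for m, f in enumerate(factors) if m != mode]
            numer = np.einsum(specs[mode], X, *others)
            # KR^T KR is the Hadamard product of the Gram matrices of the other factors.
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            denom = factors[mode] @ gram + eps
            factors[mode] *= numer / denom  # element-wise update keeps entries >= 0
    return factors

# Usage on a synthetic non-negative tensor of exact rank 2.
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((6, 2)), rng.random((5, 2)), rng.random((4, 2))
X = np.einsum("ir,jr,kr->ijk", A0, B0, C0)

A, B, C = ncp_multiplicative_updates(X, rank=2)
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

Because every quantity in the update is non-negative, the factors stay non-negative without any explicit projection step, which is the main appeal of this scheme.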
- Parameters:
edata (EHRData) – Central data object.
layer (str) – Key of the 3D layer to decompose (shape n_obs × n_vars × n_time). All values must be non-negative (use sigmoid_transform=True for logit layers, or np.abs / clipping beforehand).
rank (int, default: 4) – Number of components (rank of the decomposition). Each component describes one co-occurring patient sub-group, variable signature, and temporal trajectory.
n_iter_max (int, default: 300) – Maximum number of multiplicative-update iterations. 300 is sufficient for most datasets; increase if the error has not converged (check edata.uns[key_added]["params"]).
sigmoid_transform (bool, default: False) – If True, apply a sigmoid transformation to the layer before decomposition. Useful when the layer contains raw logits.
key_added (str, default: 'ncp') – Key prefix for storing results. Results are stored as:
  - edata.obsm["X_{key_added}"] — patient factors, shape (n_obs, rank).
  - edata.varm["{key_added}_loadings"] — variable factors, shape (n_vars, rank).
  - edata.uns["{key_added}"]["temporal_factors"] — temporal factors, shape (n_time, rank).
random_state (int, default: 0) – Random seed for the factor initialisation.
copy (bool, default: False) – Whether to return a copy rather than modifying in place.
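As a rough illustration of why a sigmoid transform satisfies the non-negativity requirement on the layer, the sketch below maps a small array of logits into the interval (0, 1). The helper `sigmoid` and the array `logits` are hypothetical; how ehrapy applies the transform internally is not stated here.

```python
import numpy as np

def sigmoid(x):
    """Hypothetical helper: numerically stable logistic function 1 / (1 + exp(-x))."""
    # log(1 + exp(-x)) computed stably via logaddexp, then exponentiated.
    return np.exp(-np.logaddexp(0.0, -x))

rng = np.random.default_rng(0)
logits = rng.normal(size=(30, 8, 12))  # stand-in for an n_obs x n_vars x n_time layer

probs = sigmoid(logits)
all_non_negative = bool((probs >= 0).all())
```

Raw logits can be negative, so they cannot be decomposed directly; after the transform every entry is a value between 0 and 1 and the shape of the array is unchanged.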
- Return type:
- Returns:
None if copy=False, else a modified copy of edata.
Examples
>>> import ehrdata as ed, ehrapy as ep
>>> edata = ed.dt.ehrdata_blobs(n_variables=8, n_centers=3, n_observations=30, base_timepoints=12)
>>> ep.tl.ncp(edata, layer="tem_data", rank=3, sigmoid_transform=True)
>>> edata.obsm["X_ncp"].shape
(30, 3)
>>> edata.varm["ncp_loadings"].shape
(8, 3)
>>> edata.uns["ncp"]["temporal_factors"].shape
(12, 3)
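One possible way to read the stored factors is sketched below with plain NumPy. The random arrays are stand-ins with the same shapes as `edata.obsm["X_ncp"]` and `edata.varm["ncp_loadings"]` in the example above; the names `patient_factors`, `variable_loadings`, `dominant` and `top_vars` are illustrative assumptions, and picking the dominant component by `argmax` is just one simple heuristic.

```python
import numpy as np

rng = np.random.default_rng(0)
patient_factors = rng.random((30, 3))   # stand-in for edata.obsm["X_ncp"]
variable_loadings = rng.random((8, 3))  # stand-in for edata.varm["ncp_loadings"]

# Component that each patient expresses most strongly.
dominant = patient_factors.argmax(axis=1)

# Indices of the three highest-loading variables per component
# (rows: components, columns: variables ordered from highest loading down).
top_vars = np.argsort(-variable_loadings, axis=0)[:3].T
```

Since all factors are non-negative, larger values can be read directly as stronger membership or stronger contribution, without worrying about sign cancellations.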