ehrapy.tools.cox_ph

ehrapy.tools.cox_ph(adata, duration_col, event_col=None, *, uns_key='cox_ph', alpha=0.05, label=None, baseline_estimation_method='breslow', penalizer=0.0, l1_ratio=0.0, strata=None, n_baseline_knots=4, knots=None, breakpoints=None, weights_col=None, cluster_col=None, entry_col=None, robust=False, formula=None, batch_mode=None, show_progress=False, initial_point=None, fit_options=None)

Fit a Cox proportional hazards model for the survival function.

The Cox proportional hazards model (CoxPH) examines the relationship between the survival time of subjects and one or more predictor variables. It models the hazard rate as a product of a baseline hazard function and an exponential function of the predictors, assuming proportional hazards over time. The results will be stored in the .uns slot of the AnnData object under the key ‘cox_ph’ unless specified otherwise in the uns_key parameter.

See https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxPHFitter.html
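
The model form described above can be sketched in plain Python. This is only an illustration of the proportional hazards idea, not how ehrapy or lifelines compute anything internally; the helper `hazard`, the coefficients, and the baseline hazard values are all made up for the example.

```python
import math


def hazard(baseline_hazard, coefficients, covariates):
    """Cox model form: h(t | x) = h0(t) * exp(sum_i beta_i * x_i)."""
    linear_predictor = sum(b * x for b, x in zip(coefficients, covariates))
    return baseline_hazard * math.exp(linear_predictor)


# Hypothetical coefficients for two covariates (illustrative, not fitted values).
beta = [0.5, -0.2]

# Two hypothetical subjects that differ only in the first covariate.
subject_a = [1.0, 3.0]
subject_b = [0.0, 3.0]

# Hypothetical baseline hazard values h0(t) at three time points.
baseline = [0.01, 0.02, 0.05]

# The baseline hazard cancels in the ratio, so the hazard ratio between the
# two subjects is the same at every time point: exp(0.5 * (1.0 - 0.0)).
ratios = [hazard(h0, beta, subject_a) / hazard(h0, beta, subject_b) for h0 in baseline]
```

Because the ratio does not depend on the time point, the two hazard curves are proportional, which is the assumption the model is named after.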

Parameters:
  • adata (AnnData) – AnnData object.

  • duration_col (str) – The name of the column in the AnnData object that contains the subjects’ lifetimes.

  • event_col (str, default: None) – The name of the column in the AnnData object that specifies whether the event has been observed, or censored. Column values are True if the event was observed, False if the event was lost (right-censored). If left None, all individuals are assumed to be uncensored.

  • uns_key (str, default: 'cox_ph') – The key to use for the .uns slot in the AnnData object.

  • alpha (float, default: 0.05) – The significance level used for the confidence intervals; the default of 0.05 yields 95% intervals.

  • label (str | None, default: None) – The name of the column of the estimate.

  • baseline_estimation_method (Literal['breslow', 'spline', 'piecewise'], default: 'breslow') – The method used to estimate the baseline hazard. Options are ‘breslow’, ‘spline’, and ‘piecewise’.

  • penalizer (float | ndarray, default: 0.0) – Attach a penalty to the size of the coefficients during regression. This improves stability of the estimates and controls for high correlation between covariates.

  • l1_ratio (float, default: 0.0) – Specify what ratio to assign to a L1 vs L2 penalty. Same as scikit-learn. See penalizer above.

  • strata (list[str] | str | None, default: None) – Specify a list of columns to use in stratification. This is useful if a categorical covariate does not obey the proportional hazard assumption. It works similarly to the strata expression in R. See http://courses.washington.edu/b515/l17.pdf.

  • n_baseline_knots (int, default: 4) – Used when baseline_estimation_method="spline". Set the number of knots (interior & exterior) in the baseline hazard, which will be placed evenly along the time axis. Should be at least 2. Royston et al., the authors of this model, suggest 4 to start, but any value between 2 and 8 is reasonable. If you need to customize the timestamps used to calculate the curve, use the knots parameter instead.

  • knots (list[float] | None, default: None) – When baseline_estimation_method="spline", this allows customizing the points in the time axis for the baseline hazard curve. To use evenly-spaced points in time, the n_baseline_knots parameter can be employed instead.

  • breakpoints (list[float] | None, default: None) – Used when baseline_estimation_method="piecewise". Set the positions of the baseline hazard breakpoints.

  • weights_col (str | None, default: None) – The name of the column in DataFrame that contains the weights for each subject.

  • cluster_col (str | None, default: None) – The name of the column in DataFrame that contains the cluster variable. Using this forces the sandwich estimator (robust variance estimator) to be used.

  • entry_col (str, default: None) – Column denoting when a subject entered the study, i.e. left-truncation.

  • robust (bool, default: False) – Compute the robust errors using the Huber sandwich estimator, aka Wei-Lin estimate. This does not handle ties, so if there is a high number of ties, results may differ significantly.

  • formula (str, default: None) – A Wilkinson formula, like in R and statsmodels, for the right-hand side. If left as None, all columns not assigned as durations, weights, etc. are used. Uses the library Formulaic for parsing.

  • batch_mode (bool, default: None) – Enabling batch_mode can be faster for datasets with a large number of ties. If left as None, lifelines will choose the best option.

  • show_progress (bool, default: False) – Since the fitter is iterative, show convergence diagnostics. Useful if convergence is failing.

  • initial_point (ndarray | None, default: None) – Set the starting point for the iterative solver.

  • fit_options (dict | None, default: None) – Additional keyword arguments to pass into the estimator.
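
To illustrate how the alpha parameter relates to interval width, here is a minimal sketch assuming a normal-approximation (Wald) interval on the log-hazard scale. The helper `hazard_ratio_interval` and the coefficient and standard error values are hypothetical and are not part of the ehrapy or lifelines API.

```python
import math
from statistics import NormalDist


def hazard_ratio_interval(coef, se, alpha=0.05):
    """Return a (lower, upper) interval for exp(coef) at level 1 - alpha.

    Assumes a Wald interval: coef +/- z * se, exponentiated to the
    hazard-ratio scale.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.exp(coef - z * se), math.exp(coef + z * se)


# Hypothetical coefficient and standard error (not taken from a real fit).
coef, se = 0.4, 0.1

interval_95 = hazard_ratio_interval(coef, se, alpha=0.05)
interval_99 = hazard_ratio_interval(coef, se, alpha=0.01)
```

A smaller alpha gives a wider interval: the interval computed with alpha=0.01 contains the one computed with alpha=0.05.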

Return type:

CoxPHFitter

Returns:

Fitted CoxPHFitter.

Examples

>>> import ehrapy as ep
>>> import numpy as np
>>> adata = ep.dt.mimic_2(encoded=False)
>>> # Flip 'censor_flg' because 0 = death and 1 = censored
>>> adata[:, ["censor_flg"]].X = np.where(adata[:, ["censor_flg"]].X == 0, 1, 0)
>>> cph = ep.tl.cox_ph(adata, "mort_day_censored", "censor_flg")