ehrapy.preprocessing.pca

ehrapy.preprocessing.pca(data, *, n_comps=None, zero_center=True, svd_solver='arpack', random_state=0, return_info=False, dtype='float32', layer=None, copy=False, chunked=False, chunk_size=None)

Computes a principal component analysis.

Computes PCA coordinates, loadings, and the variance decomposition. Uses the scikit-learn implementation.
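To make the three outputs concrete, here is a minimal NumPy sketch of a PCA computed from the SVD of the centered data matrix. It is an illustration only and not the code path that ehrapy or scikit-learn follows; the helper name `pca_via_svd` and the random test data are assumptions made for this example.

```python
import numpy as np


def pca_via_svd(X, n_comps):
    """Illustrative PCA via SVD of the centered matrix (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n_obs = X.shape[0]
    # Zero-center each variable (column).
    Xc = X - X.mean(axis=0)
    # Thin SVD: Xc = U @ diag(S) @ Vt, singular values in descending order.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Coordinates of each observation in the principal-component basis.
    coords = U[:, :n_comps] * S[:n_comps]       # shape (n_obs, n_comps)
    # Loadings: one column per principal component.
    loadings = Vt[:n_comps].T                   # shape (n_vars, n_comps)
    # Variance explained by each component (n_obs - 1 denominator).
    variance = S[:n_comps] ** 2 / (n_obs - 1)
    total_variance = (S ** 2).sum() / (n_obs - 1)
    variance_ratio = variance / total_variance
    return coords, loadings, variance, variance_ratio


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                    # 20 observations, 5 variables
coords, loadings, variance, variance_ratio = pca_via_svd(X, n_comps=3)
```

Projecting the centered data onto the loadings reproduces the coordinates, which is the relationship between the `.obsm` and `.varm` entries described under "Returns".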

Parameters:
  • data (EHRData | AnnData | ndarray | spmatrix) – Central data object.

  • n_comps (int | None, default: None) – Number of principal components to compute. Defaults to 50, or to the minimum dimension size of the selected representation minus 1 if that is smaller.

  • zero_center (bool | None, default: True) – If True, compute standard PCA from the covariance matrix. If False, omit zero-centering of the variables (uses TruncatedSVD), which allows sparse input to be handled efficiently. Passing None decides automatically based on the sparseness of the data.

  • svd_solver (str, default: 'arpack') –

    SVD solver to use:

    • 'arpack' (the default) for the ARPACK wrapper in SciPy (svds()).

    • 'randomized' for the randomized algorithm due to Halko (2009).

    • 'auto' chooses automatically depending on the size of the problem.

    • 'lobpcg' for an alternative SciPy solver.

    Efficient computation of the principal components of a sparse matrix currently only works with the 'arpack' or 'lobpcg' solvers.

  • random_state (int | RandomState | None, default: 0) – Change to use different initial states for the optimization.

  • return_info (bool, default: False) – Only relevant when not passing an EHRData or AnnData object; see "Returns".

  • dtype (str, default: 'float32') – NumPy data type string to which the result is converted.

  • layer (str | None, default: None) – The layer to operate on.

  • copy (bool, default: False) – If an EHRData or AnnData object is passed, determines whether a copy is returned. Ignored otherwise.

  • chunked (bool, default: False) – If True, perform an incremental PCA on segments of chunk_size observations. The incremental PCA automatically zero-centers the data and ignores the random_state and svd_solver settings. If False, perform a full PCA.

  • chunk_size (int | None, default: None) – Number of observations to include in each chunk. Required if chunked=True was passed.
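The default for n_comps described above can be sketched as a small rule. This is a hedged reading of the parameter description, not the library's source: the helper name `default_n_comps` is hypothetical, and the value 50 is taken from the description of n_comps.

```python
def default_n_comps(n_obs, n_vars, default=50):
    """Hypothetical helper: pick the number of components when n_comps is None.

    Uses `default` components unless the smaller data dimension cannot
    support that many, in which case it falls back to (min dimension - 1).
    """
    min_dim = min(n_obs, n_vars)
    return min(default, min_dim - 1)


# A wide-enough matrix keeps the default of 50 components.
large = default_n_comps(n_obs=1000, n_vars=200)
# With only 30 observations, at most 29 components are used.
small = default_n_comps(n_obs=30, n_vars=200)
```

In practice it is usually clearer to pass n_comps explicitly when the data has fewer than about 50 observations or variables, so the number of components does not depend on this fallback.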

Return type:

EHRData | AnnData | ndarray | spmatrix | None

Returns:

If data is array-like and return_info=False was passed, this function returns the PCA representation of data as an array of the same type as the input array.

Otherwise, it returns None if copy=False, or an updated copy of the EHRData or AnnData object if copy=True. The following fields are set:

.obsm['X_pca' | key_added] : csr_matrix | csc_matrix | ndarray (shape (adata.n_obs, n_comps))

PCA representation of data.

.varm['PCs' | key_added] : ndarray (shape (adata.n_vars, n_comps))

The principal components containing the loadings.

.uns['pca' | key_added]['variance_ratio'] : ndarray (shape (n_comps,))

Ratio of explained variance.

.uns['pca' | key_added]['variance'] : ndarray (shape (n_comps,))

Explained variance, equivalent to the eigenvalues of the covariance matrix.
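The link between explained variance and the eigenvalues of the covariance matrix can be checked numerically. The following sketch uses plain NumPy on synthetic data and does not call ehrapy; the variable names and the column scales are assumptions chosen for illustration. It assumes the sample covariance with an n_obs - 1 denominator, and a variance ratio taken relative to the total variance of all variables.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 observations, 4 variables with deliberately different spreads.
X = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

# Sample covariance matrix of the variables (columns), n_obs - 1 denominator.
cov = np.cov(X, rowvar=False)

# eigvalsh returns eigenvalues in ascending order; reverse to descending.
eigvals = np.linalg.eigvalsh(cov)[::-1]

n_comps = 2
variance = eigvals[:n_comps]                 # explained variance per component
variance_ratio = variance / eigvals.sum()    # share of the total variance

# Cross-check against the singular values of the centered data matrix.
singular_values = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
variance_from_svd = singular_values[:n_comps] ** 2 / (X.shape[0] - 1)
```

Because only the first n_comps eigenvalues are kept, the entries of the variance ratio sum to less than 1 whenever the discarded components carry some variance.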