ehrapy.preprocessing.pca
- ehrapy.preprocessing.pca(data, *, n_comps=None, zero_center=True, svd_solver='arpack', random_state=0, mask_var=None, return_info=False, dtype='float32', layer=None, copy=False, chunked=False, chunk_size=None)
Computes a principal component analysis.
Computes PCA coordinates, loadings and variance decomposition. Uses the implementation of scikit-learn.
- Parameters:
data (EHRData | AnnData | ndarray | spmatrix) – Central data object.
n_comps (int | None, default: None) – Number of principal components to compute. Defaults to 50, or to the minimum dimension size of the selected representation minus 1 if that is smaller.
zero_center (bool | None, default: True) – If True, compute standard PCA from the covariance matrix. If False, omit zero-centering of variables (uses TruncatedSVD), which allows sparse input to be handled efficiently. Passing None decides automatically based on the sparseness of the data.
svd_solver (str, default: 'arpack') – SVD solver to use:
- 'arpack' (the default) for the ARPACK wrapper in SciPy (svds()).
- 'randomized' for the randomized algorithm due to Halko (2009).
- 'auto' chooses automatically depending on the size of the problem.
- 'lobpcg' for an alternative SciPy solver.
Efficient computation of the principal components of a sparse matrix currently only works with the 'arpack' or 'lobpcg' solvers.
random_state (int | RandomState | None, default: 0) – Change to use different initial states for the optimization.
return_info (bool, default: False) – Only relevant when not passing an EHRData or AnnData object: see "Returns".
mask_var (ndarray[tuple[Any, ...], dtype[bool]] | str | None, default: None) – To run only on a certain set of variables given by a boolean array or a string referring to an array in var. By default, uses .var['highly_variable'] if available, else everything.
dtype (str, default: 'float32') – Numpy data type string to which to convert the result.
layer (str | None, default: None) – The layer to operate on.
copy (bool, default: False) – If an EHRData or AnnData object is passed, determines whether a copy is returned. Is ignored otherwise.
chunked (bool, default: False) – If True, perform an incremental PCA on segments of chunk_size. The incremental PCA automatically zero-centers and ignores the settings of random_state and svd_solver. If False, perform a full PCA.
chunk_size (int | None, default: None) – Number of observations to include in each chunk. Required if chunked=True was passed.
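The n_comps default and the chunked / chunk_size pair can be read as simple rules. The sketch below is only an illustration of those rules as described in the parameter list. The helpers resolve_n_comps and iter_chunks are hypothetical names and are not part of ehrapy, and the reading of the default as "50, capped at the minimum dimension size minus 1" is an assumption based on the description of n_comps.

```python
def resolve_n_comps(n_obs, n_vars, n_comps=None, default=50):
    """Hypothetical helper: pick the number of components.

    Assumes the documented default means min(default, min(n_obs, n_vars) - 1).
    """
    if n_comps is not None:
        return n_comps
    min_dim = min(n_obs, n_vars)
    return min(default, min_dim - 1)


def iter_chunks(n_obs, chunk_size):
    """Hypothetical helper: yield (start, end) row ranges of at most chunk_size rows."""
    for start in range(0, n_obs, chunk_size):
        yield start, min(start + chunk_size, n_obs)


# A data set with 30 observations and 200 variables is capped below 50 components.
small = resolve_n_comps(n_obs=30, n_vars=200)
# With chunk_size=4, ten observations are split into three segments.
segments = list(iter_chunks(n_obs=10, chunk_size=4))
```

An incremental PCA would then be fitted on one such segment at a time, which is why chunk_size has to be given whenever chunked=True.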
- Return type:
- Returns:
If data is array-like and return_info=False was passed, this function returns the PCA representation of data as an array of the same type as the input array.
Otherwise, it returns None if copy=False, else an updated AnnData object. Sets the following fields:
- .obsm['X_pca' | key_added]
csr_matrix | csc_matrix | ndarray (shape (adata.n_obs, n_comps)): PCA representation of data.
- .varm['PCs' | key_added]
ndarray (shape (adata.n_vars, n_comps)): The principal components containing the loadings.
- .uns['pca' | key_added]['variance_ratio']
ndarray (shape (n_comps,)): Ratio of explained variance.
- .uns['pca' | key_added]['variance']
ndarray (shape (n_comps,)): Explained variance, equivalent to the eigenvalues of the covariance matrix.
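To make the relationship between the returned fields concrete, the following NumPy-only sketch computes quantities analogous to X_pca, PCs, variance and variance_ratio from the eigendecomposition of the covariance matrix. It does not call ehrapy or scikit-learn and is not their implementation. The toy data, the variable names and the choice of n_comps=3 are assumptions made for illustration, and component signs may differ from what a given solver returns.

```python
import numpy as np

# Toy data: 100 observations, 5 variables with different scales (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])
n_comps = 3

# Zero-center each variable, as standard PCA does.
X_centered = X - X.mean(axis=0)

# Covariance matrix of the variables (normalized by n_obs - 1).
cov = np.cov(X_centered, rowvar=False)

# eigh returns eigenvalues in ascending order; keep the n_comps largest.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1][:n_comps]

variance = eigvals[order]                   # analogous to .uns['pca']['variance']
variance_ratio = variance / eigvals.sum()   # analogous to .uns['pca']['variance_ratio']
pcs = eigvecs[:, order]                     # analogous to .varm['PCs'], shape (n_vars, n_comps)
x_pca = X_centered @ pcs                    # analogous to .obsm['X_pca'], shape (n_obs, n_comps)
```

The sample variance of each column of x_pca equals the corresponding entry of variance, which is the sense in which the explained variance is equivalent to the eigenvalues of the covariance matrix.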