ehrapy.tools.ingest

ehrapy.tools.ingest(edata, edata_ref, *, obs=None, embedding_method=('umap', 'pca'), labeling_method='knn', neighbors_key=None, inplace=True, **kwargs)

Map labels and embeddings from reference data to new data.

Integrates embeddings and annotations of edata with a reference dataset edata_ref by projecting onto a PCA (or alternative model) that has been fitted on the reference data. The function uses a kNN classifier to map labels and the UMAP package [MHM18] to map embeddings.
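
The label-mapping idea can be pictured with a small standalone sketch. This is not ehrapy's implementation: it assumes the reference and new observations have already been projected into the same low-dimensional space (the projection step is omitted), and the function name knn_transfer_labels, the toy coordinates, and the labels "unit_a" and "unit_b" are hypothetical.

```python
from collections import Counter
from math import dist


def knn_transfer_labels(ref_points, ref_labels, new_points, k=3):
    """Give each new point the majority label among its k nearest reference points."""
    mapped = []
    for point in new_points:
        # Indices of the k reference points closest to this point (Euclidean distance)
        nearest = sorted(
            range(len(ref_points)), key=lambda i: dist(point, ref_points[i])
        )[:k]
        # Majority vote over the labels of those neighbors
        votes = Counter(ref_labels[i] for i in nearest)
        mapped.append(votes.most_common(1)[0][0])
    return mapped


# Hypothetical reference observations, already in a shared low-dimensional space
ref_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
ref_labels = ["unit_a", "unit_a", "unit_a", "unit_b", "unit_b", "unit_b"]

# Hypothetical new observations without labels
new_points = [(0.05, 0.1), (4.9, 5.0)]

mapped_labels = knn_transfer_labels(ref_points, ref_labels, new_points)
```

Only the reference data determines the labels here; the new observations never influence each other or the reference. That is the asymmetry the description refers to.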

Note

We refer to this asymmetric dataset integration as ingesting annotations from reference data to new data. This is different from learning a joint representation that integrates both datasets in an unbiased way, as CCA (e.g. in Seurat) or a conditional VAE (e.g. in scVI) would do.

You need to run neighbors() on edata_ref before passing it.

Parameters:
  • edata (EHRData | AnnData) – Central data object. This is the new data that receives the mapped labels and embeddings.

  • edata_ref (EHRData | AnnData) – The annotated reference data matrix of shape n_obs × n_vars. Rows correspond to observations and columns to features. The variables (n_vars and var_names) of edata_ref should be the same as in edata. This is the dataset whose labels and embeddings are mapped to edata.

  • obs (str | Iterable[str] | None, default: None) – Keys of the labels in edata_ref.obs that should be mapped to edata.obs (inferred for the observations of edata).

  • embedding_method (str | Iterable[str], default: ('umap', 'pca')) – Embeddings in edata_ref that should be mapped to edata. The only supported values are ‘umap’ and ‘pca’.

  • labeling_method (str, default: 'knn') – The method to map labels in edata_ref.obs to edata.obs. The only supported value is ‘knn’.

  • neighbors_key (str | None, default: None) – If not specified, ingest looks in edata_ref.uns[‘neighbors’] for neighbors settings and in edata_ref.obsp[‘distances’] for distances (the default storage places for pp.neighbors). If specified, ingest looks in edata_ref.uns[neighbors_key] for neighbors settings and in edata_ref.obsp[edata_ref.uns[neighbors_key][‘distances_key’]] for distances.

  • inplace (bool, default: True) – If True, add labels and embeddings to the passed edata. If False, return a copy of edata with mapped embeddings and labels.

  • **kwargs – Further keyword arguments for the neighbor calculation.
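
The lookup rule described for neighbors_key can be sketched as follows. This is an illustration of the documented rule rather than ehrapy's code: plain dictionaries stand in for edata_ref.uns and edata_ref.obsp, and the helper name resolve_neighbors, the key "my_neighbors", and all stored values are hypothetical.

```python
def resolve_neighbors(uns, obsp, neighbors_key=None):
    """Return (settings, distances) following the documented neighbors_key rule."""
    if neighbors_key is None:
        # Default storage places used by pp.neighbors
        return uns["neighbors"], obsp["distances"]
    # Custom key: the settings entry names where the distances are stored
    settings = uns[neighbors_key]
    return settings, obsp[settings["distances_key"]]


# Plain dicts standing in for edata_ref.uns and edata_ref.obsp (hypothetical contents)
uns = {
    "neighbors": {"n_neighbors": 15},
    "my_neighbors": {"distances_key": "my_neighbors_distances", "n_neighbors": 5},
}
obsp = {
    "distances": "default-distances",
    "my_neighbors_distances": "custom-distances",
}

default_settings, default_distances = resolve_neighbors(uns, obsp)
custom_settings, custom_distances = resolve_neighbors(
    uns, obsp, neighbors_key="my_neighbors"
)
```

With a custom key, the distances are found indirectly through the ‘distances_key’ entry of the settings, so both entries must be present in edata_ref.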

Return type:

EHRData | AnnData | None

Returns:

  • If inplace=False, returns a copy of edata with mapped embeddings and labels in obsm and obs, respectively.

  • If inplace=True, returns None and updates edata.obsm and edata.obs with mapped embeddings and labels.

Examples

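>>> import ehrapy as ep
>>> # edata_ref: annotated reference data; edata: new data with the same variables
>>> ep.pp.neighbors(edata_ref)
>>> ep.tl.umap(edata_ref)
>>> # Map the "service_unit" labels and the embeddings from edata_ref to edata
>>> ep.tl.ingest(edata, edata_ref, obs="service_unit")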