Note

This page was generated from out_of_core.ipynb.

Using ehrapy with Large Datasets

Modern health datasets can become very large. When a dataset no longer fits into a computer's memory at once, the data must be loaded and processed in batches. This is known as computing "out-of-core".

Dask is a popular out-of-core, distributed array processing library that ehrapy is beginning to support. Here we show how dask support in ehrapy can reduce the memory consumption of a simple ehrapy processing workflow.
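As a minimal illustration of the out-of-core idea (independent of ehrapy; the array sizes here are made up for the sketch), a Dask array splits the data into chunks and only performs the computation when a result is explicitly requested:

```python
import dask.array as da
import numpy as np

# A 20,000 x 500 array split into 1,000-row chunks; nothing is computed yet.
x = da.random.random((20_000, 500), chunks=(1_000, 500))

# Operations build a lazy task graph instead of allocating new arrays.
centered_col_means = (x - x.mean(axis=0)).mean(axis=0)

# Only .compute() materializes the (small) result, processing chunk by chunk.
result = centered_col_means.compute()
print(result.shape)  # (500,)
```

Because intermediate results are held one chunk at a time, peak memory stays close to the chunk size rather than the full array size.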

🔪 Beware sharp edges! 🔪

Dask support in ehrapy is new and highly experimental!

Many functions in ehrapy do not support Dask and may exhibit unexpected behaviour if Dask arrays are passed to them. Stick to what's outlined in this tutorial and you should be fine!

Please report any issues you run into over on the issue tracker.

Example Use Case

We will use a synthetic dataset of 50,000 samples and 1,000 features, with 4 distinct groups underlying the data generation process.

We can then profile the required time and memory consumption of two runs for processing this data:

  1. In memory (which is feasible with our demo dataset)

  2. Out-of-core
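To get a sense of scale, a rough back-of-the-envelope calculation (assuming float64 data, and the chunk size of 1000 used in the out-of-core script below) shows how much smaller a single chunk is than the full matrix:

```python
n_individuals = 50_000
n_features = 1_000
chunk_rows = 1_000  # matches chunks=1000 in the out-of-core script
bytes_per_float64 = 8

# Full matrix held in memory at once vs. one 1000 x 1000 chunk at a time.
full_mb = n_individuals * n_features * bytes_per_float64 / 1e6
chunk_mb = chunk_rows * n_features * bytes_per_float64 / 1e6

print(f"full matrix: {full_mb:.0f} MB")  # full matrix: 400 MB
print(f"one chunk:   {chunk_mb:.0f} MB")  # one chunk:   8 MB
```

Operations that stream over chunks therefore only need a small fraction of the full matrix in memory at any point.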

On this dataset, we

  1. Scale the data to zero mean and unit variance

  2. Compute a PCA

  3. Compute a neighborhood graph in PCA space

  4. Perform clustering on the neighborhood graph

  5. Project the data onto the top two principal components and color the identified clusters for visualization.

Profiled Code

Memory

For the in-memory setting, the following code is used to generate profiling results:

import scalene

scalene.scalene_profiler.stop()

import pandas as pd
from sklearn.datasets import make_blobs
import ehrapy as ep
import anndata as ad

n_individuals = 50000
n_features = 1000
n_groups = 4

data_features, data_labels = make_blobs(
    n_samples=n_individuals, n_features=n_features, centers=n_groups, random_state=42
)

var = pd.DataFrame({"feature_type": ["numeric"] * n_features})

adata = ad.AnnData(X=data_features, obs={"label": data_labels}, var=var)

scalene.scalene_profiler.start()

ep.pp.scale_norm(adata)

ep.pp.pca(adata)

ep.pp.neighbors(adata)

ep.tl.leiden(adata)

ep.pl.pca(adata, color="leiden", save="profiling_memory_pca.png")

scalene.scalene_profiler.stop()

Out-of-core

For the out-of-core setting, the following code is used to generate profiling results:

import scalene

scalene.scalene_profiler.stop()

import dask.array as da
from sklearn.datasets import make_blobs
import ehrapy as ep
import anndata as ad
import pandas as pd

n_individuals = 50000
n_features = 1000
n_groups = 4
chunks = 1000

data_features, data_labels = make_blobs(
    n_samples=n_individuals, n_features=n_features, centers=n_groups, random_state=42
)

data_features = da.from_array(data_features, chunks=chunks)  # lazily chunked into 1000 x 1000 blocks

var = pd.DataFrame({"feature_type": ["numeric"] * n_features})

adata = ad.AnnData(X=data_features, obs={"label": data_labels}, var=var)

scalene.scalene_profiler.start()

ep.pp.scale_norm(adata)

ep.pp.pca(adata)

adata.obsm["X_pca"] = adata.obsm["X_pca"].compute()  # materialize the PCA result; neighbors expects an in-memory array

ep.pp.neighbors(adata)

ep.tl.leiden(adata)

ep.pl.pca(adata, color="leiden", save="profiling_out_of_core_pca.png")

scalene.scalene_profiler.stop()

Optional: Try it Yourself

Click here for instructions on how to reproduce the profiling results yourself.

Workflow:

  1. Setup: The results shown in this notebook rely on optional dependencies of ehrapy. We also use scalene for profiling. You can install the required tools into your environment with:

pip install "ehrapy[dask]" scalene

  2. Profile runs: Scalene currently requires code to be run as a Python script for a full profile. Copy the above code snippets into two Python files, "profile_memory.py" and "profile_out_of_core.py", respectively. Then, from your command line within this environment, run

scalene --outfile profile_memory.html profile_memory.py

for the in-memory computation and

scalene --outfile profile_out_of_core.html profile_out_of_core.py

for the out-of-core computation.