{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cohort Tracking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The use of non-representative samples of a population in health research gives rise to serious issues, most importantly:\n", "\n", "- conclusions drawn on subgroups (by age, gender, race, ...) may not generalize to other subgroups\n", "- characteristics of underrepresented groups may be masked by data from overrepresented groups\n", "- models, such as clinical algorithms, trained on such samples may pick up, or even amplify, biases in, e.g., clinical decision making\n", "\n", "For studies addressing medical questions, it is necessary to define exclusion and inclusion criteria.\n", "To detect, track and monitor the effects of such criteria on the composition of the study cohort, [Ellen et al.](https://www.medrxiv.org/content/10.1101/2023.10.05.23296611v1.full.pdf) propose a visual aid in the form of a flowchart diagram.\n", "\n", "Here, we show how ehrapy can help to track and visualize key demographics of interest during filtering steps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Environment setup" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "import ehrapy as ep\n", "from tableone import TableOne" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data\n", "\n", "We load the Diabetes 130-Hospitals dataset, which comes with a convenience loader in ehrapy.\n", "\n", "More information on the dataset can be found [here](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008).\n", "We use a version preprocessed by fairlearn; more information on it can be found [here](https://fairlearn.org/main/user_guide/datasets/diabetes_hospital_data.html)." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "adata = ep.dt.diabetes_130_fairlearn(\n", " columns_obs_only=[\"gender\", \"race\", \"time_in_hospital\", \"medicaid\"]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspecting the dataset with `tableone`\n", "\n", "[tableone](https://github.com/tompollard/tableone/) generates summary statistics for a patient population including the proportion of missing values (if any).\n", "`tableone` works on `pandas.DataFrame` objects, and hence interacts seamlessly with the `.obs` field of the `AnnData` object:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | | Missing | Overall |
|---|---|---|---|
| n | | | 101766 |
| gender, n (%) | Female | 0 | 54708 (53.8) |
| | Male | | 47055 (46.2) |
| | Unknown/Invalid | | 3 (0.0) |
| race, n (%) | AfricanAmerican | 0 | 19210 (18.9) |
| | Asian | | 641 (0.6) |
| | Caucasian | | 76099 (74.8) |
| | Hispanic | | 2037 (2.0) |
| | Other | | 1506 (1.5) |
| | Unknown | | 2273 (2.2) |
| time_in_hospital, mean (SD) | | 0 | 4.4 (3.0) |
| medicaid, n (%) | False | 0 | 98234 (96.5) |
| | True | | 3532 (3.5) |