This page was generated from medcat.ipynb. Some tutorial content may look better in light mode.

Extracting from free text with MedCAT#

This tutorial serves as an introduction on how to use ehrapy together with MedCAT. MedCat is a tool to extract medical entities from free text and link it to biomedical ontologies. Biomedical entities could be anything biomedical; not only diagnoses or diseases but also symptoms, drugs or even peptides. It also tries to keep the context of an extracted entitiy (for example, whether a specific disease has been diagnosed or not). This is especially important for electronic health records data, as most of the time doctors notes are simply copied and pasted into the data and not preprocessed in any form. Consider the following example:

  • The patient suffers from diabetes.


  • The patient does not suffer from diabetes.

In detail, ehrapy uses a pretrained and packages model from MedCat ( This model is limited in performance but good enough for this demonstration. A larger (trained) model is planned to be released somewhen in the (near) future by the MedCAT maintainers.

import ehrapy as ep
from import CAT
ep.settings.n_jobs = 2

Download the example data

[ ]:
!wget -nc -P ./medcat_data/
!wget -nc -P ./medcat_data/
--2023-12-12 15:54:01--
Resolving (,,, ...
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3644222 (3,5M) [text/plain]
Saving to: ‘./medcat_data/pt_notes.csv’

pt_notes.csv        100%[===================>]   3,47M  22,4MB/s    in 0,2s

2023-12-12 15:54:01 (22,4 MB/s) - ‘./medcat_data/pt_notes.csv’ saved [3644222/3644222]

--2023-12-12 15:54:02--
Resolving (,,, ...
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 561947681 (536M) [application/zip]
Saving to: ‘./medcat_data/’

medmen_wstatus_2021 100%[===================>] 535,92M  11,1MB/s    in 81s

2023-12-12 15:55:23 (6,65 MB/s) - ‘./medcat_data/’ saved [561947681/561947681]

First, we read the example data into an AnnData object, and perform ehrapy’s encode step.

adata ="medcat_data/pt_notes.csv", columns_obs_only=["text"])
adata_encoded = ep.pp.encode(adata, autodetect=True)
2023-12-14 23:16:13,116 - root INFO - Transformed passed DataFrame into an AnnData object with n_obs x n_vars = `1088` x `8`.
2023-12-14 23:16:13,117 - root INFO - The original categorical values `['category', 'gender']` were added to uns.
2023-12-14 23:16:13,126 - root INFO - Encoding strings in X to save to .h5ad. Loading the file will reverse the encoding.
2023-12-14 23:16:13,127 - root INFO - Updated the original layer after encoding.
2023-12-14 23:16:13,131 - root INFO - The original categorical values `['category', 'gender']` were added to obs.

Prepare CAT object as per MedCAT workflow#

To leverage MedCAT for free text processing in ehrapy, we start by creating a CAT object. This is described in more details in for example the MedCAT tutorials.

Here, we simply load a pretrained model from the maintainers of MedCAT which can be downloaded & readily used. For specific usecases, using more sophisticated refinements might be required. For this tutorial and a quick start, we will use this out-of-the-box model.

We also set some TUI filters. A full list of TUI’s can be found at:

# create the main ehrapy medcat object used for medical entitiy analysis using ehrapy and medcat
cat = CAT.load_model_pack("./medcat_data/")

# only use diseases and behavioural disorders diagnoses in this example by filtering by TUI
tuis=["T047", "T048"]
cui_filters = set()
for type_id in tuis:
cat.cdb.config.linking["filters"]["cuis"] = cui_filters

Using this model pack we can already extract entities from an example note.

text = "He was diagnosed with kidney failure"
doc = cat(text)
(kidney failure,)
# Example output of an extracted medcat entity; note that ehrapy will deal with this automatically and the here displayed manual extraction is not required.
# CUI: Concept Unique Identifier, which is just an unique identifier for each concept extracted
cat.get_entities("He was diagnosed with kidney failure", only_cui=False)
{'entities': {2: {'pretty_name': 'Kidney Failure',
   'cui': 'C0035078',
   'type_ids': ['T047'],
   'types': ['Disease or Syndrome'],
   'source_value': 'kidney failure',
   'detected_name': 'kidney~failure',
   'acc': 1.0,
   'context_similarity': 1.0,
   'start': 22,
   'end': 36,
   'icd10': [],
   'ontologies': [],
   'snomed': [],
   'id': 2,
   'meta_anns': {'Status': {'value': 'Affirmed',
     'confidence': 0.9999961853027344,
     'name': 'Status'}}}},
 'tokens': []}

Extracting and visualizing all disease entities#

To extract all disease entities from our example dataset we require a complete annotation of the dataset. This step is computationally expensive and may take some time.

[21]:, cat, text_column="text", n_proc=2)

The annotated results as extracted by MedCAT are transformed and stored into a Pandas DataFrame in adata.uns.

row_nr pretty_name cui type_ids types meta_anns
0 0 Diabetes C0011847 [T047] [Disease or Syndrome] Affirmed
1 0 Sepsis C0243026 [T047] [Disease or Syndrome] Other
2 0 Respiratory Distress Syndrome, Adult C0035222 [T047] [Disease or Syndrome] Affirmed
3 0 Pulmonary Embolism C0034065 [T047] [Disease or Syndrome] Other
4 0 Respiratory Failure C1145670 [T047] [Disease or Syndrome] Affirmed

We can also get a proper overview for the top 10 most entities found in the data (affirmed diagnoses only).

[24]:'n_patient_visit_percent', ascending=False).head(10)
pretty_name type_ids types n_patient_visit n_patient_visit_percent
C0020538 Hypertensive disease T047 Disease or Syndrome 432 42.394504
C0028754 Obesity T047 Disease or Syndrome 245 24.043180
C0011847 Diabetes T047 Disease or Syndrome 149 14.622179
C0010054 Coronary Arteriosclerosis T047 Disease or Syndrome 127 12.463199
C0011849 Diabetes Mellitus T047 Disease or Syndrome 118 11.579980
C0012634 Disease T047 Disease or Syndrome 110 10.794897
C0038454 Cerebrovascular accident T047 Disease or Syndrome 110 10.794897
C0004096 Asthma T047 Disease or Syndrome 97 9.519136
C0024117 Chronic Obstructive Airway Disease T047 Disease or Syndrome 92 9.028459
C0011570 Mental Depression T048 Mental or Behavioral Dysfunction 91 8.930324

We can add annotated entities as binary columns to adata.obs. From there, it can be used e.g. for plotting or any further analysis involving an annotation in adata.obs. First, we calculate a UMAP embedding using the AnnData object and then we color by all patients that had detected annotations for e.g. Diabetes or Congestive heart failure.

[66]:, name=["Diabetes", "Congestive heart failure"])
text chartdate dob category gender Diabetes Congestive heart failure
0 HISTORY OF PRESENT ILLNESS:, The patient is a ... 2079-01-01 2018-01-01 General Medicine F True False
1 HISTORY OF PRESENT ILLNESS: , A 71-year-old fe... 2079-01-01 2018-01-01 Rheumatology F False False
2 HISTORY OF PRESENT ILLNESS:, The patient is a ... 2079-01-01 2018-01-01 Consult - History and Phy. F True False
3 CHIEF COMPLAINT:,1. Infection.,2. Pelvic pai... 2037-01-01 2018-01-01 Consult - History and Phy. F False False
4 SUBJECTIVE:, This is a 29-year-old Vietnamese... 2037-01-01 2018-01-01 Dermatology F False False

Typos when trying to move annotations to obs will raise a warning with suggestions for the annotation

try:, name=["Diubetes", "Congestive heart failure", "Heart failre", "hello"])
except Exception as e:
Did not find ['Diubetes', 'Heart failre', 'hello'] in MedCAT's extracted entities and added them not to .obs. Do you mean ['Diabetes', 'Heart failure', 'Brucellosis']?

In this example, we consider NaN’s in the annotation to be False, due to the lack of the model’s certainty for this entity.

adata_encoded.obs["Diabetes"] = adata_encoded.obs["Diabetes"].fillna(False).astype(str)
adata_encoded.obs["Congestive heart failure"] = adata_encoded.obs["Congestive heart failure"].fillna(False).astype(str)
ep.pp.neighbors(adata_encoded), color=["Diabetes", "Congestive heart failure"])
[73]:, list(adata_encoded.var_names), groupby="Diabetes")