Note
This page was generated from medcat.ipynb. Some tutorial content may look better in light mode.
Extracting from free text with MedCAT¶
This tutorial serves as an introduction on how to use ehrapy together with MedCAT. MedCat is a tool to extract medical entities from free text and link it to biomedical ontologies. Biomedical entities could be anything biomedical; not only diagnoses or diseases but also symptoms, drugs or even peptides. It also tries to keep the context of an extracted entitiy (for example, whether a specific disease has been diagnosed or not). This is especially important for electronic health records data, as most of the time doctors notes are simply copied and pasted into the data and not preprocessed in any form. Consider the following example:
The patient suffers from diabetes.
vs.
The patient does not suffer from diabetes.
In detail, ehrapy uses a pretrained and packages model from MedCat (https://medcat.readthedocs.io/en/latest/main.html#models). This model is limited in performance but good enough for this demonstration. A larger (trained) model is planned to be released somewhen in the (near) future by the MedCAT maintainers.
[15]:
import ehrapy as ep
from medcat.cat import CAT
[16]:
ep.settings.n_jobs = 2
Download the example data
[ ]:
!wget -nc https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/pt_notes.csv -P ./medcat_data/
!wget -nc https://cogstack-medcat-example-models.s3.eu-west-2.amazonaws.com/medcat-example-models/medmen_wstatus_2021_oct.zip -P ./medcat_data/
--2023-12-12 15:54:01-- https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/pt_notes.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3644222 (3,5M) [text/plain]
Saving to: ‘./medcat_data/pt_notes.csv’
pt_notes.csv 100%[===================>] 3,47M 22,4MB/s in 0,2s
2023-12-12 15:54:01 (22,4 MB/s) - ‘./medcat_data/pt_notes.csv’ saved [3644222/3644222]
--2023-12-12 15:54:02-- https://cogstack-medcat-example-models.s3.eu-west-2.amazonaws.com/medcat-example-models/medmen_wstatus_2021_oct.zip
Resolving cogstack-medcat-example-models.s3.eu-west-2.amazonaws.com (cogstack-medcat-example-models.s3.eu-west-2.amazonaws.com)... 3.5.244.164, 3.5.246.197, 52.95.149.106, ...
Connecting to cogstack-medcat-example-models.s3.eu-west-2.amazonaws.com (cogstack-medcat-example-models.s3.eu-west-2.amazonaws.com)|3.5.244.164|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 561947681 (536M) [application/zip]
Saving to: ‘./medcat_data/medmen_wstatus_2021_oct.zip’
medmen_wstatus_2021 100%[===================>] 535,92M 11,1MB/s in 81s
2023-12-12 15:55:23 (6,65 MB/s) - ‘./medcat_data/medmen_wstatus_2021_oct.zip’ saved [561947681/561947681]
First, we read the example data into an AnnData object, and perform ehrapy’s encode step.
[17]:
adata = ep.io.read_csv("medcat_data/pt_notes.csv", columns_obs_only=["text"])
adata_encoded = ep.pp.encode(adata, autodetect=True)
2023-12-14 23:16:13,116 - root INFO - Transformed passed DataFrame into an AnnData object with n_obs x n_vars = `1088` x `8`.
2023-12-14 23:16:13,117 - root INFO - The original categorical values `['category', 'gender']` were added to uns.
2023-12-14 23:16:13,126 - root INFO - Encoding strings in X to save to .h5ad. Loading the file will reverse the encoding.
2023-12-14 23:16:13,127 - root INFO - Updated the original layer after encoding.
2023-12-14 23:16:13,131 - root INFO - The original categorical values `['category', 'gender']` were added to obs.
Prepare CAT object as per MedCAT workflow¶
To leverage MedCAT for free text processing in ehrapy, we start by creating a CAT
object. This is described in more details in for example the MedCAT tutorials.
Here, we simply load a pretrained model from the maintainers of MedCAT which can be downloaded & readily used. For specific usecases, using more sophisticated refinements might be required. For this tutorial and a quick start, we will use this out-of-the-box model.
We also set some TUI filters. A full list of TUI’s can be found at: https://lhncbc.nlm.nih.gov/ii/tools/MetaMap/Docs/SemanticTypes_2018AB.txt
[18]:
# create the main ehrapy medcat object used for medical entitiy analysis using ehrapy and medcat
cat = CAT.load_model_pack("./medcat_data/medmen_wstatus_2021_oct.zip")
# only use diseases and behavioural disorders diagnoses in this example by filtering by TUI
tuis=["T047", "T048"]
cui_filters = set()
for type_id in tuis:
cui_filters.update(cat.cdb.addl_info["type_id2cuis"][type_id])
cat.cdb.config.linking["filters"]["cuis"] = cui_filters
Using this model pack we can already extract entities from an example note.
[19]:
text = "He was diagnosed with kidney failure"
doc = cat(text)
doc.ents
[19]:
(kidney failure,)
[20]:
# Example output of an extracted medcat entity; note that ehrapy will deal with this automatically and the here displayed manual extraction is not required.
# CUI: Concept Unique Identifier, which is just an unique identifier for each concept extracted
cat.get_entities("He was diagnosed with kidney failure", only_cui=False)
[20]:
{'entities': {2: {'pretty_name': 'Kidney Failure',
'cui': 'C0035078',
'type_ids': ['T047'],
'types': ['Disease or Syndrome'],
'source_value': 'kidney failure',
'detected_name': 'kidney~failure',
'acc': 1.0,
'context_similarity': 1.0,
'start': 22,
'end': 36,
'icd10': [],
'ontologies': [],
'snomed': [],
'id': 2,
'meta_anns': {'Status': {'value': 'Affirmed',
'confidence': 0.9999961853027344,
'name': 'Status'}}}},
'tokens': []}
Extracting and visualizing all disease entities¶
To extract all disease entities from our example dataset we require a complete annotation of the dataset. This step is computationally expensive and may take some time.
[21]:
ep.tl.annotate_text(adata_encoded, cat, text_column="text", n_proc=2)
The annotated results as extracted by MedCAT are transformed and stored into a Pandas DataFrame in adata.uns
.
[23]:
adata_encoded.uns["medcat_annotations"].head()
[23]:
row_nr | pretty_name | cui | type_ids | types | meta_anns | |
---|---|---|---|---|---|---|
0 | 0 | Diabetes | C0011847 | [T047] | [Disease or Syndrome] | Affirmed |
1 | 0 | Sepsis | C0243026 | [T047] | [Disease or Syndrome] | Other |
2 | 0 | Respiratory Distress Syndrome, Adult | C0035222 | [T047] | [Disease or Syndrome] | Affirmed |
3 | 0 | Pulmonary Embolism | C0034065 | [T047] | [Disease or Syndrome] | Other |
4 | 0 | Respiratory Failure | C1145670 | [T047] | [Disease or Syndrome] | Affirmed |
We can also get a proper overview for the top 10 most entities found in the data (affirmed diagnoses only).
[24]:
ep.tl.get_medcat_annotation_overview(adata_encoded).sort_values('n_patient_visit_percent', ascending=False).head(10)
[24]:
pretty_name | type_ids | types | n_patient_visit | n_patient_visit_percent | |
---|---|---|---|---|---|
cui | |||||
C0020538 | Hypertensive disease | T047 | Disease or Syndrome | 432 | 42.394504 |
C0028754 | Obesity | T047 | Disease or Syndrome | 245 | 24.043180 |
C0011847 | Diabetes | T047 | Disease or Syndrome | 149 | 14.622179 |
C0010054 | Coronary Arteriosclerosis | T047 | Disease or Syndrome | 127 | 12.463199 |
C0011849 | Diabetes Mellitus | T047 | Disease or Syndrome | 118 | 11.579980 |
C0012634 | Disease | T047 | Disease or Syndrome | 110 | 10.794897 |
C0038454 | Cerebrovascular accident | T047 | Disease or Syndrome | 110 | 10.794897 |
C0004096 | Asthma | T047 | Disease or Syndrome | 97 | 9.519136 |
C0024117 | Chronic Obstructive Airway Disease | T047 | Disease or Syndrome | 92 | 9.028459 |
C0011570 | Mental Depression | T048 | Mental or Behavioral Dysfunction | 91 | 8.930324 |
We can add annotated entities as binary columns to adata.obs
. From there, it can be used e.g. for plotting or any further analysis involving an annotation in adata.obs
. First, we calculate a UMAP embedding using the AnnData object and then we color by all patients that had detected annotations for e.g. Diabetes or Congestive heart failure.
[66]:
ep.tl.add_medcat_annotation_to_obs(adata_encoded, name=["Diabetes", "Congestive heart failure"])
[53]:
adata_encoded.obs.head()
[53]:
text | chartdate | dob | category | gender | Diabetes | Congestive heart failure | |
---|---|---|---|---|---|---|---|
0 | HISTORY OF PRESENT ILLNESS:, The patient is a ... | 2079-01-01 | 2018-01-01 | General Medicine | F | True | False |
1 | HISTORY OF PRESENT ILLNESS: , A 71-year-old fe... | 2079-01-01 | 2018-01-01 | Rheumatology | F | False | False |
2 | HISTORY OF PRESENT ILLNESS:, The patient is a ... | 2079-01-01 | 2018-01-01 | Consult - History and Phy. | F | True | False |
3 | CHIEF COMPLAINT:,1. Infection.,2. Pelvic pai... | 2037-01-01 | 2018-01-01 | Consult - History and Phy. | F | False | False |
4 | SUBJECTIVE:, This is a 29-year-old Vietnamese... | 2037-01-01 | 2018-01-01 | Dermatology | F | False | False |
Typos when trying to move annotations to obs will raise a warning with suggestions for the annotation
[54]:
try:
ep.tl.add_medcat_annotation_to_obs(adata_encoded, name=["Diubetes", "Congestive heart failure", "Heart failre", "hello"])
except Exception as e:
print(e)
Did not find ['Diubetes', 'Heart failre', 'hello'] in MedCAT's extracted entities and added them not to .obs. Do you mean ['Diabetes', 'Heart failure', 'Brucellosis']?
In this example, we consider NaN’s in the annotation to be False
, due to the lack of the model’s certainty for this entity.
[71]:
adata_encoded.obs["Diabetes"] = adata_encoded.obs["Diabetes"].fillna(False).astype(str)
adata_encoded.obs["Congestive heart failure"] = adata_encoded.obs["Congestive heart failure"].fillna(False).astype(str)
[72]:
ep.pp.neighbors(adata_encoded)
ep.tl.umap(adata_encoded)
ep.pl.umap(adata_encoded, color=["Diabetes", "Congestive heart failure"])
[73]:
ep.pl.tracksplot(adata_encoded, list(adata_encoded.var_names), groupby="Diabetes")