Note

This page was generated from medcat.ipynb. Some tutorial content may look better in light mode.

Extracting from free text with MedCAT

This tutorial introduces how to use ehrapy together with MedCAT. MedCAT is a tool that extracts medical entities from free text and links them to biomedical ontologies. Biomedical entities can be anything biomedical: not only diagnoses or diseases but also symptoms, drugs or even peptides. MedCAT also tries to keep the context of an extracted entity (for example, whether a specific disease has been diagnosed or not). This is especially important for electronic health record data, as doctors' notes are most often simply copied and pasted into the data without any preprocessing. Consider the following example:

  • The patient suffers from diabetes.

vs.

  • The patient does not suffer from diabetes.

In detail, ehrapy uses a pretrained and packaged model from MedCAT (https://medcat.readthedocs.io/en/latest/main.html#models). This model is limited in performance but good enough for this demonstration. The MedCAT maintainers plan to release a larger (trained) model at some point in the (near) future.

[1]:
import ehrapy as ep
import pandas as pd
[2]:
ep.settings.n_jobs = 2

Download the example data

[3]:
!wget -nc https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/pt_notes.csv -P ./medcat_data/
!wget -nc https://medcat.rosalind.kcl.ac.uk/media/medmen_wstatus_2021_oct.zip -P ./medcat_data/
File ‘./medcat_data/pt_notes.csv’ already there; not retrieving.

File ‘./medcat_data/medmen_wstatus_2021_oct.zip’ already there; not retrieving.

Custom MedCAT object of ehrapy

To allow for seamless interoperability of ehrapy and MedCAT, we require a “superobject” that references the AnnData object while providing the MedCAT functionality. This object also stores MedCAT-related features such as a vocabulary, a concept database and (later on in this tutorial) the annotated results.

First, we read the example data into an AnnData object, encode the data and calculate neighbors for subsequent processing. As a next step, we create the MedCAT object with a model pack and set default TUI (Type Unique Identifier) filters, which restrict the extraction to selected semantic types. A full list of TUIs can be found at: https://lhncbc.nlm.nih.gov/ii/tools/MetaMap/Docs/SemanticTypes_2018AB.txt

[4]:
adata = ep.io.read_csv("medcat_data/pt_notes.csv", columns_obs_only=["text"])
adata_encoded = ep.pp.encode(adata, autodetect=True)
ep.pp.neighbors(adata_encoded)

# create the main ehrapy MedCAT object used for medical entity analysis with ehrapy and MedCAT
ep_medcat = ep.tl.MedCAT(
    adata_encoded, model_pack_path="./medcat_data/medmen_wstatus_2021_oct.zip"
)
# only use disease (T047) and behavioural disorder (T048) diagnoses in this example by filtering by TUI
ep_medcat.set_filter_by_tui(tuis=["T047", "T048"])
2023-08-10 17:54:04,813 - root INFO - Transformed passed dataframe into an AnnData object with n_obs x n_vars = `1088` x `8`.
2023-08-10 17:54:04,814 - root INFO - The original categorical values `['gender', 'category']` were added to uns.
2023-08-10 17:54:04,832 - root INFO - Encoding strings in X to save to .h5ad. Loading the file will reverse the encoding.
2023-08-10 17:54:04,837 - root INFO - Updated the original layer after encoding.
2023-08-10 17:54:04,853 - root INFO - The original categorical values `['gender', 'category']` were added to obs.
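
Conceptually, a TUI filter keeps only those entities whose semantic type is in an allowed set. The following is a minimal plain-Python sketch of that idea, not ehrapy's or MedCAT's implementation. The `keep_by_tui` helper and the sample entities are hypothetical and only mimic the `type_ids` field that MedCAT reports for each entity.

```python
def keep_by_tui(entities, allowed_tuis):
    """Return the entities that carry at least one of the allowed TUIs."""
    allowed = set(allowed_tuis)
    return [ent for ent in entities if allowed.intersection(ent["type_ids"])]


# Illustrative entities (made up for this sketch); only "type_ids" matters here.
sample_entities = [
    {"pretty_name": "Kidney Failure", "type_ids": ["T047"]},
    {"pretty_name": "Mental Depression", "type_ids": ["T048"]},
    {"pretty_name": "Aspirin", "type_ids": ["T121"]},  # T121: Pharmacologic Substance
]

# Same TUIs as in the cell above: diseases and behavioural dysfunctions.
kept = keep_by_tui(sample_entities, ["T047", "T048"])
print([ent["pretty_name"] for ent in kept])
```

Using a set intersection means an entity with several semantic types is kept as soon as one of them is allowed.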

Using this model pack we can already extract entities from an example note.

[5]:
text = "He was diagnosed with kidney failure"
doc = ep_medcat.cat(text)
doc.ents
[5]:
(kidney failure,)
[6]:
# Example output of an extracted MedCAT entity; ehrapy handles this automatically, so the manual extraction shown here is not required.
# CUI: Concept Unique Identifier, a unique identifier for each extracted concept
ep_medcat.cat.get_entities("He was diagnosed with kidney failure", only_cui=False)
[6]:
{'entities': {2: {'pretty_name': 'Kidney Failure',
   'cui': 'C0035078',
   'type_ids': ['T047'],
   'types': ['Disease or Syndrome'],
   'source_value': 'kidney failure',
   'detected_name': 'kidney~failure',
   'acc': 1.0,
   'context_similarity': 1.0,
   'start': 22,
   'end': 36,
   'icd10': [],
   'ontologies': [],
   'snomed': [],
   'id': 2,
   'meta_anns': {'Status': {'value': 'Affirmed',
     'confidence': 0.9999961853027344,
     'name': 'Status'}}}},
 'tokens': []}
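
The nested dictionary returned above can be flattened into one record per entity. The sketch below hard-codes a trimmed copy of that output so that it runs without a model pack; `summarize_entities` is a hypothetical helper and not part of ehrapy or MedCAT.

```python
# Trimmed copy of the get_entities output shown above.
result = {
    "entities": {
        2: {
            "pretty_name": "Kidney Failure",
            "cui": "C0035078",
            "type_ids": ["T047"],
            "source_value": "kidney failure",
            "start": 22,
            "end": 36,
            "meta_anns": {
                "Status": {
                    "value": "Affirmed",
                    "confidence": 0.9999961853027344,
                    "name": "Status",
                }
            },
        }
    },
    "tokens": [],
}


def summarize_entities(result):
    """Flatten a get_entities-style dict into (name, CUI, status) tuples."""
    rows = []
    for ent in result["entities"].values():
        # The status is None if no "Status" meta annotation is present.
        status = ent.get("meta_anns", {}).get("Status", {}).get("value")
        rows.append((ent["pretty_name"], ent["cui"], status))
    return rows


text = "He was diagnosed with kidney failure"
summary = summarize_entities(result)
print(summary)

# The start/end offsets index into the original text.
ent = result["entities"][2]
print(text[ent["start"]:ent["end"]])
```

The `Status` meta annotation is what distinguishes an affirmed diagnosis from other mentions, such as the negated sentence in the introduction.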

Extracting and visualizing all disease entities

To extract all disease entities from our example dataset we require a complete annotation of the dataset. This step is computationally expensive and may take some time.

[7]:
ep.tl.mc.annotate_text(ep_medcat, text_column="text", n_proc=2)

The annotated results extracted by MedCAT are transformed and stored in a pandas DataFrame.

[8]:
ep_medcat.annotated_results
[8]:
       row_nr  pretty_name                 cui       type_ids  types                  meta_anns
0      0       Degenerative polyarthritis  C0029408  [T047]    [Disease or Syndrome]  Affirmed
1      0       Atrial Fibrillation         C0004238  [T047]    [Disease or Syndrome]  Affirmed
2      0       Hypertensive disease        C0020538  [T047]    [Disease or Syndrome]  Affirmed
3      0       Asthma                      C0004096  [T047]    [Disease or Syndrome]  Affirmed
4      0       Sleep Apnea, Obstructive    C0520679  [T047]    [Disease or Syndrome]  Affirmed
...    ...     ...                         ...       ...       ...                    ...
10781  999     Diabetes                    C0011847  [T047]    [Disease or Syndrome]  Other
10782  999     Diabetes                    C0011847  [T047]    [Disease or Syndrome]  Affirmed
10783  999     Diabetes                    C0011847  [T047]    [Disease or Syndrome]  Affirmed
10784  999     Diabetes Mellitus           C0011849  [T047]    [Disease or Syndrome]  Affirmed
10785  999     Diabetes                    C0011847  [T047]    [Disease or Syndrome]  Affirmed

10786 rows × 6 columns

We can also get an overview of the 10 most frequent entities found in the data (affirmed diagnoses only).

[9]:
ep.tl.mc.get_annotation_overview(ep_medcat)
  pretty_name                  type_ids   types                         n_patient_visit   n_patient_visit_percent 
 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  Hypertensive disease         T047       Disease or Syndrome           432               42.39450441609421
  Obesity                      T047       Disease or Syndrome           245               24.043179587831208      
  Diabetes                     T047       Disease or Syndrome           149               14.622178606476938
  Coronary Arteriosclerosis    T047       Disease or Syndrome           127               12.463199214916584      
  Diabetes Mellitus            T047       Disease or Syndrome           118               11.579980372914623
  Disease                      T047       Disease or Syndrome           110               10.794896957801766      
  Cerebrovascular accident     T047       Disease or Syndrome           110               10.794896957801766
  Asthma                       T047       Disease or Syndrome           97                9.519136408243375       
  Chronic Obstructive Airway   T047       Disease or Syndrome           92                9.028459273797841
  Disease
  Mental Depression            T048       Mental or Behavioral          91                8.930323846908735       
                                          Dysfunction                                                             

Alternatively, we can get an overview of the top 10 diagnoses that were not affirmed (status "Other"), such as negated diagnoses.

[10]:
ep.tl.mc.get_annotation_overview(ep_medcat, status="Other")
  pretty_name                 type_ids   types                          n_patient_visit   n_patient_visit_percent 
 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  Cerebrovascular accident    T047       Disease or Syndrome            119               15.973154362416109
  Erythema                    T047       Disease or Syndrome            85                11.409395973154362      
  Diabetes                    T047       Disease or Syndrome            78                10.46979865771812
  Hypertensive disease        T047       Disease or Syndrome            68                9.12751677852349        
  disorder lesions skin       T047       Disease or Syndrome            68                9.12751677852349
  Lymphadenopathy             T047       Disease or Syndrome            64                8.590604026845638       
  nervous system disorder     T047       Disease or Syndrome            58                7.785234899328859
  Coronary Arteriosclerosis   T047       Disease or Syndrome            50                6.7114093959731544      
  Drug abuse                  T048       Mental or Behavioral           50                6.7114093959731544
                                         Dysfunction
  Pneumonia                   T047       Disease or Syndrome            50                6.7114093959731544      
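
To make the two overview tables more concrete, here is a rough plain-Python sketch of how such counts could be derived from annotated results. It is not ehrapy's implementation: `annotation_overview` and the sample annotations are hypothetical. The sketch assumes that `n_patient_visit` counts each entity once per row. The printed percentages imply a denominator smaller than the 1088 rows in the dataset (432 / 1019 ≈ 42.39 %), so the sketch further assumes that only rows with at least one annotation of the given status are counted; this is an inference from the numbers, not documented behaviour.

```python
from collections import Counter

# Hypothetical annotations as (row_nr, pretty_name, status) tuples,
# mirroring three columns of the annotated results table.
annotations = [
    (0, "Asthma", "Affirmed"),
    (0, "Diabetes", "Affirmed"),
    (0, "Diabetes", "Affirmed"),  # repeated mention in the same row
    (1, "Diabetes", "Other"),
    (1, "Asthma", "Affirmed"),
    (2, "Diabetes", "Affirmed"),
    (3, "Pneumonia", "Other"),
]


def annotation_overview(annotations, status="Affirmed", top_n=10):
    """Count distinct rows per entity for one status, with percentages."""
    # A set removes repeated mentions of an entity within the same row.
    pairs = {(row, name) for row, name, st in annotations if st == status}
    n_rows = len({row for row, _ in pairs})  # assumed denominator
    counts = Counter(name for _, name in pairs)
    # Sort by descending count, then by name, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(name, n, 100 * n / n_rows) for name, n in ranked[:top_n]]


affirmed = annotation_overview(annotations)
other = annotation_overview(annotations, status="Other")
print(affirmed)
print(other)
```

Because the percentages are computed per status, the same entity can appear in both overviews, as Diabetes and Hypertensive disease do in the tables above.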

Since our MedCAT object is aware of both the annotation results and the AnnData object, we can pass it to any plotting function of ehrapy, just like we are used to with AnnData objects. First, we calculate a UMAP embedding using the AnnData object and then we color by all patients that had some form of Diabetes or Congestive heart failure.

[11]:
ep.tl.umap(ep_medcat.anndata)
ep.pl.umap(ep_medcat, color=["Diabetes", "Congestive heart failure"])
../../_images/tutorials_notebooks_medcat_20_0.png

It is still possible to color by the original columns in the AnnData object with the MedCAT object.

[12]:
ep.pl.umap(ep_medcat, color=["gender", "age_year", "Diabetes"])
../../_images/tutorials_notebooks_medcat_22_0.png

Passing an AnnData object instead of an ehrapy MedCAT object also works, but coloring by extracted entities is then not available.

[13]:
ep.pl.umap(ep_medcat.anndata, color=["gender", "age_year"])
../../_images/tutorials_notebooks_medcat_24_0.png

Typos are automatically fixed by ehrapy whenever possible.

[14]:
ep.pl.umap(ep_medcat, color=["Diubetes", "Heart Failure"])
Did not find Diubetes in MedCAT's extracted entities. Will use best match Diabetes!
Did not find Heart Failure in MedCAT's extracted entities. Will use best match Heart failure!
../../_images/tutorials_notebooks_medcat_26_2.png
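
The "best match" behaviour can be illustrated with approximate string matching. The sketch below uses `difflib` from the Python standard library as a stand-in; ehrapy's actual matching strategy may differ, and `best_match` as well as the list of known entities are hypothetical.

```python
import difflib


def best_match(name, known_entities, cutoff=0.6):
    """Return name if it is known, else its closest known entity (or None)."""
    if name in known_entities:
        return name
    matches = difflib.get_close_matches(name, known_entities, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Illustrative list of entity names; in practice these would be the
# extracted entities of the MedCAT object.
known = ["Diabetes", "Heart failure", "Congestive heart failure", "Asthma"]
for query in ["Diubetes", "Heart Failure", "Asthma"]:
    print(query, "->", best_match(query, known))
```

The `cutoff` parameter controls how dissimilar a name may be before no match is returned; a higher value makes the correction more conservative.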

All other features of ehrapy are of course also available.

[15]:
ep.tl.leiden(ep_medcat.anndata, resolution=0.5, key_added="leiden_0_5")
ep.tl.paga(ep_medcat.anndata, groups="leiden_0_5")

# paga currently does not support medcat extracted entities directly (as of ehrapy 0.2.0)
ep.pl.paga(
    ep_medcat.anndata,
    color=["leiden_0_5", "gender"],
    cmap=ep.pl.Colormaps.grey_red.value,
    title=["Leiden 0.5", "Gender"],
)
../../_images/tutorials_notebooks_medcat_28_0.png
[16]:
ep.tl.draw_graph(ep_medcat.anndata, init_pos="paga")
ep.pl.draw_graph(ep_medcat, color=["gender", "Diabetes"], legend_loc="on data")
WARNING: Package 'fa2' is not installed, falling back to layout 'fr'.To use the faster and better ForceAtlas2 layout, install package 'fa2' (`pip install fa2`).
../../_images/tutorials_notebooks_medcat_29_1.png
[17]:
ep.pl.tracksplot(ep_medcat, list(ep_medcat.anndata.var_names), groupby="Diabetes")
../../_images/tutorials_notebooks_medcat_30_0.png