Note
This page was generated from fhir.ipynb. Some tutorial content may look better in light mode.
FHIR example
FHIR (Fast Healthcare Interoperability Resources) is a standard for healthcare data exchange, developed by Health Level Seven International (HL7). It is built on modern web technologies and is designed to enable easier access to healthcare information, supporting JSON, XML, and RDF formats for data representation. FHIR defines a set of “resources” that represent granular clinical concepts, like patients, admissions, and medications, facilitating integration with existing healthcare systems and the development of new applications.
We want to emphasize that FHIR is primarily designed for data exchange, not for storing retrospective observational data. Such data should be stored in the OMOP format instead.
Here, we’ll show a quick example on a synthetic dataset to demonstrate how one could work with FHIR data in ehrapy.
[1]:
import ehrapy as ep
[2]:
%%capture
!wget https://synthetichealth.github.io/synthea-sample-data/downloads/latest/synthea_sample_data_fhir_latest.zip
!mkdir fhir_dataset
!unzip synthea_sample_data_fhir_latest.zip -d fhir_dataset
FHIR data is often nested and contains lists and dictionaries. Generally, there are three options to deal with this:
1. Transform the data into an awkward array and flatten it when needed.
2. Extract values from all lists and dictionaries to store single values in the fields.
3. Remove all lists and dictionaries. Only do this if the information is not relevant to you.
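To make the nesting concrete, the following sketch flattens one hand-written bundle entry into dotted keys using only the standard library. The sample entry and the `flatten` helper are illustrative assumptions, not Synthea data and not ehrapy's implementation. Note that dictionaries collapse into dotted names while lists survive as list values, which is why one of the three options is still needed afterwards.

```python
from typing import Any

# Hand-written, minimal FHIR-style bundle entry (illustrative, not real Synthea output).
entry = {
    "fullUrl": "urn:uuid:example-patient",
    "resource": {
        "resourceType": "Patient",
        "gender": "female",
        "birthDate": "1980-01-01",
        "maritalStatus": {"text": "Never Married"},
        "name": [{"family": "Doe", "given": ["Jane"]}],
    },
    "request": {"method": "POST", "url": "Patient"},
}


def flatten(node: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Recursively flatten nested dicts into dotted keys; lists are kept as-is."""
    flat: dict[str, Any] = {}
    for key, value in node.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


flat = flatten(entry)
for key, value in flat.items():
    print(f"{key}: {value!r}")
```

Here `resource.maritalStatus.text` becomes a plain string field, whereas `resource.name` is still a list of dictionaries.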
Here, we’ll work with pandas DataFrames to be able to apply option 3.
[3]:
df = ep.io.read_fhir("fhir_dataset", return_df=True)
df = df[:1000] # The dataset is very large so we subset to the first 1000 records
[4]:
# Option 3: We're dropping any columns that contain lists or dictionaries and all columns that only contain NA values
df.drop(columns=[col for col in df.columns if any(isinstance(x, (list, dict)) for x in df[col].dropna())], inplace=True)
df.drop(columns=df.columns[df.isna().all()], inplace=True)
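If the nested values matter, option 2 can be sketched as follows. The toy DataFrame, the `first_scalar` helper, and the rule of taking the first list element or the first dictionary value are assumptions made for illustration; real FHIR fields usually need a field-specific decision about which value to keep.

```python
import pandas as pd

# Toy frame with nested values (hypothetical, mimicking FHIR-like columns).
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "resource.name": [[{"family": "Doe"}], [{"family": "Roe"}], None],
        "resource.gender": ["female", "male", None],
    }
)


def first_scalar(value):
    """Descend into lists (first element) and dicts (first value) until a scalar remains."""
    while isinstance(value, (list, dict)):
        if not value:  # an empty container carries no information
            return None
        value = value[0] if isinstance(value, list) else next(iter(value.values()))
    return value


# Scalars pass through unchanged, so the helper can be mapped over every column.
for col in toy.columns:
    toy[col] = toy[col].map(first_scalar)

print(toy)
```

After this step no column contains lists or dictionaries, so nothing has to be dropped for that reason.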
[5]:
adata = ep.ad.df_to_anndata(df, index_column="id")
adata
[5]:
AnnData object with n_obs × n_vars = 1000 × 68
layers: 'original'
[6]:
ep.ad.infer_feature_types(adata)
❗ Feature resource.suppliedItem.quantity.value was detected as a categorical feature stored numerically. Please verify.
Detected feature types for AnnData object with 1000 obs and 68 vars
╠══ 📅 Date features
║   ╠══ resource.abatementDateTime
║   ╠══ resource.authoredOn
║   ╠══ resource.billablePeriod.end
║   ╠══ resource.billablePeriod.start
║   ╠══ resource.birthDate
║   ╠══ resource.context.period.end
║   ╠══ resource.context.period.start
║   ╠══ resource.created
║   ╠══ resource.date
║   ╠══ resource.deceasedDateTime
║   ╠══ resource.effectiveDateTime
║   ╠══ resource.expirationDate
║   ╠══ resource.issued
║   ╠══ resource.manufactureDate
║   ╠══ resource.occurrenceDateTime
║   ╠══ resource.onsetDateTime
║   ╠══ resource.performedPeriod.end
║   ╠══ resource.performedPeriod.start
║   ╠══ resource.period.end
║   ╠══ resource.period.start
║   ╠══ resource.recorded
║   ╚══ resource.recordedDate
╠══ 📐 Numerical features
║   ╠══ resource.distinctIdentifier
║   ╠══ resource.lotNumber
║   ╠══ resource.payment.amount.value
║   ╠══ resource.serialNumber
║   ╠══ resource.total.value
║   ╚══ resource.valueQuantity.value
╚══ 🗂️ Categorical features
    ╠══ fullUrl (1000 categories)
    ╠══ patientId (5 categories)
    ╠══ request.method (1 categories)
    ╠══ request.url (16 categories)
    ╠══ resource.claim.reference (126 categories)
    ╠══ resource.class.code (3 categories)
    ╠══ resource.class.system (1 categories)
    ╠══ resource.code.text (125 categories)
    ╠══ resource.custodian.reference (8 categories)
    ╠══ resource.encounter.reference (57 categories)
    ╠══ resource.facility.reference (8 categories)
    ╠══ resource.gender (1 categories)
    ╠══ resource.intent (1 categories)
    ╠══ resource.location.reference (4 categories)
    ╠══ resource.maritalStatus.text (2 categories)
    ╠══ resource.medicationCodeableConcept.text (15 categories)
    ╠══ resource.multipleBirthBoolean (1 categories)
    ╠══ resource.outcome (1 categories)
    ╠══ resource.patient.reference (2 categories)
    ╠══ resource.payment.amount.currency (1 categories)
    ╠══ resource.prescription.reference (70 categories)
    ╠══ resource.primarySource (1 categories)
    ╠══ resource.provider.reference (16 categories)
    ╠══ resource.referral.reference (1 categories)
    ╠══ resource.requester.reference (6 categories)
    ╠══ resource.resourceType (16 categories)
    ╠══ resource.serviceProvider.reference (8 categories)
    ╠══ resource.status (8 categories)
    ╠══ resource.subject.reference (2 categories)
    ╠══ resource.suppliedItem.itemCodeableConcept.text (1 categories)
    ╠══ resource.suppliedItem.quantity.value (1 categories)
    ╠══ resource.text.status (1 categories)
    ╠══ resource.total.currency (1 categories)
    ╠══ resource.type.text (1 categories)
    ╠══ resource.use (1 categories)
    ╠══ resource.vaccineCode.text (5 categories)
    ╠══ resource.valueCodeableConcept.text (5 categories)
    ╠══ resource.valueQuantity.code (21 categories)
    ╠══ resource.valueQuantity.system (1 categories)
    ╚══ resource.valueQuantity.unit (21 categories)
[ ]:
ep.pp.knn_impute(adata)
Quality control metrics missing. Calculating...
Feature resource.period.start had more than 94.80% missing values!
Feature resource.medicationCodeableConcept.text had more than 97.40% missing values!
Feature resource.distinctIdentifier had more than 99.50% missing values!
Feature resource.type.text had more than 99.50% missing values!
Feature resource.text.status had more than 99.40% missing values!
Feature resource.facility.reference had more than 88.80% missing values!
Feature resource.abatementDateTime had more than 97.70% missing values!
Feature resource.vaccineCode.text had more than 98.10% missing values!
Feature resource.referral.reference had more than 93.10% missing values!
Feature resource.lotNumber had more than 99.50% missing values!
Feature resource.outcome had more than 93.10% missing values!
Feature resource.primarySource had more than 98.10% missing values!
Feature resource.location.reference had more than 90.80% missing values!
Feature resource.class.code had more than 95.60% missing values!
Feature resource.created had more than 86.20% missing values!
Feature resource.context.period.start had more than 95.70% missing values!
Feature resource.custodian.reference had more than 95.70% missing values!
Feature resource.patient.reference had more than 82.40% missing values!
Feature resource.authoredOn had more than 97.40% missing values!
Feature resource.claim.reference had more than 93.10% missing values!
Feature resource.valueCodeableConcept.text had more than 96.90% missing values!
Feature resource.valueString had more than 99.90% missing values!
Feature resource.context.period.end had more than 95.70% missing values!
Feature resource.suppliedItem.itemCodeableConcept.text had more than 98.60% missing values!
Feature resource.total.currency had more than 93.10% missing values!
Feature resource.payment.amount.value had more than 93.10% missing values!
Feature resource.deceasedDateTime had more than 99.90% missing values!
Feature resource.total.value had more than 93.10% missing values!
Feature resource.serialNumber had more than 99.50% missing values!
Feature resource.requester.reference had more than 97.40% missing values!
Feature resource.recordedDate had more than 95.90% missing values!
Feature resource.payment.amount.currency had more than 93.10% missing values!
Feature resource.performedPeriod.start had more than 92.70% missing values!
Feature resource.gender had more than 99.80% missing values!
Feature resource.onsetDateTime had more than 95.90% missing values!
Feature resource.date had more than 95.70% missing values!
Feature resource.provider.reference had more than 86.20% missing values!
Feature resource.recorded had more than 99.90% missing values!
Feature resource.manufactureDate had more than 99.50% missing values!
Feature resource.billablePeriod.start had more than 86.20% missing values!
Feature resource.class.system had more than 95.60% missing values!
Feature resource.maritalStatus.text had more than 99.80% missing values!
Feature resource.multipleBirthBoolean had more than 99.80% missing values!
Feature resource.serviceProvider.reference had more than 95.60% missing values!
Feature resource.use had more than 86.20% missing values!
Feature resource.performedPeriod.end had more than 92.70% missing values!
Feature resource.period.end had more than 95.60% missing values!
Feature resource.expirationDate had more than 99.50% missing values!
Feature resource.birthDate had more than 99.80% missing values!
Feature resource.occurrenceDateTime had more than 96.70% missing values!
Feature resource.intent had more than 97.00% missing values!
Feature resource.billablePeriod.end had more than 86.20% missing values!
Feature resource.suppliedItem.quantity.value had more than 98.60% missing values!
Feature resource.prescription.reference had more than 97.40% missing values!
scikit-learn-intelex is not available. Install via pip install scikit-learn-intelex for faster imputations.
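The warnings above report per-feature missingness. Very sparse features like these could be inspected, and optionally filtered, on the DataFrame before it is converted to AnnData. A minimal pandas sketch, where the toy data and the 0.5 cutoff are arbitrary assumptions rather than an ehrapy recommendation:

```python
import numpy as np
import pandas as pd

# Toy frame: "sparse" is 90% missing, "dense" is complete (hypothetical data).
toy = pd.DataFrame(
    {
        "dense": np.arange(10, dtype=float),
        "sparse": [1.0] + [np.nan] * 9,
    }
)

# Fraction of missing values per column.
missing_fraction = toy.isna().mean()
print(missing_fraction)

# Keep only columns whose missing fraction is below an assumed cutoff.
max_missing = 0.5  # arbitrary threshold, tune per analysis
kept = toy.loc[:, missing_fraction < max_missing]
print(list(kept.columns))
```

Whether to drop or impute such features depends on the analysis; the sketch only shows how the fractions can be computed.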
[26]:
adata = ep.pp.encode(adata, autodetect=True)
2024-04-20 16:44:36,559 - root INFO - The original categorical values `['fullUrl', 'resource.resourceType', 'resource.text.status', 'resource.gender', 'resource.birthDate', 'resource.maritalStatus.text', 'resource.multipleBirthBoolean', 'request.method', 'request.url', 'resource.status', 'resource.class.system', 'resource.class.code', 'resource.subject.reference', 'resource.period.start', 'resource.period.end', 'resource.serviceProvider.reference', 'resource.code.text', 'resource.encounter.reference', 'resource.onsetDateTime', 'resource.recordedDate', 'resource.effectiveDateTime', 'resource.issued', 'resource.date', 'resource.custodian.reference', 'resource.context.period.start', 'resource.context.period.end', 'resource.use', 'resource.patient.reference', 'resource.billablePeriod.start', 'resource.billablePeriod.end', 'resource.created', 'resource.provider.reference', 'resource.facility.reference', 'resource.total.currency', 'resource.referral.reference', 'resource.claim.reference', 'resource.outcome', 'resource.payment.amount.currency', 'resource.abatementDateTime', 'resource.distinctIdentifier', 'resource.manufactureDate', 'resource.expirationDate', 'resource.lotNumber', 'resource.serialNumber', 'resource.type.text', 'resource.valueQuantity.unit', 'resource.valueQuantity.system', 'resource.valueQuantity.code', 'resource.valueCodeableConcept.text', 'resource.performedPeriod.start', 'resource.performedPeriod.end', 'resource.location.reference', 'resource.vaccineCode.text', 'resource.occurrenceDateTime', 'resource.primarySource', 'resource.valueString', 'resource.intent', 'resource.medicationCodeableConcept.text', 'resource.authoredOn', 'resource.requester.reference', 'resource.prescription.reference', 'resource.recorded', 'patientId', 'resource.deceasedDateTime', 'resource.suppliedItem.itemCodeableConcept.text']` were added to uns.
2024-04-20 16:44:36,740 - root INFO - Updated the original layer after encoding.
/Users/mamba/PycharmProjects/ehrapy/ehrapy/preprocessing/_encoding.py:282: DeprecationWarning: Converting `np.inexact` or `np.floating` to a dtype is deprecated. The current result is `float64` which is not strictly correct.
encoded_ann_data.X = encoded_ann_data.X.astype(np.number)
[11]:
ep.pp.pca(adata)
ep.pp.neighbors(adata, n_pcs=10)
ep.tl.umap(adata)
ep.tl.leiden(adata, resolution=0.3, key_added="leiden_0_3")
[12]:
ep.pl.umap(adata, color=["leiden_0_3"], title="Leiden 0.3", size=20)