Skip to content

You can download this notebook directly here

A gentle demo

import datetime
import pandas as pd

import eds_scikit
spark, sc, sql = eds_scikit.improve_performances() # 

Loading data

Data loading is made easy by using the HiveData object.
Simply give it the name of the database you want to use:

database_name = "MY_DATABASE_NAME"
from eds_scikit.io import HiveData

data = HiveData(
    database_name="database_name",
)

Now your tables are available as Koalas DataFrames: Those are basically Spark DataFrames which allows for the Pandas API to be used on top (see the Project description of eds-scikit's documentation for more informations.)

What we need to extract:

  • Patients with diabetes
  • Patients with Covid-19
  • Visits from those patients, and their ICU/Non-ICU status

Let us import what's necessary from eds-scikit:

from eds_scikit.event import conditions_from_icd10
from eds_scikit.event.diabetes import (
    diabetes_from_icd10,
    DEFAULT_DIABETE_FROM_ICD10_CONFIG,
)
from eds_scikit.icu import tag_icu_visit
DATE_MIN = datetime.datetime(2020, 1, 1)
DATE_MAX = datetime.datetime(2021, 6, 1)

Extracting the diabetic status

Luckily, a function is available to extract diabetic patients from ICD-10:

diabetes = diabetes_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)

We can check the default parameters used here:

DEFAULT_DIABETE_FROM_ICD10_CONFIG
{'additional_filtering': {'condition_status_source_value': {'DP', 'DAS'}},
 'codes': {'DIABETES_INSIPIDUS': {'code_list': ['E232', 'N251'],
                                  'code_type': 'exact'},
           'DIABETES_IN_PREGNANCY': {'code_list': ['O24'],
                                     'code_type': 'prefix'},
           'DIABETES_MALNUTRITION': {'code_list': ['E12'],
                                     'code_type': 'prefix'},
           'DIABETES_TYPE_I': {'code_list': ['E10'], 'code_type': 'prefix'},
           'DIABETES_TYPE_II': {'code_list': ['E11'], 'code_type': 'prefix'},
           'OTHER_DIABETES_MELLITUS': {'code_list': ['E13', 'E14'],
                                       'code_type': 'prefix'}},
 'date_from_visit': True,
 'default_code_type': 'prefix'}

We are only interested in diabetes mellitus, although we extracted other types of diabetes:

diabetes.concept.value_counts()
DIABETES_TYPE_II           117843
DIABETES_TYPE_I             10597
OTHER_DIABETES_MELLITUS      6031
DIABETES_IN_PREGNANCY        2597
DIABETES_INSIPIDUS           1089
DIABETES_MALNUTRITION         199
Name: concept, dtype: int64

We will restrict the types of diabetes used here:

diabetes_cohort = (
    diabetes[
        diabetes.concept.isin(
            {
                "DIABETES_TYPE_I",
                "DIABETES_TYPE_II",
                "OTHER_DIABETES_MELLITUS",
            }
        )
    ]
    .person_id.unique()
    .reset_index()
)
diabetes_cohort.loc[:, "HAS_DIABETE"] = True

Extracting the Covid status

Using the conditions_from_icd10 function, we will extract visits linked to COVID-19:

codes = dict(
    COVID=dict(
        code_list=r"U071[0145]", 
        code_type="regex",
    )
)

covid = conditions_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=codes,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)

Now we can go from the visit_occurrence level to the visit_detail level.

visit_detail_covid = data.visit_detail.merge(
    covid[["visit_occurrence_id"]],
    on="visit_occurrence_id",
    how="inner",
)

Extracting ICU visits

What is left to do is to tag each visit as occurring in an ICU or not. This is achieved with the tag_icu_visit.
Like many functions in eds-scikit, this function exposes an algo argument, allowing you to choose how the tagging is done.
You can check the corresponding documentation to see the availables algos.

visit_detail_covid = tag_icu_visit(
    visit_detail=visit_detail_covid,
    care_site=data.care_site,
    algo="from_authorisation_type",
)
visit_detail_covid = visit_detail_covid.merge(
    diabetes_cohort, on="person_id", how="left"
)

visit_detail_covid["HAS_DIABETE"].fillna(False, inplace=True)
visit_detail_covid["IS_ICU"].fillna(False, inplace=True)

Finishing the analysis

Adding patient's age

We will add the patient's age at each visit_detail:

from eds_scikit.utils import datetime_helpers
visit_detail_covid = visit_detail_covid.merge(data.person[['person_id','birth_datetime']], 
                                              on='person_id', 
                                              how='inner')

visit_detail_covid["age"] = (
    datetime_helpers.substract_datetime(
        visit_detail_covid["visit_detail_start_datetime"],
        visit_detail_covid["birth_datetime"],
        out="hours",
    )
    / (24 * 365.25)
)

From distributed Koalas to local Pandas

All the computing above was done using Koalas DataFrames, which are distributed.
Now that we limited our cohort to a manageable size, we can switch to Pandas to finish our analysis.

visit_detail_covid_pd = visit_detail_covid[
    ["person_id", "age", "HAS_DIABETE", "IS_ICU"]
].to_pandas()

Grouping by patient

stats = (
    visit_detail_covid_pd[["person_id", "age", "HAS_DIABETE", "IS_ICU"]]
    .groupby("person_id")
    .agg(
        HAS_DIABETE=("HAS_DIABETE", "any"), 
        IS_ICU=("IS_ICU", "any"), 
        age=("age", "min"),
    )
)

Binning the age into intervals

stats["age"] = pd.cut(
    stats.age,
    bins=[0, 40, 50, 60, 70, 120],
    labels=["(0, 40]", "(40, 50]", "(50, 60]", "(60, 70]", "(70, 120]"],
)

Computing the ratio of patients that had an ICU visit

stats = stats.groupby(["age", "HAS_DIABETE"], as_index=False).apply(
    lambda x: x["IS_ICU"].sum() / len(x)
)

stats.columns = ["age", "cohorte", "percent_icu"]

stats["cohorte"] = stats["cohorte"].replace({True: "Diab.", False: "Control"})

Results

We can finally plot our results using Altair:

import altair as alt
bars = (
    alt.Chart(
        stats,
        title=[
            "Percentage of patients who went through ICU during their COVID stay, ",
            "as a function of their age range and diabetic status",
            " ",
        ],
    )
    .mark_bar()
    .encode(
        x=alt.X("cohorte:N", title=""),
        y=alt.Y(
            "percent_icu",
            title="% of patients who went through ICU.",
            axis=alt.Axis(format="%"),
        ),
        color=alt.Color("cohorte:N", title="Cohort"),
        column=alt.Column("age:N", title="Age range"),
    )
)

bars = bars.configure_title(anchor="middle", baseline="bottom")
bars