A gentle demo

import datetime
import pandas as pd

import eds_scikit

spark, sc, sql = eds_scikit.improve_performances()

See the welcome page for an explanation of the eds_scikit.improve_performances() call.

Loading data

Data loading is made easy by using the HiveData object.
Simply give it the name of the database you want to use:

database_name = "MY_DATABASE_NAME"

from eds_scikit.io import HiveData

data = HiveData(
    database_name=database_name,
)

Your tables are now available as Koalas DataFrames. These are essentially Spark DataFrames that expose the Pandas API on top (see the Project description in eds-scikit's documentation for more information).

What we need to extract:

Patients with diabetes
Patients with Covid-19
Visits from those patients, and their ICU/Non-ICU status

Let us import what's necessary from eds-scikit:

from eds_scikit.event import conditions_from_icd10
from eds_scikit.event.diabetes import (
    diabetes_from_icd10,
    DEFAULT_DIABETE_FROM_ICD10_CONFIG,
)
from eds_scikit.icu import tag_icu_visit

DATE_MIN = datetime.datetime(2020, 1, 1)
DATE_MAX = datetime.datetime(2021, 6, 1)

Extracting the diabetic status

Luckily, a function is available to extract diabetic patients from ICD-10:

diabetes = diabetes_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)

We can check the default parameters used here:

DEFAULT_DIABETE_FROM_ICD10_CONFIG

{'additional_filtering': {'condition_status_source_value': {'DP', 'DAS'}},
 'codes': {'DIABETES_INSIPIDUS': {'code_list': ['E232', 'N251'],
                                  'code_type': 'exact'},
           'DIABETES_IN_PREGNANCY': {'code_list': ['O24'],
                                     'code_type': 'prefix'},
           'DIABETES_MALNUTRITION': {'code_list': ['E12'],
                                     'code_type': 'prefix'},
           'DIABETES_TYPE_I': {'code_list': ['E10'], 'code_type': 'prefix'},
           'DIABETES_TYPE_II': {'code_list': ['E11'], 'code_type': 'prefix'},
           'OTHER_DIABETES_MELLITUS': {'code_list': ['E13', 'E14'],
                                       'code_type': 'prefix'}},
 'date_from_visit': True,
 'default_code_type': 'prefix'}
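
To make the 'exact' and 'prefix' values of code_type concrete, here is a minimal pure-Python sketch of how such matching could behave on ICD-10 codes. The matches helper and the sample codes are hypothetical and only illustrate the idea; they are not eds-scikit's implementation.

```python
def matches(code: str, code_list: list[str], code_type: str) -> bool:
    """Hypothetical helper: test one ICD-10 code against a list of codes."""
    if code_type == "exact":
        # The code must be identical to one entry of the list.
        return code in code_list
    if code_type == "prefix":
        # The code only has to start with one entry of the list.
        return any(code.startswith(prefix) for prefix in code_list)
    raise ValueError(f"Unsupported code_type: {code_type!r}")


# As a prefix, "E11" covers every type II diabetes subcode.
type_ii = matches("E1190", ["E11"], "prefix")

# As exact codes, "E232" and "N251" do not cover longer codes.
insipidus_exact = matches("E232", ["E232", "N251"], "exact")
insipidus_longer = matches("E2320", ["E232", "N251"], "exact")
```

Under these assumptions, a prefix entry selects a whole family of subcodes, while an exact entry selects a single code.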

We are only interested in diabetes mellitus, although we extracted other types of diabetes:

diabetes.concept.value_counts()

DIABETES_TYPE_II           117843
DIABETES_TYPE_I             10597
OTHER_DIABETES_MELLITUS      6031
DIABETES_IN_PREGNANCY        2597
DIABETES_INSIPIDUS           1089
DIABETES_MALNUTRITION         199
Name: concept, dtype: int64

We will restrict the types of diabetes used here:

diabetes_cohort = (
    diabetes[
        diabetes.concept.isin(
            {
                "DIABETES_TYPE_I",
                "DIABETES_TYPE_II",
                "OTHER_DIABETES_MELLITUS",
            }
        )
    ]
    .person_id.unique()
    .reset_index()
)
diabetes_cohort.loc[:, "HAS_DIABETE"] = True
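
The same filtering logic can be tried locally on a small Pandas DataFrame. The sketch below uses made-up rows (toy_diabetes is a hypothetical stand-in for diabetes). Note that in plain Pandas, Series.unique() returns a NumPy array, so this sketch uses drop_duplicates() instead to keep a DataFrame.

```python
import pandas as pd

# Toy stand-in for the `diabetes` DataFrame (values are made up).
toy_diabetes = pd.DataFrame(
    {
        "person_id": [1, 1, 2, 3, 4],
        "concept": [
            "DIABETES_TYPE_I",
            "DIABETES_TYPE_I",
            "DIABETES_INSIPIDUS",
            "DIABETES_TYPE_II",
            "DIABETES_IN_PREGNANCY",
        ],
    }
)

mellitus = {"DIABETES_TYPE_I", "DIABETES_TYPE_II", "OTHER_DIABETES_MELLITUS"}

# Keep only diabetes mellitus rows, then one row per patient.
toy_cohort = (
    toy_diabetes.loc[toy_diabetes["concept"].isin(mellitus), ["person_id"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
toy_cohort["HAS_DIABETE"] = True
```

Here patients 2 and 4 are dropped because their only diagnoses fall outside the retained concepts.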

Extracting the Covid status

Using the conditions_from_icd10 function, we will extract visits linked to COVID-19:

codes = dict(
    COVID=dict(
        code_list=r"U071[0145]", 
        code_type="regex",
    )
)

covid = conditions_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=codes,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)
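
To see which codes the pattern used above selects, the regular expression can be checked with Python's re module. This is only a sketch: the candidate codes are made up, and using fullmatch is an assumption, since the library may anchor the pattern differently.

```python
import re

# Same pattern as in the `codes` dictionary above.
pattern = re.compile(r"U071[0145]")

# Made-up candidate codes, used only to illustrate the character class.
candidates = ["U0710", "U0711", "U0712", "U0714", "U0715", "U072"]

# Assumption: the whole code must match the pattern.
matched = [code for code in candidates if pattern.fullmatch(code)]
```

The character class [0145] accepts exactly one final character among 0, 1, 4 and 5, so U0712 and U072 are rejected.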

Now we can go from the visit_occurrence level to the visit_detail level.

visit_detail_covid = data.visit_detail.merge(
    covid[["visit_occurrence_id"]],
    on="visit_occurrence_id",
    how="inner",
)

Extracting ICU visits

What is left to do is to tag each visit as occurring in an ICU or not. This is achieved with the tag_icu_visit function.
Like many functions in eds-scikit, it exposes an algo argument that lets you choose how the tagging is done.
You can check the corresponding documentation to see the available algorithms.

visit_detail_covid = tag_icu_visit(
    visit_detail=visit_detail_covid,
    care_site=data.care_site,
    algo="from_authorisation_type",
)

visit_detail_covid = visit_detail_covid.merge(
    diabetes_cohort, on="person_id", how="left"
)

visit_detail_covid["HAS_DIABETE"].fillna(False, inplace=True)
visit_detail_covid["IS_ICU"].fillna(False, inplace=True)
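
The left merge followed by fillna is what turns "absent from the diabetes cohort" into an explicit False. A small Pandas sketch on made-up data (visits and cohort are hypothetical stand-ins) shows the mechanism:

```python
import pandas as pd

# Made-up visits and diabetes cohort.
visits = pd.DataFrame({"visit_detail_id": [10, 11, 12], "person_id": [1, 2, 3]})
cohort = pd.DataFrame({"person_id": [1, 3], "HAS_DIABETE": [True, True]})

# A left merge keeps every visit, including those of non-diabetic patients.
merged = visits.merge(cohort, on="person_id", how="left")

# Patients absent from the cohort have a missing flag after the merge.
n_missing = int(merged["HAS_DIABETE"].isna().sum())

# Replace the missing flags with False and get a clean boolean column.
merged["HAS_DIABETE"] = merged["HAS_DIABETE"].fillna(False).astype(bool)
```

An inner merge would instead have silently dropped the visit of patient 2, removing the control group from the analysis.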

Finishing the analysis

Adding patient's age

We will add the patient's age at each visit_detail:

from eds_scikit.utils import datetime_helpers

visit_detail_covid = visit_detail_covid.merge(
    data.person[["person_id", "birth_datetime"]],
    on="person_id",
    how="inner",
)

visit_detail_covid["age"] = (
    datetime_helpers.substract_datetime(
        visit_detail_covid["visit_detail_start_datetime"],
        visit_detail_covid["birth_datetime"],
        out="hours",
    )
    / (24 * 365.25)
)
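
The conversion from hours to years can be checked on a couple of made-up dates with plain Pandas. This sketch does not use datetime_helpers; it only reproduces the arithmetic, with 365.25 days as the average length of a year.

```python
import pandas as pd

# Made-up birth and visit dates.
toy = pd.DataFrame(
    {
        "birth_datetime": pd.to_datetime(["1980-01-01", "1950-06-15"]),
        "visit_detail_start_datetime": pd.to_datetime(["2020-01-01", "2020-12-15"]),
    }
)

# Elapsed time between birth and visit, expressed in hours.
elapsed = toy["visit_detail_start_datetime"] - toy["birth_datetime"]
hours = elapsed.dt.total_seconds() / 3600

# 24 hours per day and 365.25 days per year on average.
toy["age"] = hours / (24 * 365.25)
```

The first patient is exactly 40 years old at the visit, and the second is about 70.5.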

From distributed Koalas to local Pandas

All the computation above was done using Koalas DataFrames, which are distributed.
Now that we have limited our cohort to a manageable size, we can switch to Pandas to finish our analysis.

visit_detail_covid_pd = visit_detail_covid[
    ["person_id", "age", "HAS_DIABETE", "IS_ICU"]
].to_pandas()

Grouping by patient

stats = (
    visit_detail_covid_pd[["person_id", "age", "HAS_DIABETE", "IS_ICU"]]
    .groupby("person_id")
    .agg(
        HAS_DIABETE=("HAS_DIABETE", "any"), 
        IS_ICU=("IS_ICU", "any"), 
        age=("age", "min"),
    )
)
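
As an illustration of this aggregation on made-up data: "any" flags a patient as soon as one of their visits is flagged, and "min" keeps the age at the earliest visit.

```python
import pandas as pd

# Made-up visits: patient 1 has two visits, patient 2 has one.
toy_visits = pd.DataFrame(
    {
        "person_id": [1, 1, 2],
        "age": [54.2, 54.9, 37.0],
        "HAS_DIABETE": [True, True, False],
        "IS_ICU": [False, True, False],
    }
)

# One row per patient, using named aggregation.
per_patient = toy_visits.groupby("person_id").agg(
    HAS_DIABETE=("HAS_DIABETE", "any"),
    IS_ICU=("IS_ICU", "any"),
    age=("age", "min"),
)
```

Patient 1 is counted as an ICU patient even though only their second visit took place in an ICU.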

Binning the age into intervals

stats["age"] = pd.cut(
    stats.age,
    bins=[0, 40, 50, 60, 70, 120],
    labels=["(0, 40]", "(40, 50]", "(50, 60]", "(60, 70]", "(70, 120]"],
)
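
By default, pd.cut builds intervals that are closed on the right and open on the left. The sketch below applies the same bins to a few made-up ages to show where boundary values land.

```python
import pandas as pd

# Made-up ages, including boundary values.
ages = pd.Series([0.0, 25.3, 40.0, 40.5, 85.0])

binned = pd.cut(
    ages,
    bins=[0, 40, 50, 60, 70, 120],
    labels=["(0, 40]", "(40, 50]", "(50, 60]", "(60, 70]", "(70, 120]"],
)
```

An age of exactly 40 falls in "(0, 40]", while an age of exactly 0 falls outside every interval and becomes missing. Passing include_lowest=True to pd.cut would keep it in the first bin.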

Computing the ratio of patients that had an ICU visit

stats = stats.groupby(["age", "HAS_DIABETE"], as_index=False).apply(
    lambda x: x["IS_ICU"].sum() / len(x)
)

stats.columns = ["age", "cohorte", "percent_icu"]

stats["cohorte"] = stats["cohorte"].replace({True: "Diab.", False: "Control"})
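
Since IS_ICU is boolean, the ratio x["IS_ICU"].sum() / len(x) is simply the mean of the column within each group. The sketch below shows this equivalent formulation on made-up data; it is an alternative way of writing the step, not the code used in this notebook.

```python
import pandas as pd

# Made-up per-patient rows, with the age already binned.
toy_stats = pd.DataFrame(
    {
        "age": ["(0, 40]", "(0, 40]", "(0, 40]", "(40, 50]"],
        "HAS_DIABETE": [True, True, False, True],
        "IS_ICU": [True, False, False, True],
    }
)

# The mean of a boolean column is the share of True values.
ratio = (
    toy_stats.groupby(["age", "HAS_DIABETE"])["IS_ICU"]
    .mean()
    .reset_index(name="percent_icu")
)
```

Despite its name, percent_icu holds a fraction between 0 and 1, not a percentage.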

Results

stats

          age  cohorte  percent_icu
0     (0, 40]  Control     0.327988
1     (0, 40]    Diab.     0.445578
2    (40, 50]  Control     0.263667
3    (40, 50]    Diab.     0.427203
4    (50, 60]  Control     0.315931
5    (50, 60]    Diab.     0.464736
6    (60, 70]  Control     0.356808
7    (60, 70]    Diab.     0.474766
8   (70, 120]  Control     0.159337
9         ...      ...          ...

We can finally plot our results using Altair:

import altair as alt

bars = (
    alt.Chart(
        stats,
        title=[
            "Percentage of patients who went through ICU during their COVID stay, ",
            "as a function of their age range and diabetic status",
            " ",
        ],
    )
    .mark_bar()
    .encode(
        x=alt.X("cohorte:N", title=""),
        y=alt.Y(
            "percent_icu",
            title="% of patients who went through ICU.",
            axis=alt.Axis(format="%"),
        ),
        color=alt.Color("cohorte:N", title="Cohort"),
        column=alt.Column("age:N", title="Age range"),
    )
)

bars = bars.configure_title(anchor="middle", baseline="bottom")
bars