You can download this notebook directly here
A gentle demo
import datetime
import pandas as pd
import eds_scikit
spark, sc, sql = eds_scikit.improve_performances() #
Loading data
Data loading is made easy by using the HiveData
object.
Simply give it the name of the database you want to use:
database_name = "MY_DATABASE_NAME"
from eds_scikit.io import HiveData
data = HiveData(
database_name="database_name",
)
Now your tables are available as Koalas DataFrames: Those are basically Spark DataFrames which allows for the Pandas API to be used on top (see the Project description of eds-scikit's documentation for more informations.)
What we need to extract:
- Patients with diabetes
- Patients with Covid-19
- Visits from those patients, and their ICU/Non-ICU status
Let us import what's necessary from eds-scikit:
from eds_scikit.event import conditions_from_icd10
from eds_scikit.event.diabetes import (
diabetes_from_icd10,
DEFAULT_DIABETE_FROM_ICD10_CONFIG,
)
from eds_scikit.icu import tag_icu_visit
DATE_MIN = datetime.datetime(2020, 1, 1)
DATE_MAX = datetime.datetime(2021, 6, 1)
Extracting the diabetic status
Luckily, a function is available to extract diabetic patients from ICD-10:
diabetes = diabetes_from_icd10(
condition_occurrence=data.condition_occurrence,
visit_occurrence=data.visit_occurrence,
date_min=DATE_MIN,
date_max=DATE_MAX,
)
We can check the default parameters used here:
DEFAULT_DIABETE_FROM_ICD10_CONFIG
We are only interested in diabetes mellitus, although we extracted other types of diabetes:
diabetes.concept.value_counts()
We will restrict the types of diabetes used here:
diabetes_cohort = (
diabetes[
diabetes.concept.isin(
{
"DIABETES_TYPE_I",
"DIABETES_TYPE_II",
"OTHER_DIABETES_MELLITUS",
}
)
]
.person_id.unique()
.reset_index()
)
diabetes_cohort.loc[:, "HAS_DIABETE"] = True
Extracting the Covid status
Using the conditions_from_icd10
function, we will extract visits linked to COVID-19:
codes = dict(
COVID=dict(
code_list=r"U071[0145]",
code_type="regex",
)
)
covid = conditions_from_icd10(
condition_occurrence=data.condition_occurrence,
visit_occurrence=data.visit_occurrence,
codes=codes,
date_min=DATE_MIN,
date_max=DATE_MAX,
)
Now we can go from the visit_occurrence
level to the visit_detail
level.
visit_detail_covid = data.visit_detail.merge(
covid[["visit_occurrence_id"]],
on="visit_occurrence_id",
how="inner",
)
Extracting ICU visits
What is left to do is to tag each visit as occurring in an ICU or not. This is achieved with the tag_icu_visit
.
Like many functions in eds-scikit, this function exposes an algo
argument, allowing you to choose how the tagging is done.
You can check the corresponding documentation to see the availables algos
.
visit_detail_covid = tag_icu_visit(
visit_detail=visit_detail_covid,
care_site=data.care_site,
algo="from_authorisation_type",
)
visit_detail_covid = visit_detail_covid.merge(
diabetes_cohort, on="person_id", how="left"
)
visit_detail_covid["HAS_DIABETE"].fillna(False, inplace=True)
visit_detail_covid["IS_ICU"].fillna(False, inplace=True)
Finishing the analysis
Adding patient's age
We will add the patient's age at each visit_detail
:
from eds_scikit.utils import datetime_helpers
visit_detail_covid = visit_detail_covid.merge(data.person[['person_id','birth_datetime']],
on='person_id',
how='inner')
visit_detail_covid["age"] = (
datetime_helpers.substract_datetime(
visit_detail_covid["visit_detail_start_datetime"],
visit_detail_covid["birth_datetime"],
out="hours",
)
/ (24 * 365.25)
)
From distributed Koalas to local Pandas
All the computing above was done using Koalas DataFrames, which are distributed.
Now that we limited our cohort to a manageable size, we can switch to Pandas to finish our analysis.
visit_detail_covid_pd = visit_detail_covid[
["person_id", "age", "HAS_DIABETE", "IS_ICU"]
].to_pandas()
Grouping by patient
stats = (
visit_detail_covid_pd[["person_id", "age", "HAS_DIABETE", "IS_ICU"]]
.groupby("person_id")
.agg(
HAS_DIABETE=("HAS_DIABETE", "any"),
IS_ICU=("IS_ICU", "any"),
age=("age", "min"),
)
)
Binning the age into intervals
stats["age"] = pd.cut(
stats.age,
bins=[0, 40, 50, 60, 70, 120],
labels=["(0, 40]", "(40, 50]", "(50, 60]", "(60, 70]", "(70, 120]"],
)
Computing the ratio of patients that had an ICU visit
stats = stats.groupby(["age", "HAS_DIABETE"], as_index=False).apply(
lambda x: x["IS_ICU"].sum() / len(x)
)
stats.columns = ["age", "cohorte", "percent_icu"]
stats["cohorte"] = stats["cohorte"].replace({True: "Diab.", False: "Control"})
Results
stats
We can finally plot our results using Altair:
import altair as alt
bars = (
alt.Chart(
stats,
title=[
"Percentage of patients who went through ICU during their COVID stay, ",
"as a function of their age range and diabetic status",
" ",
],
)
.mark_bar()
.encode(
x=alt.X("cohorte:N", title=""),
y=alt.Y(
"percent_icu",
title="% of patients who went through ICU.",
axis=alt.Axis(format="%"),
),
color=alt.Color("cohorte:N", title="Cohort"),
column=alt.Column("age:N", title="Age range"),
)
)
bars = bars.configure_title(anchor="middle", baseline="bottom")
bars