You can download this notebook directly here
A gentle demo
import datetime
import pandas as pd
import eds_scikit
spark, sc, sql = eds_scikit.improve_performances() # (1)
- See the welcome page for an explanation of this line
Loading data
Data loading is made easy by using the HiveData
object.
Simply give it the name of the database you want to use:
database_name = "MY_DATABASE_NAME"
from eds_scikit.io import HiveData
data = HiveData(
database_name="database_name",
)
Now your tables are available as Koalas DataFrames: Those are basically Spark DataFrames which allows for the Pandas API to be used on top (see the Project description of eds-scikit's documentation for more informations.)
What we need to extract:
- Patients with diabetes
- Patients with Covid-19
- Visits from those patients, and their ICU/Non-ICU status
Let us import what's necessary from eds-scikit:
from eds_scikit.event import conditions_from_icd10
from eds_scikit.event.diabetes import (
diabetes_from_icd10,
DEFAULT_DIABETE_FROM_ICD10_CONFIG,
)
from eds_scikit.icu import tag_icu_visit
DATE_MIN = datetime.datetime(2020, 1, 1)
DATE_MAX = datetime.datetime(2021, 6, 1)
Extracting the diabetic status
Luckily, a function is available to extract diabetic patients from ICD-10:
diabetes = diabetes_from_icd10(
condition_occurrence=data.condition_occurrence,
visit_occurrence=data.visit_occurrence,
date_min=DATE_MIN,
date_max=DATE_MAX,
)
We can check the default parameters used here:
DEFAULT_DIABETE_FROM_ICD10_CONFIG
We are only interested in diabetes mellitus, although we extracted other types of diabetes:
diabetes.concept.value_counts()
We will restrict the types of diabetes used here:
diabetes_cohort = (
diabetes[
diabetes.concept.isin(
{
"DIABETES_TYPE_I",
"DIABETES_TYPE_II",
"OTHER_DIABETES_MELLITUS",
}
)
]
.person_id.unique()
.reset_index()
)
diabetes_cohort.loc[:, "HAS_DIABETE"] = True
Extracting the Covid status
Using the conditions_from_icd10
function, we will extract visits linked to COVID-19:
codes = dict(
COVID=dict(
code_list=r"U071[0145]",
code_type="regex",
)
)
covid = conditions_from_icd10(
condition_occurrence=data.condition_occurrence,
visit_occurrence=data.visit_occurrence,
codes=codes,
date_min=DATE_MIN,
date_max=DATE_MAX,
)
Now we can go from the visit_occurrence
level to the visit_detail
level.
visit_detail_covid = data.visit_detail.merge(
covid[["visit_occurrence_id"]],
on="visit_occurrence_id",
how="inner",
)
Extracting ICU visits
What is left to do is to tag each visit as occurring in an ICU or not. This is achieved with the tag_icu_visit
.
Like many functions in eds-scikit, this function exposes an algo
argument, allowing you to choose how the tagging is done.
You can check the corresponding documentation to see the availables algos
.
visit_detail_covid = tag_icu_visit(
visit_detail=visit_detail_covid,
care_site=data.care_site,
algo="from_authorisation_type",
)
visit_detail_covid = visit_detail_covid.merge(
diabetes_cohort, on="person_id", how="left"
)
visit_detail_covid["HAS_DIABETE"].fillna(False, inplace=True)
visit_detail_covid["IS_ICU"].fillna(False, inplace=True)
Finishing the analysis
Adding patient's age
We will add the patient's age at each visit_detail
:
from eds_scikit.utils import datetime_helpers
visit_detail_covid = visit_detail_covid.merge(data.person[['person_id','birth_datetime']],
on='person_id',
how='inner')
visit_detail_covid["age"] = (
datetime_helpers.substract_datetime(
visit_detail_covid["visit_detail_start_datetime"],
visit_detail_covid["birth_datetime"],
out="hours",
)
/ (24 * 365.25)
)
From distributed Koalas to local Pandas
All the computing above was done using Koalas DataFrames, which are distributed.
Now that we limited our cohort to a manageable size, we can switch to Pandas to finish our analysis.
visit_detail_covid_pd = visit_detail_covid[
["person_id", "age", "HAS_DIABETE", "IS_ICU"]
].to_pandas()
Grouping by patient
stats = (
visit_detail_covid_pd[["person_id", "age", "HAS_DIABETE", "IS_ICU"]]
.groupby("person_id")
.agg(
HAS_DIABETE=("HAS_DIABETE", "any"),
IS_ICU=("IS_ICU", "any"),
age=("age", "min"),
)
)
Binning the age into intervals
stats["age"] = pd.cut(
stats.age,
bins=[0, 40, 50, 60, 70, 120],
labels=["(0, 40]", "(40, 50]", "(50, 60]", "(60, 70]", "(70, 120]"],
)
Computing the ratio of patients that had an ICU visit
stats = stats.groupby(["age", "HAS_DIABETE"], as_index=False).apply(
lambda x: x["IS_ICU"].sum() / len(x)
)
stats.columns = ["age", "cohorte", "percent_icu"]
stats["cohorte"] = stats["cohorte"].replace({True: "Diab.", False: "Control"})
Results
stats
We can finally plot our results using Altair:
import altair as alt
bars = (
alt.Chart(
stats,
title=[
"Percentage of patients who went through ICU during their COVID stay, ",
"as a function of their age range and diabetic status",
" ",
],
)
.mark_bar()
.encode(
x=alt.X("cohorte:N", title=""),
y=alt.Y(
"percent_icu",
title="% of patients who went through ICU.",
axis=alt.Axis(format="%"),
),
color=alt.Color("cohorte:N", title="Cohort"),
column=alt.Column("age:N", title="Age range"),
)
)
bars = bars.configure_title(anchor="middle", baseline="bottom")
bars