Saving small cohorts locally
Introduction
The goal of this small notebook is to show you how to:
- Work on a large cohort while staying distributed
- Do some phenotyping to select a small subcohort
- Save this subcohort locally to work on it later
As a toy example, we will select patients who underwent a heart transplantation. The selection will be performed using both the ICD-10 and CCAM terminologies.
Data Loading
import eds_scikit

# Set up a Spark session tuned by eds-scikit and retrieve handles to it
spark, sc, sql = eds_scikit.improve_performances()
DBNAME = "YOUR_DATABASE_NAME"

from eds_scikit.io.hive import HiveData

# Load the data from Hive; tables stay distributed
data = HiveData(DBNAME)
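As a quick sanity check, you can list the OMOP tables that eds-scikit found in the database, via the available_tables attribute provided by HiveData:

# List the tables exposed by this database
data.available_tables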
Phenotyping
from eds_scikit.event.ccam import procedures_from_ccam
from eds_scikit.event.icd10 import conditions_from_icd10
CCAM = dict(
    HEART_TRANSPLANT=dict(
        prefix="DZEA00",  # keep all CCAM codes starting with this prefix
    )
)
ICD10 = dict(
    HEART_TRANSPLANT=dict(
        exact="Z941",  # keep this exact ICD-10 code (heart transplant status)
    )
)
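The two dictionaries use different matching modes: prefix selects every code that starts with the given string, while exact requires strict equality. Here is a tiny plain-Python illustration of the difference (this is not eds-scikit code, and the code value is made up for the example):

# Prefix vs. exact matching, illustrated on a hypothetical code
code = "DZEA002"            # hypothetical code, for illustration only
code.startswith("DZEA00")   # prefix matching -> True
code == "DZEA00"            # exact matching requires full equality -> False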
procedure_occurrence = procedures_from_ccam(
    procedure_occurrence=data.procedure_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=CCAM,
    date_from_visit=True,
)
condition_occurrence = conditions_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=ICD10,
    date_from_visit=True,
    additional_filtering=dict(
        condition_status_source_value={"DP", "DAS"},  # keep main (DP) and associated (DAS) diagnoses only
    ),
)
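We can quickly inspect the result of the phenotyping: the two cells below count the extracted events for each (concept, value) pair, which lets us check that the expected codes were indeed found.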
procedure_occurrence.groupby(["concept", "value"]).size()
condition_occurrence.groupby(["concept", "value"]).size()
Saving to disk
cohort = set(
    procedure_occurrence.person_id.to_list()
    + condition_occurrence.person_id.to_list()
)
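A patient enters the cohort if they were retrieved through either terminology, hence the union above. As an optional extra check (not needed for the rest of this notebook), you can also look at how the two terminologies overlap:

# Compare the patients found via procedures (CCAM) and via diagnoses (ICD-10)
procedure_ids = set(procedure_occurrence.person_id.to_list())
condition_ids = set(condition_occurrence.person_id.to_list())
len(procedure_ids), len(condition_ids), len(procedure_ids & condition_ids)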
We can check that our cohort is indeed small and can be stored locally without any concern:
len(cohort)
And we can also compute a very crude prevalence of heart transplant in our database:
f"{100 * len(cohort)/len(set(data.procedure_occurrence.person_id.to_list() + data.condition_occurrence.person_id.to_list())):.5f} %"
Finally, let us save the tables we need locally.
Under the hood, eds-scikit will only keep the rows corresponding to the provided cohort.
import os
folder = os.path.abspath("./heart_transplant_cohort")
tables_to_save = [
    "person",
    "visit_detail",
    "visit_occurrence",
    "procedure_occurrence",
    "condition_occurrence",
]
data.persist_tables_to_folder(
    folder,
    tables=tables_to_save,
    person_ids=cohort,
)
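If you are curious, you can list what was written: each requested table is persisted as its own file inside the folder (the exact file format is an internal detail of eds-scikit).

# Inspect the files written for the cohort
os.listdir(folder)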
Using the saved cohort
Now that our cohort is saved locally, it can be accessed directly by using the PandasData class. It is akin to the HiveData class, except that the loaded tables will be stored directly as Pandas DataFrames, allowing for faster and easier analysis.
from eds_scikit.io.files import PandasData
data = PandasData(folder)
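Since the tables are now plain in-memory Pandas objects, a quick check such as the one below should confirm it:

# Tables loaded through PandasData are regular pandas DataFrames
type(data.person)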
As a sanity check, let us display the number of patients in our saved cohort (we are expecting 30):
cohort = data.person.person_id.to_list()
len(cohort)
And the crude prevalence, which should now be 100% since every saved table was restricted to our cohort!
f"{100 * len(cohort)/len(set(data.procedure_occurrence.person_id.to_list() + data.condition_occurrence.person_id.to_list())):.5f} %"