Saving small cohorts locally
Introduction
The goal of this small notebook is to show you how to:
- Work on a large cohort while staying distributed
- Do some phenotyping to select a small subcohort
- Save this subcohort locally to work on it later
As a toy example, we will select patients who underwent a heart transplantation. The selection will be performed using both the ICD-10 and CCAM terminologies.
Data Loading
import eds_scikit

# Set up a Spark session tuned by eds-scikit and retrieve handles to it
spark, sc, sql = eds_scikit.improve_performances()
DBNAME = "YOUR_DATABASE_NAME"

from eds_scikit.io.hive import HiveData

# Load the data from Hive; tables stay distributed
data = HiveData(DBNAME)
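As a quick sanity check, you can list the OMOP tables that eds-scikit found in the database, via the available_tables attribute provided by HiveData:

# List the tables exposed by this database
data.available_tables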
Phenotyping
from eds_scikit.event.ccam import procedures_from_ccam
from eds_scikit.event.icd10 import conditions_from_icd10
CCAM = dict(
    HEART_TRANSPLANT=dict(
        prefix="DZEA00",  # keep all CCAM codes starting with this prefix
    )
)
ICD10 = dict(
    HEART_TRANSPLANT=dict(
        exact="Z941",  # keep this exact ICD-10 code (heart transplant status)
    )
)
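The two dictionaries use different matching modes: prefix selects every code that starts with the given string, while exact requires strict equality. Here is a tiny plain-Python illustration of the difference (this is not eds-scikit code, and the code value is made up for the example):

# Prefix vs. exact matching, illustrated on a hypothetical code
code = "DZEA002"            # hypothetical code, for illustration only
code.startswith("DZEA00")   # prefix matching -> True
code == "DZEA00"            # exact matching requires full equality -> False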
procedure_occurrence = procedures_from_ccam(
    procedure_occurrence=data.procedure_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=CCAM,
    date_from_visit=True,
)
condition_occurrence = conditions_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=ICD10,
    date_from_visit=True,
    additional_filtering=dict(
        condition_status_source_value={"DP", "DAS"},  # keep main (DP) and associated (DAS) diagnoses only
    ),
)
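We can quickly inspect the result of the phenotyping: the two cells below count the extracted events for each (concept, value) pair, which lets us check that the expected codes were indeed found.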
procedure_occurrence.groupby(["concept", "value"]).size()
condition_occurrence.groupby(["concept", "value"]).size()
Saving to disk
cohort = set(
    procedure_occurrence.person_id.to_list()
    + condition_occurrence.person_id.to_list()
)
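A patient enters the cohort if they were retrieved through either terminology, hence the union above. As an optional extra check (not needed for the rest of this notebook), you can also look at how the two terminologies overlap:

# Compare the patients found via procedures (CCAM) and via diagnoses (ICD-10)
procedure_ids = set(procedure_occurrence.person_id.to_list())
condition_ids = set(condition_occurrence.person_id.to_list())
len(procedure_ids), len(condition_ids), len(procedure_ids & condition_ids)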
We can check that our cohort is indeed small and can be stored locally without any concern:
len(cohort)
And we can also compute a very crude prevalence of heart transplant in our database:
f"{100 * len(cohort)/len(set(data.procedure_occurrence.person_id.to_list() + data.condition_occurrence.person_id.to_list())):.5f} %"
Finally, let us save the tables we need locally.
Under the hood, eds-scikit will only keep the rows corresponding to the provided cohort.
import os
folder = os.path.abspath("./heart_transplant_cohort")
tables_to_save = [
    "person",
    "visit_detail",
    "visit_occurrence",
    "procedure_occurrence",
    "condition_occurrence",
]
data.persist_tables_to_folder(
    folder,
    tables=tables_to_save,
    person_ids=cohort,
)
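If you are curious, you can list what was written: each requested table is persisted as its own file inside the folder (the exact file format is an internal detail of eds-scikit).

# Inspect the files written for the cohort
os.listdir(folder)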
Using the saved cohort
Now that our cohort is saved locally, it can be accessed directly by using the PandasData class. It is akin to the HiveData class, except that the loaded tables will be stored directly as Pandas DataFrames, allowing for faster and easier analysis.
from eds_scikit.io.files import PandasData
data = PandasData(folder)
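Since the tables are now plain in-memory Pandas objects, a quick check such as the one below should confirm it:

# Tables loaded through PandasData are regular pandas DataFrames
type(data.person)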
As a sanity check, let us display the number of patients in our saved cohort (we are expecting 30):
cohort = data.person.person_id.to_list()
len(cohort)
And the crude prevalence, which should now be 100% since every saved table was restricted to our cohort!
f"{100 * len(cohort)/len(set(data.procedure_occurrence.person_id.to_list() + data.condition_occurrence.person_id.to_list())):.5f} %"