How to use and develop phenotyping algorithms in `eds-scikit`

The `Phenotype` class

Phenotyping is done via the Phenotype class.

Using this class, we can add features that will be stored in the features attribute. Features are DataFrames containing at least a person_id and a phenotype column. Additionally:

If phenotyping at the visit level, features contains a visit_occurrence_id column
If using sub-phenotypes (e.g. types of diabetes, or various cancer localizations), features contains a subphenotype column.
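As a rough illustration of the schema described above, the sketch below lists the columns a feature is expected to carry. The helper `expected_feature_columns` is hypothetical and not part of `eds-scikit`; it only restates the rules given above.

```python
from typing import List


def expected_feature_columns(level: str = "patient", with_subphenotype: bool = False) -> List[str]:
    """Hypothetical helper (not an eds-scikit function) restating the schema rules."""
    if level not in ("patient", "visit"):
        raise ValueError("level must be 'patient' or 'visit'")
    columns = ["person_id", "phenotype"]  # always required
    if level == "visit":
        columns.append("visit_occurrence_id")  # needed for visit-level phenotyping
    if with_subphenotype:
        columns.append("subphenotype")  # e.g. a type of diabetes
    return columns


print(expected_feature_columns(level="visit", with_subphenotype=True))
```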

We distinguish 2 main ways of adding features to a Phenotype instance:

By querying the database to extract raw features
By aggregating one or multiple existing features

Available phenotypes

eds-scikit is shipped with various phenotyping algorithms. For instance, the CancerFromICD10 class can be used to extract visits or patients with a cancer-related ICD10 code. All those phenotyping algorithms share the same API, which we will demonstrate using the CancerFromICD10 class.

from eds_scikit.io import HiveData
data = HiveData(DBNAME)

from eds_scikit.phenotype import CancerFromICD10

cancer = CancerFromICD10(data)

To run the phenotyping algorithm, simply run:

data = cancer.to_data()

This will put the resulting phenotype DataFrame in data.computed["CancerFromICD10"]

Most available phenotypes share the same parameters:

PARAMETER	DESCRIPTION
`data`	A BaseData object TYPE: `BaseData`
`cancer_types`	Optional list of cancer types to use for phenotyping TYPE: `Optional[List[str]]` DEFAULT: `None`
`level`	On which level to do the aggregation, either "patient" or "visit" TYPE: `str` DEFAULT: `'patient'`
`subphenotype`	Whether the threshold should apply to the phenotype ("phenotype" column) or the subphenotype ("subphenotype" column) TYPE: `bool` DEFAULT: `True`
`threshold`	Minimum number of events (whose definition depends on the `level` value) TYPE: `int` DEFAULT: `1`
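The parameters in the table above are passed when the phenotype is instantiated. A minimal sketch follows, assuming a `data` object like the one created earlier. The values `level="visit"` and `threshold=2` are arbitrary illustrations, and `cancer_at_visit_level` is a hypothetical wrapper, not a library function.

```python
def cancer_at_visit_level(data):
    """Sketch: run CancerFromICD10 with non-default parameters from the table."""
    # Imported inside the function so that defining it does not require eds-scikit.
    from eds_scikit.phenotype import CancerFromICD10

    cancer = CancerFromICD10(
        data,
        level="visit",       # aggregate per visit instead of per patient
        subphenotype=False,  # apply the threshold to the "phenotype" column
        threshold=2,         # require at least 2 events per visit
    )
    # Returns the data object with the phenotype added to data.computed.
    return cancer.to_data()
```

Calling `cancer_at_visit_level(data)` requires a working `eds-scikit` installation and database access, so it is not executed here.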

Please look into each algorithm's documentation for further specific details.

Implement your own phenotyping algorithm

To help you implement your own phenotyping algorithm, the Phenotype class exposes methods to:

Easily fetch features based on ICD10 and CCAM codes
Easily aggregate feature(s) using simple threshold rules

The following section shows how to implement a dummy phenotyping algorithm for moderate to terminal Chronic Kidney Disease (CKD). In short, it will:

- Extract patients with an ICD10 code for CKD
- Extract patients with a CCAM code for dialysis or kidney transplant
- Aggregate those two features by keeping patients with both features

We will start by creating an instance of the Phenotype class:

from eds_scikit.phenotype import Phenotype

ckd = Phenotype(data, name="DummyCKD")

Next we define the ICD10 and CCAM codes

Codes formatting

Under the hood, Phenotype will use the conditions_from_icd10 and procedures_from_ccam functions. Check their documentation for details on how to format the provided codes

icd10_codes = {
    "CKD": {"regex": ["N18[345]"]},
}

ccam_codes = {
    "dialysis": {"regex": ["JVJB001"]},
    "transplant": {"exact": ["JAEA003"]},
}
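To make the role of the `"regex"` and `"exact"` keys more concrete, here is a plain-Python sketch of how such a dictionary could be matched against a single code. It does not call `eds-scikit`: `matching_concepts` is a hypothetical helper, and requiring the regex to match the whole code is an assumption. The documentation of `conditions_from_icd10` and `procedures_from_ccam` remains the reference for the actual matching rules.

```python
import re
from typing import Dict, List

icd10_codes = {"CKD": {"regex": ["N18[345]"]}}
ccam_codes = {
    "dialysis": {"regex": ["JVJB001"]},
    "transplant": {"exact": ["JAEA003"]},
}


def matching_concepts(code: str, codes: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Return the concept names whose patterns match `code` (illustrative only)."""
    matched = []
    for concept, spec in codes.items():
        exact_hit = code in spec.get("exact", [])
        # Assumption: a regex must match the whole code; the library may anchor differently.
        regex_hit = any(re.fullmatch(pattern, code) for pattern in spec.get("regex", []))
        if exact_hit or regex_hit:
            matched.append(concept)
    return matched


print(matching_concepts("N184", icd10_codes))    # ['CKD']
print(matching_concepts("N181", icd10_codes))    # []
print(matching_concepts("JAEA003", ccam_codes))  # ['transplant']
```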

Finally, we can start designing the phenotyping algorithm:

Get ICD10 features

ckd = ckd.add_code_feature(
    output_feature="icd10",
    source="icd10",
    codes=icd10_codes,
)

Get CCAM features

ckd = ckd.add_code_feature(
    output_feature="ccam",
    source="ccam",
    codes=ccam_codes,
)

Aggregate those 2 features

ckd = ckd.agg_two_features(
    input_feature_1="icd10",
    input_feature_2="ccam",
    output_feature="CKD",
    how="AND",
    level="patient",
    subphenotype=False,
    thresholds=(1, 1),
)

The final phenotype DataFrame can now be added to the data object:

data = ckd.to_data()

It will be available under data.computed.CKD

Available methods on `Phenotype`:

Base class for phenotyping

PARAMETER	DESCRIPTION
`data`	A BaseData object TYPE: `BaseData`
`name`	Name of the phenotype. If left to None, the name of the class will be used instead TYPE: `Optional[str]` DEFAULT: `None`

add_code_feature

add_code_feature(output_feature: str, codes: dict, source: str = 'icd10', additional_filtering: Optional[dict] = None)

Adds a feature from either ICD10 or CCAM codes

PARAMETER	DESCRIPTION
`output_feature`	Name of the feature TYPE: `str`
`codes`	Dictionary of codes to provide to the `from_codes` function TYPE: `dict`
`source`	Either 'icd10' or 'ccam', by default 'icd10' TYPE: `str` DEFAULT: `'icd10'`
`additional_filtering`	Dictionary passed to the `from_codes` functions for filtering TYPE: `Optional[dict]` DEFAULT: `None`

RETURNS	DESCRIPTION
`Phenotype`	The current Phenotype object with an additional feature stored in self.features[output_feature]

agg_single_feature

agg_single_feature(input_feature: str, output_feature: Optional[str] = None, level: str = 'patient', subphenotype: bool = True, threshold: int = 1) -> Phenotype

Simple aggregation rule on a feature:

If level="patient", keeps patients with at least threshold visits showing the (sub)phenotype
If level="visit", keeps visits with at least threshold events (could be ICD10 codes, NLP features, biology, etc) showing the (sub)phenotype
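The threshold rule above can be sketched in plain Python on a toy feature. This only illustrates the counting logic and is not the library's implementation. It assumes one row per event, ignores sub-phenotypes, and the helpers `keep_patients` and `keep_visits` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Toy feature: one row per event (e.g. one ICD10 code found during a visit).
events = [
    {"person_id": 1, "visit_occurrence_id": 10, "phenotype": "CKD"},
    {"person_id": 1, "visit_occurrence_id": 10, "phenotype": "CKD"},
    {"person_id": 1, "visit_occurrence_id": 11, "phenotype": "CKD"},
    {"person_id": 2, "visit_occurrence_id": 20, "phenotype": "CKD"},
]


def keep_patients(rows: List[Dict], threshold: int = 1) -> Set[int]:
    """level="patient": keep patients with at least `threshold` distinct visits."""
    visits = defaultdict(set)
    for row in rows:
        visits[row["person_id"]].add(row["visit_occurrence_id"])
    return {person for person, seen in visits.items() if len(seen) >= threshold}


def keep_visits(rows: List[Dict], threshold: int = 1) -> Set[int]:
    """level="visit": keep visits with at least `threshold` events."""
    counts = defaultdict(int)
    for row in rows:
        counts[row["visit_occurrence_id"]] += 1
    return {visit for visit, n in counts.items() if n >= threshold}


print(sorted(keep_patients(events, threshold=2)))  # [1]
print(sorted(keep_visits(events, threshold=2)))    # [10]
```

With `threshold=2`, patient 1 is kept because two distinct visits show the phenotype, and visit 10 is kept because it carries two events.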

PARAMETER	DESCRIPTION
`input_feature`	Name of the input feature TYPE: `str`
`output_feature`	Name of the output feature. If None, will be set to input_feature + "_agg" TYPE: `Optional[str]` DEFAULT: `None`
`level`	On which level to do the aggregation, either "patient" or "visit" TYPE: `str` DEFAULT: `'patient'`
`subphenotype`	Whether the threshold should apply to the phenotype ("phenotype" column) or the subphenotype ("subphenotype" column) TYPE: `bool` DEFAULT: `True`
`threshold`	Minimum number of events (whose definition depends on the `level` value) TYPE: `int, optional` DEFAULT: `1`

RETURNS	DESCRIPTION
`Phenotype`	The current Phenotype object with an additional feature stored in self.features[output_feature]

agg_two_features

agg_two_features(input_feature_1: str, input_feature_2: str, output_feature: str = None, how: str = 'AND', level: str = 'patient', subphenotype: bool = True, thresholds: Tuple[int, int] = (1, 1)) -> Phenotype

If level='patient', keeps a specific patient if
- At least thresholds[0] visits are found in feature_1 AND/OR
- At least thresholds[1] visits are found in feature_2
If level='visit', keeps a specific visit if
- At least thresholds[0] events are found in feature_1 AND/OR
- At least thresholds[1] events are found in feature_2
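The patient-level rule above amounts to thresholding each feature separately and then intersecting (`"AND"`) or uniting (`"OR"`) the two sets of patients. The sketch below shows this on toy data in plain Python. It is an illustration under that reading, not the library's code, and `patients_reaching` and `combine_two_features` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def patients_reaching(rows: List[Dict], threshold: int) -> Set[int]:
    """Patients with at least `threshold` distinct visits in one feature."""
    visits = defaultdict(set)
    for row in rows:
        visits[row["person_id"]].add(row["visit_occurrence_id"])
    return {person for person, seen in visits.items() if len(seen) >= threshold}


def combine_two_features(
    feature_1: List[Dict],
    feature_2: List[Dict],
    how: str = "AND",
    thresholds: Tuple[int, int] = (1, 1),
) -> Set[int]:
    """Patient-level combination of two features with a boolean rule."""
    kept_1 = patients_reaching(feature_1, thresholds[0])
    kept_2 = patients_reaching(feature_2, thresholds[1])
    if how == "AND":
        return kept_1 & kept_2
    if how == "OR":
        return kept_1 | kept_2
    raise ValueError("how must be 'AND' or 'OR'")


icd10 = [
    {"person_id": 1, "visit_occurrence_id": 10},
    {"person_id": 2, "visit_occurrence_id": 20},
]
ccam = [
    {"person_id": 1, "visit_occurrence_id": 11},
    {"person_id": 3, "visit_occurrence_id": 30},
]

print(sorted(combine_two_features(icd10, ccam, how="AND")))  # [1]
print(sorted(combine_two_features(icd10, ccam, how="OR")))   # [1, 2, 3]
```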

PARAMETER	DESCRIPTION
`input_feature_1`	Name of the first input feature TYPE: `str`
`input_feature_2`	Name of the second input feature TYPE: `str`
`output_feature`	Name of the output feature. If None, will be set to input_feature + "_agg" TYPE: `str` DEFAULT: `None`
`how`	Whether to perform a boolean "AND" or "OR" aggregation TYPE: `str, optional` DEFAULT: `'AND'`
`level`	On which level to do the aggregation, either "patient" or "visit" TYPE: `str` DEFAULT: `'patient'`
`subphenotype`	Whether the threshold should apply to the phenotype ("phenotype" column) or the subphenotype ("subphenotype" column) TYPE: `bool` DEFAULT: `True`
`thresholds`	Respective thresholds for the first and second feature TYPE: `Tuple[int, int], optional` DEFAULT: `(1, 1)`

RETURNS	DESCRIPTION
`Phenotype`	The current Phenotype object with an additional feature stored in self.features[output_feature]

compute

compute(**kwargs)

Fetch all necessary features and perform aggregation
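One way to package a custom algorithm is to subclass `Phenotype` and place the feature steps inside `compute`. The sketch below uses only the methods documented on this page. It assumes that `compute` is the method a subclass is meant to override and that `add_code_feature` and `agg_single_feature` update the object in place, neither of which is confirmed here. `build_dummy_ckd_class` is a hypothetical factory, used so that the snippet can be defined without importing `eds-scikit` up front.

```python
def build_dummy_ckd_class():
    """Return a Phenotype subclass bundling ICD10-based CKD steps (sketch)."""
    # Imported inside the function: eds-scikit is only needed when the class is built.
    from eds_scikit.phenotype import Phenotype

    class DummyCKD(Phenotype):
        def __init__(self, data):
            # name is left to None, so the class name is used.
            super().__init__(data)

        def compute(self, **kwargs):
            # Fetch the raw feature from ICD10 codes.
            self.add_code_feature(
                output_feature="icd10",
                source="icd10",
                codes={"CKD": {"regex": ["N18[345]"]}},
            )
            # Keep patients reaching the default threshold of 1.
            self.agg_single_feature(
                input_feature="icd10",
                level="patient",
                subphenotype=False,
            )

    return DummyCKD
```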

to_data

to_data(key: Optional[str] = None) -> BaseData

Appends the feature found in self.features[key] to the data object. If no key is provided, uses the last added feature

PARAMETER	DESCRIPTION
`key`	Key of the self.features dictionary TYPE: `Optional[str]` DEFAULT: `None`

RETURNS	DESCRIPTION
`BaseData`	The data object with phenotype added to `data.computed`

Citation

Most available phenotypes implement an algorithm described in an academic paper. When using such an algorithm, you can get the BibTeX citation of the corresponding paper by calling the cite method. For instance:

cancer.cite()

@article{kempf2022impact,
  title={Impact of two waves of Sars-Cov2 outbreak on the number, clinical presentation, care trajectories and survival of patients newly referred for a colorectal cancer: A French multicentric cohort study from a large group of University hospitals},
  author={Kempf, Emmanuelle and Priou, Sonia and Lam{\'e}, Guillaume and Daniel, Christel and Bellamine, Ali and Sommacale, Daniele and Belkacemi, Yazid and Bey, Romain and Galula, Gilles and Taright, Namik and others},
  journal={International Journal of Cancer},
  volume={150},
  number={10},
  pages={1609--1618},
  year={2022},
  publisher={Wiley Online Library}
}
