
How to use and develop phenotyping algorithms in eds-scikit

The Phenotype class

Phenotyping is done via the Phenotype class.

Using this class, we can add features that will be stored in the features attribute. Features are DataFrames containing at least a person_id and a phenotype column. Additionally:

  • If phenotyping at the visit level, features contains a visit_occurrence_id column
  • If using sub-phenotypes (e.g. types of diabetes, or various cancer localizations), features contains a subphenotype column.
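As a rough illustration of the layout described above, a feature can be pictured as a plain pandas DataFrame. This is only a sketch with made-up values; the DataFrame backend and exact dtypes used by eds-scikit may differ.

```python
import pandas as pd

# Toy feature with made-up values, for illustration only
feature = pd.DataFrame(
    {
        "person_id": [1, 1, 2],
        "visit_occurrence_id": [10, 11, 20],  # present when phenotyping at the visit level
        "phenotype": ["diabetes", "diabetes", "diabetes"],
        "subphenotype": ["type_1", "type_1", "type_2"],  # present when using sub-phenotypes
    }
)

# The two columns every feature is expected to contain
required = {"person_id", "phenotype"}
has_required = required.issubset(feature.columns)
print(has_required)
```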

We distinguish 2 main ways of adding features to a Phenotype instance:

  • By querying the database to extract raw features
  • By aggregating one or multiple existing features

Available phenotypes

eds-scikit ships with various phenotyping algorithms. For instance, the CancerFromICD10 class can be used to extract visits or patients with a cancer-related ICD10 code. All these phenotyping algorithms share the same API. We will demonstrate it using the CancerFromICD10 class.

from eds_scikit.io import HiveData
from eds_scikit.phenotype import CancerFromICD10

# DBNAME is a placeholder for the name of your Hive database
data = HiveData(DBNAME)

cancer = CancerFromICD10(data)

To run the phenotyping algorithm, simply run:

data = cancer.to_data()

This will put the resulting phenotype DataFrame in data.computed["CancerFromICD10"]

Most available phenotypes share the same parameters:

PARAMETER DESCRIPTION
data

A BaseData object

TYPE: BaseData

cancer_types

Optional list of cancer types to use for phenotyping

TYPE: Optional[List[str]] DEFAULT: None

level

On which level to do the aggregation, either "patient" or "visit"

TYPE: str DEFAULT: 'patient'

subphenotype

Whether the threshold should apply to the phenotype ("phenotype" column) or the subphenotype ("subphenotype" column)

TYPE: bool DEFAULT: True

threshold

Minimum number of events (the definition of an event depends on the level value)

TYPE: int DEFAULT: 1

Please look into each algorithm's documentation for further specific details.

Implement your own phenotyping algorithm

To help you implement your own phenotyping algorithm, the Phenotype class exposes methods to:

  • Easily fetch features based on ICD10 and CCAM codes
  • Easily aggregate feature(s) using simple threshold rules

The following paragraphs show how to implement a dummy phenotyping algorithm for moderate to terminal Chronic Kidney Disease (CKD). In short, it will:

  • Extract patients with an ICD10 code for CKD
  • Extract patients with a CCAM code for dialysis or kidney transplant
  • Aggregate those two features by keeping patients with both features

We will start by creating an instance of the Phenotype class:

from eds_scikit.phenotype import Phenotype

ckd = Phenotype(data, name="DummyCKD")

Next we define the ICD10 and CCAM codes

Codes formatting

Under the hood, Phenotype will use the conditions_from_icd10 and procedures_from_ccam functions. Check their documentation for details on how to format the provided codes.

icd10_codes = {
    "CKD": {"regex": ["N18[345]"]},
}

ccam_codes = {
    "dialysis": {"regex": ["JVJB001"]},
    "transplant": {"exact": ["JAEA003"]},
}
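To give an intuition for the "regex" and "exact" keys, here is a minimal pandas sketch. It assumes that "exact" keeps codes equal to one of the listed values and that "regex" keeps codes fully matching one of the patterns; the actual matching rules are defined by conditions_from_icd10 and procedures_from_ccam and may differ. The match_codes helper and the toy code lists are hypothetical.

```python
import pandas as pd


def match_codes(series: pd.Series, spec: dict) -> pd.Series:
    """Hypothetical helper: boolean mask of codes selected by a code spec."""
    mask = pd.Series(False, index=series.index)
    if "exact" in spec:
        # Assumption: "exact" means strict equality with a listed code
        mask |= series.isin(spec["exact"])
    if "regex" in spec:
        # Assumption: "regex" patterns must match the whole code
        pattern = "|".join(f"(?:{p})" for p in spec["regex"])
        mask |= series.str.fullmatch(pattern)
    return mask


# Toy code lists; "ZZZZ999" is a made-up code used as a non-matching example
icd10 = pd.Series(["N181", "N183", "N185", "E119"])
ccam = pd.Series(["JVJB001", "JAEA003", "ZZZZ999"])

ckd_mask = match_codes(icd10, {"regex": ["N18[345]"]})
transplant_mask = match_codes(ccam, {"exact": ["JAEA003"]})

print(icd10[ckd_mask].tolist())
print(ccam[transplant_mask].tolist())
```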

Finally, we can start designing the phenotyping algorithm:

Get ICD10 features

ckd = ckd.add_code_feature(
    output_feature="icd10",
    source="icd10",
    codes=icd10_codes,
)

Get CCAM features

ckd = ckd.add_code_feature(
    output_feature="ccam",
    source="ccam",
    codes=ccam_codes,
)

Aggregate those 2 features

ckd = ckd.agg_two_features(
    input_feature_1="icd10",
    input_feature_2="ccam",
    output_feature="CKD",
    how="AND",
    level="patient",
    subphenotype=False,
    thresholds=(1, 1),
)

The final phenotype DataFrame can now be added to the data object:

data = ckd.to_data()

It will be available under data.computed.CKD

Available methods on Phenotype:

Base class for phenotyping

PARAMETER DESCRIPTION
data

A BaseData object

TYPE: BaseData

name

Name of the phenotype. If left to None, the name of the class will be used instead

TYPE: Optional[str] DEFAULT: None

add_code_feature

add_code_feature(output_feature: str, codes: dict, source: str = 'icd10', additional_filtering: Optional[dict] = None)

Adds a feature from either ICD10 or CCAM codes

PARAMETER DESCRIPTION
output_feature

Name of the feature

TYPE: str

codes

Dictionary of codes to provide to the from_codes function

TYPE: dict

source

Either 'icd10' or 'ccam', by default 'icd10'

TYPE: str DEFAULT: 'icd10'

additional_filtering

Dictionary passed to the from_codes functions for filtering

TYPE: Optional[dict] DEFAULT: None

RETURNS DESCRIPTION
Phenotype

The current Phenotype object with an additional feature stored in self.features[output_feature]

agg_single_feature

agg_single_feature(input_feature: str, output_feature: Optional[str] = None, level: str = 'patient', subphenotype: bool = True, threshold: int = 1) -> Phenotype

Simple aggregation rule on a feature:

  • If level="patient", keeps patients with at least threshold visits showing the (sub)phenotype
  • If level="visit", keeps visits with at least threshold events (could be ICD10 codes, NLP features, biology, etc) showing the (sub)phenotype
PARAMETER DESCRIPTION
input_feature

Name of the input feature

TYPE: str

output_feature

Name of the output feature. If None, will be set to input_feature + "_agg"

TYPE: Optional[str] DEFAULT: None

level

On which level to do the aggregation, either "patient" or "visit"

TYPE: str DEFAULT: 'patient'

subphenotype

Whether the threshold should apply to the phenotype ("phenotype" column) or the subphenotype ("subphenotype" column)

TYPE: bool DEFAULT: True

threshold

Minimum number of events (the definition of an event depends on the level value)

TYPE: int, optional DEFAULT: 1

RETURNS DESCRIPTION
Phenotype

The current Phenotype object with an additional feature stored in self.features[output_feature]
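The threshold rule of agg_single_feature can be pictured with plain pandas. The sketch below is not the eds-scikit implementation: agg_patient and agg_visit are hypothetical helpers, the data is made up, and each row of the toy feature is assumed to be one event.

```python
import pandas as pd

# Toy feature: one row per event (made-up values)
feature = pd.DataFrame(
    {
        "person_id": [1, 1, 1, 2, 3, 3],
        "visit_occurrence_id": [10, 10, 11, 20, 30, 31],
        "phenotype": ["cancer"] * 6,
    }
)


def agg_patient(feature, threshold=1, group_col="phenotype"):
    """Keep patients with at least `threshold` distinct visits showing the phenotype."""
    n_visits = (
        feature.groupby(["person_id", group_col])["visit_occurrence_id"]
        .nunique()
        .reset_index(name="n_visits")
    )
    return n_visits[n_visits["n_visits"] >= threshold].reset_index(drop=True)


def agg_visit(feature, threshold=1, group_col="phenotype"):
    """Keep visits with at least `threshold` events showing the phenotype."""
    n_events = (
        feature.groupby(["person_id", "visit_occurrence_id", group_col])
        .size()
        .reset_index(name="n_events")
    )
    return n_events[n_events["n_events"] >= threshold].reset_index(drop=True)


kept_patients = agg_patient(feature, threshold=2)
kept_visits = agg_visit(feature, threshold=2)

print(kept_patients["person_id"].tolist())
print(kept_visits["visit_occurrence_id"].tolist())
```

Passing group_col="subphenotype" would apply the same threshold per sub-phenotype, which is one way to read the subphenotype parameter.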

agg_two_features

agg_two_features(input_feature_1: str, input_feature_2: str, output_feature: str = None, how: str = 'AND', level: str = 'patient', subphenotype: bool = True, thresholds: Tuple[int, int] = (1, 1)) -> Phenotype
  • If level='patient', keeps a specific patient if

    • At least thresholds[0] visits are found in feature_1 AND/OR
    • At least thresholds[1] visits are found in feature_2
  • If level='visit', keeps a specific visit if

    • At least thresholds[0] events are found in feature_1 AND/OR
    • At least thresholds[1] events are found in feature_2
PARAMETER DESCRIPTION
input_feature_1

Name of the first input feature

TYPE: str

input_feature_2

Name of the second input feature

TYPE: str

output_feature

Name of the output feature. If None, will be set to input_feature + "_agg"

TYPE: str DEFAULT: None

how

Whether to perform a boolean "AND" or "OR" aggregation

TYPE: str, optional DEFAULT: 'AND'

level

On which level to do the aggregation, either "patient" or "visit"

TYPE: str DEFAULT: 'patient'

subphenotype

Whether the threshold should apply to the phenotype ("phenotype" column) or the subphenotype ("subphenotype" column)

TYPE: bool DEFAULT: True

thresholds

Respective thresholds for the first and second features

TYPE: Tuple[int, int], optional DEFAULT: (1, 1)

RETURNS DESCRIPTION
Phenotype

The current Phenotype object with an additional feature stored in self.features[output_feature]
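At the patient level, the AND/OR rule of agg_two_features amounts to intersecting or uniting two sets of patients, each filtered by its own threshold. The following sketch illustrates this with pandas and Python sets; the patients_over_threshold and combine helpers are hypothetical and the data is made up, so treat it as an intuition aid, not as the library's code.

```python
import pandas as pd


def patients_over_threshold(feature, threshold):
    """Patients with at least `threshold` distinct visits in the feature."""
    n_visits = feature.groupby("person_id")["visit_occurrence_id"].nunique()
    return {int(p) for p in n_visits[n_visits >= threshold].index}


def combine(feature_1, feature_2, how="AND", thresholds=(1, 1)):
    """Combine two features at the patient level with a boolean rule."""
    p1 = patients_over_threshold(feature_1, thresholds[0])
    p2 = patients_over_threshold(feature_2, thresholds[1])
    if how == "AND":
        return sorted(p1 & p2)
    if how == "OR":
        return sorted(p1 | p2)
    raise ValueError("how must be 'AND' or 'OR'")


# Toy features with made-up values
icd10 = pd.DataFrame({"person_id": [1, 2, 2], "visit_occurrence_id": [10, 20, 21]})
ccam = pd.DataFrame({"person_id": [2, 3], "visit_occurrence_id": [20, 30]})

both = combine(icd10, ccam, how="AND")
either = combine(icd10, ccam, how="OR")
stricter = combine(icd10, ccam, how="OR", thresholds=(2, 1))

print(both, either, stricter)
```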

compute

compute(**kwargs)

Fetch all necessary features and perform aggregation

to_data

to_data(key: Optional[str] = None) -> BaseData

Appends the feature found in self.features[key] to the data object. If no key is provided, uses the last added feature

PARAMETER DESCRIPTION
key

Key of the self.features dictionary

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
BaseData

The data object with phenotype added to data.computed
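The "last added feature" fallback can be pictured with an ordinary Python dictionary, which preserves insertion order. This is a sketch of the idea only: resolve_key is a hypothetical helper, and the container actually used for self.features may be implemented differently.

```python
# Stand-in for self.features: keys are feature names, values would be DataFrames
features = {}
features["icd10"] = "icd10 feature"
features["ccam"] = "ccam feature"
features["CKD"] = "aggregated feature"


def resolve_key(features, key=None):
    """Return `key` if given, otherwise the most recently added feature name."""
    if key is None:
        # Dictionaries keep insertion order, so the last key is the last added feature
        key = next(reversed(features))
    return key


print(resolve_key(features))
print(resolve_key(features, key="icd10"))
```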

Citation

Most available phenotypes implement an algorithm described in an academic paper. When using such an algorithm, you can get the BibTeX citation of the corresponding paper by calling the cite method. For instance:

cancer.cite()
@article{kempf2022impact,
  title={Impact of two waves of Sars-Cov2 outbreak on the number, clinical presentation, care trajectories and survival of patients newly referred for a colorectal cancer: A French multicentric cohort study from a large group of University hospitals},
  author={Kempf, Emmanuelle and Priou, Sonia and Lam{\'e}, Guillaume and Daniel, Christel and Bellamine, Ali and Sommacale, Daniele and Belkacemi, Yazid and Bey, Romain and Galula, Gilles and Taright, Namik and others},
  journal={International Journal of Cancer},
  volume={150},
  number={10},
  pages={1609--1618},
  year={2022},
  publisher={Wiley Online Library}
}