How to use and develop phenotyping algorithms in eds-scikit
The Phenotype class
Phenotyping is done via the Phenotype class.
Using this class, we can add features that will be stored in the features attribute.
Features are DataFrames containing at least a person_id and a phenotype column. Additionally:
- If phenotyping at the visit level, features contain a visit_occurrence_id column
- If using sub-phenotypes (e.g. types of diabetes, or various cancer localizations), features contain a subphenotype column
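The column requirements above can be summarized in a few lines of plain Python. This is only an illustrative sketch: `expected_columns` and `missing_columns` are hypothetical helpers introduced here, not part of eds-scikit, and they work on column names only rather than on real DataFrames.

```python
def expected_columns(level="patient", with_subphenotype=False):
    """Return the minimal set of columns a feature is expected to contain.

    Hypothetical helper for illustration only (not part of eds-scikit).
    """
    columns = {"person_id", "phenotype"}
    if level == "visit":
        # Visit-level phenotyping also needs the visit identifier
        columns.add("visit_occurrence_id")
    if with_subphenotype:
        columns.add("subphenotype")
    return columns


def missing_columns(feature_columns, level="patient", with_subphenotype=False):
    """List the expected columns that are absent from `feature_columns`."""
    expected = expected_columns(level, with_subphenotype)
    return sorted(expected - set(feature_columns))


# A patient-level feature lacks the visit identifier needed at the visit level
print(missing_columns(["person_id", "phenotype"], level="visit"))
# ['visit_occurrence_id']
```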
We distinguish two main ways of adding features to a Phenotype instance:
- By querying the database to extract raw features
- By aggregating one or multiple existing features
Available phenotypes
eds-scikit is shipped with various phenotyping algorithms. For instance, the CancerFromICD10 class can be used to extract visits or patients with a cancer-related ICD10 code. All those phenotyping algorithms share the same API. We will demonstrate it using the CancerFromICD10 class:
from eds_scikit.io import HiveData
from eds_scikit.phenotype import CancerFromICD10
# DBNAME is a placeholder: replace it with the name of your database
data = HiveData(DBNAME)
cancer = CancerFromICD10(data)
To run the phenotyping algorithm, simply run:
data = cancer.to_data()
This will put the resulting phenotype DataFrame in data.computed["CancerFromICD10"].
Most available phenotypes share the same parameters:
PARAMETER | DESCRIPTION |
---|---|
data | A BaseData object |
cancer_types | Optional list of cancer types to use for phenotyping |
level | On which level to do the aggregation, either "patient" or "visit" |
subphenotype | Whether the threshold should apply to the phenotype ("phenotype" column) or the subphenotype ("subphenotype" column) |
threshold | Minimal number of events (whose definition depends on the level value) |
Please look into each algorithm's documentation for further specific details.
Implement your own phenotyping algorithm
To help you implement your own phenotyping algorithm, the Phenotype class exposes methods to:
- Easily fetch features based on ICD10 and CCAM codes
- Easily aggregate feature(s) using simple threshold rules
The following paragraphs show how to implement a dummy phenotyping algorithm for moderate to terminal Chronic Kidney Disease (CKD). In short, it will:
- Extract patients with an ICD10 code for CKD
- Extract patients with a CCAM code for dialysis or kidney transplant
- Aggregate those two features by keeping patients with both features
We will start by creating an instance of the Phenotype class:
from eds_scikit.phenotype import Phenotype
ckd = Phenotype(data, name="DummyCKD")
Next we define the ICD10 and CCAM codes
Codes formatting
Under the hood, Phenotype will use the conditions_from_icd10 and procedures_from_ccam functions. Check their documentation for details on how to format the provided codes.
icd10_codes = {
"CKD": {"regex": ["N18[345]"]},
}
ccam_codes = {
"dialysis": {"regex": ["JVJB001"]},
"transplant": {"exact": ["JAEA003"]},
}
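To make the difference between the "regex" and "exact" keys used above more concrete, here is a rough standalone sketch of how such a dictionary could be matched against individual codes. It is not the eds-scikit implementation: `matching_concepts` is a hypothetical helper, and anchoring the regular expressions at the start of the code is an assumption made for this example.

```python
import re

icd10_codes = {
    "CKD": {"regex": ["N18[345]"]},
}
ccam_codes = {
    "dialysis": {"regex": ["JVJB001"]},
    "transplant": {"exact": ["JAEA003"]},
}


def matching_concepts(code, code_dict):
    """Return the concepts whose rules match `code` (illustrative only)."""
    matched = []
    for concept, rules in code_dict.items():
        # "exact" rules require strict equality with one of the listed codes
        exact_hit = code in rules.get("exact", [])
        # Assumption: "regex" patterns are anchored at the start of the code
        regex_hit = any(re.match(p, code) for p in rules.get("regex", []))
        if exact_hit or regex_hit:
            matched.append(concept)
    return matched


print(matching_concepts("N184", icd10_codes))    # ['CKD']
print(matching_concepts("N182", icd10_codes))    # []
print(matching_concepts("JAEA003", ccam_codes))  # ['transplant']
```

With the pattern N18[345], only the codes N183, N184 and N185 are retained, which corresponds to the "moderate to terminal" scope of this dummy algorithm.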
Finally, we can start designing the phenotyping algorithm:
Get ICD10 features
ckd = ckd.add_code_feature(
output_feature="icd10",
source="icd10",
codes=icd10_codes,
)
Get CCAM features
ckd = ckd.add_code_feature(
output_feature="ccam",
source="ccam",
codes=ccam_codes,
)
Aggregate those 2 features
ckd = ckd.agg_two_features(
input_feature_1="icd10",
input_feature_2="ccam",
output_feature="CKD",
how="AND",
level="patient",
subphenotype=False,
thresholds=(1, 1),
)
The final phenotype DataFrame can now be added to the data object:
data = ckd.to_data()
It will be available under data.computed.CKD
Available methods on Phenotype:
Base class for phenotyping
PARAMETER | DESCRIPTION |
---|---|
data | A BaseData object |
name | Name of the phenotype. If left to None, the name of the class will be used instead |
add_code_feature
add_code_feature(output_feature: str, codes: dict, source: str = 'icd10', additional_filtering: Optional[dict] = None)
Adds a feature from either ICD10 or CCAM codes
PARAMETER | DESCRIPTION |
---|---|
output_feature | Name of the feature |
codes | Dictionary of codes to provide to the underlying conditions_from_icd10 or procedures_from_ccam function |
source | Either 'icd10' or 'ccam', by default 'icd10' |
additional_filtering | Dictionary passed to the underlying code-extraction function for additional filtering |
RETURNS | DESCRIPTION |
---|---|
Phenotype | The current Phenotype object with an additional feature stored in self.features[output_feature] |
agg_single_feature
agg_single_feature(input_feature: str, output_feature: Optional[str] = None, level: str = 'patient', subphenotype: bool = True, threshold: int = 1) -> Phenotype
Simple aggregation rule on a feature:
- If level="patient", keeps patients with at least threshold visits showing the (sub)phenotype
- If level="visit", keeps visits with at least threshold events (could be ICD10 codes, NLP features, biology, etc.) showing the (sub)phenotype
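The threshold rule described above can be sketched in plain Python as follows. This is a simplified illustration, not the library's code: `agg_single_feature_sketch` is a hypothetical function, events are represented as dictionaries (one per event) instead of DataFrame rows, and the subphenotype option is ignored.

```python
from collections import defaultdict


def agg_single_feature_sketch(events, level="patient", threshold=1):
    """Apply a simple threshold rule (hypothetical sketch, not eds-scikit code).

    `events` holds one dictionary per event, with the keys
    "person_id" and "visit_occurrence_id".
    """
    if level == "patient":
        # Count the distinct visits of each patient
        visits = defaultdict(set)
        for event in events:
            visits[event["person_id"]].add(event["visit_occurrence_id"])
        return {pid for pid, seen in visits.items() if len(seen) >= threshold}
    if level == "visit":
        # Count the events of each visit
        counts = defaultdict(int)
        for event in events:
            counts[event["visit_occurrence_id"]] += 1
        return {vid for vid, n in counts.items() if n >= threshold}
    raise ValueError("level must be 'patient' or 'visit'")


events = [
    {"person_id": 1, "visit_occurrence_id": 10},
    {"person_id": 1, "visit_occurrence_id": 10},
    {"person_id": 1, "visit_occurrence_id": 11},
    {"person_id": 2, "visit_occurrence_id": 20},
]

# Patient 1 has two distinct visits, patient 2 only one
print(sorted(agg_single_feature_sketch(events, level="patient", threshold=2)))  # [1]
# Only visit 10 contains two events
print(sorted(agg_single_feature_sketch(events, level="visit", threshold=2)))  # [10]
```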
PARAMETER | DESCRIPTION |
---|---|
input_feature | Name of the input feature |
output_feature | Name of the output feature. If None, will be set to input_feature + "_agg" |
level | On which level to do the aggregation, either "patient" or "visit" |
subphenotype | Whether the threshold should apply to the phenotype ("phenotype" column) or the subphenotype ("subphenotype" column) |
threshold | Minimal number of events (whose definition depends on the level value) |
RETURNS | DESCRIPTION |
---|---|
Phenotype | The current Phenotype object with an additional feature stored in self.features[output_feature] |
agg_two_features
agg_two_features(input_feature_1: str, input_feature_2: str, output_feature: str = None, how: str = 'AND', level: str = 'patient', subphenotype: bool = True, thresholds: Tuple[int, int] = (1, 1)) -> Phenotype
- If level='patient', keeps a specific patient if
  - At least thresholds[0] visits are found in feature_1 AND/OR
  - At least thresholds[1] visits are found in feature_2
- If level='visit', keeps a specific visit if
  - At least thresholds[0] events are found in feature_1 AND/OR
  - At least thresholds[1] events are found in feature_2
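The AND/OR combination of two thresholds can be illustrated with a short standalone sketch. It assumes the per-patient visit counts of each feature have already been computed. The `keep` function and the counts below are hypothetical and only serve the illustration; they do not reproduce eds-scikit internals.

```python
def keep(count_1, count_2, how="AND", thresholds=(1, 1)):
    """Decide whether a patient (or visit) is kept (hypothetical sketch)."""
    ok_1 = count_1 >= thresholds[0]
    ok_2 = count_2 >= thresholds[1]
    if how == "AND":
        return ok_1 and ok_2
    if how == "OR":
        return ok_1 or ok_2
    raise ValueError("how must be 'AND' or 'OR'")


# Made-up number of visits per patient found in each feature
visits_feature_1 = {"A": 2, "B": 1, "C": 0}
visits_feature_2 = {"A": 1, "B": 0, "C": 3}


def kept_patients(how, thresholds=(1, 1)):
    """Return the sorted list of patients satisfying the rule."""
    return sorted(
        patient
        for patient in visits_feature_1
        if keep(visits_feature_1[patient], visits_feature_2[patient], how, thresholds)
    )


print(kept_patients("AND"))  # ['A']
print(kept_patients("OR"))   # ['A', 'B', 'C']
```

With how="AND" and thresholds=(1, 1), as in the CKD example, a patient is kept only if at least one visit is found in each of the two features.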
PARAMETER | DESCRIPTION |
---|---|
input_feature_1 | Name of the first input feature |
input_feature_2 | Name of the second input feature |
output_feature | Name of the output feature. If None, will be set to input_feature + "_agg" |
how | Whether to perform a boolean "AND" or "OR" aggregation |
level | On which level to do the aggregation, either "patient" or "visit" |
subphenotype | Whether the threshold should apply to the phenotype ("phenotype" column) or the subphenotype ("subphenotype" column) |
thresholds | Respective thresholds for the first and second feature |
RETURNS | DESCRIPTION |
---|---|
Phenotype | The current Phenotype object with an additional feature stored in self.features[output_feature] |
compute
compute(**kwargs)
Fetch all necessary features and perform aggregation
to_data
to_data(key: Optional[str] = None) -> BaseData
Appends the feature found in self.features[key] to the data object. If no key is provided, uses the last added feature
PARAMETER | DESCRIPTION |
---|---|
key | Key of the self.features dictionary |
RETURNS | DESCRIPTION |
---|---|
BaseData | The data object to which the phenotype has been added |
Citation
Most available phenotypes implement an algorithm described in an academic paper. When using such an algorithm, you can get the BibTeX citation of the corresponding paper by calling the cite method. For instance:
cancer.cite()
@article{kempf2022impact,
title={Impact of two waves of Sars-Cov2 outbreak on the number, clinical presentation, care trajectories and survival of patients newly referred for a colorectal cancer: A French multicentric cohort study from a large group of University hospitals},
author={Kempf, Emmanuelle and Priou, Sonia and Lam{\'e}, Guillaume and Daniel, Christel and Bellamine, Ali and Sommacale, Daniele and Belkacemi, Yazid and Bey, Romain and Galula, Gilles and Taright, Namik and others},
journal={International Journal of Cancer},
volume={150},
number={10},
pages={1609--1618},
year={2022},
publisher={Wiley Online Library}
}