
Tutorial

This tutorial takes you through the entire workflow of the Biology module.

%load_ext autoreload
%autoreload 2

import eds_scikit
import pandas as pd

spark, sc, sql = eds_scikit.improve_performances() # (1)

(1) See the welcome page for an explanation of this line.

1. Load Data

First, you need to load your data. As detailed in the dedicated section, eds-scikit expects Pandas or Koalas DataFrames. We provide several connectors to facilitate data fetching: a Hive connector, a Postgres connector and a Pandas connector.

This tutorial uses the Hive connector.

from eds_scikit.io import HiveData

data = HiveData(
    database_name="cse_XXX",
    tables_to_load=[
        "care_site",
        "concept",
        "concept_relationship",
        "measurement",
        "visit_occurrence",
    ],
)

Number of unique patients: 100000

2. Define your concepts-sets

In order to work on the measurements of interest, you can extract a list of concepts-sets by:

Selecting default concepts-sets provided in the library.
Modifying the codes of a selected default concepts-set.
Creating a concepts-set from scratch.

This tutorial merges the codes of two default concepts-sets into a single concepts-set and adds a custom concepts-set.

from eds_scikit.biology import ConceptsSet


protein_blood = ConceptsSet("Protein_Blood_Quantitative")
protein_urine = ConceptsSet("Protein_Urine_Quantitative")
protein = ConceptsSet(
    name="Protein_Quantitative",
    concept_codes=protein_blood.concept_codes + protein_urine.concept_codes,
)

custom_entity = ConceptsSet(
    name="Custom_entity", concept_codes=["G6616", "I2013", "C2102"]
)

concepts_sets = [
    protein,
    custom_entity,
]

3. Define the configuration

The configuration file does three things:

Remove outliers
Remove unwanted codes
Normalize units

3.1 The default configuration

A default configuration is available when working on APHP's CDW. You can access it via:

from eds_scikit.resources import registry

biology_config = registry.get("data", "get_biology_config.all_aphp")()

3.2 Create your own configuration (OPTIONAL)

If this default configuration file does not meet your requirements, you can follow this tutorial to create your own configuration file.
As a reminder, a configuration file is a CSV table in which each row corresponds to a given standard concept_code and a given unit. Each row gives a maximum and a minimum threshold used to flag outliers, and a unit conversion coefficient used to normalize units if needed.
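To make the role of each column concrete, here is a minimal pandas sketch of how such a table could be applied to measurements. It is not the library's implementation: the column names, the toy values, and the choice to compare thresholds against the converted value are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical configuration table: one row per (concept_code, unit) pair.
# Column names are illustrative, not the library's exact schema.
config = pd.DataFrame(
    {
        "concept_code": ["A0001", "A0001"],
        "unit_source_value": ["g/l", "mg/l"],
        "coefficient": [1.0, 0.001],      # converts each source unit to g/l
        "min_threshold": [20.0, 20.0],    # assumed to be in the normalized unit
        "max_threshold": [110.0, 110.0],
    }
)

# Toy measurements recorded in two different units.
measurements = pd.DataFrame(
    {
        "concept_code": ["A0001", "A0001", "A0001"],
        "unit_source_value": ["g/l", "mg/l", "g/l"],
        "value": [72.0, 65000.0, 250.0],
    }
)

# Attach the matching configuration row to each measurement.
merged = measurements.merge(
    config, on=["concept_code", "unit_source_value"], how="left"
)

# Normalize the unit, then flag values outside [min_threshold, max_threshold].
merged["transformed_value"] = merged["value"] * merged["coefficient"]
merged["outlier"] = ~merged["transformed_value"].between(
    merged["min_threshold"], merged["max_threshold"]
)

print(merged[["value", "unit_source_value", "transformed_value", "outlier"]])
```

In this toy example only the third measurement (250 g/l) falls outside the thresholds and is flagged.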

3.2.1 Plot statistical summary

The first step is to compute the statistical summary of each concepts-set with the function plot_biology_summary(stats_only=True).

from eds_scikit.biology import plot_biology_summary

start_date = "2017-01-01"
end_date = "2022-01-01"

plot_biology_summary(
    data,
    concepts_sets=concepts_sets,
    start_date=start_date,
    end_date=end_date,
    stats_only=True,
)

By default, the data will be saved in the Biology_summary folder.

Each ConceptsSet will have its own folder. Here, we used stats_only=True, so:

No graphical dashboard will be generated
Data will not be stratified by care site

Let us display the results for the protein-related ConceptsSet:

pd.read_csv("./Biology_summary/Protein_Quantitative/stats_summary.csv")

	LOINC_concept_code	AnaBio_concept_code	LOINC_concept_name	AnaBio_concept_name	unit_source_value	count	mean	std	min	25%	50%	75%	max	MAD	max_threshold	min_threshold
0	2885-2	A0249	Prot SerPl-mCnc	Protéines_Sérum_g/L	g/l	6021	77.286	8.321	24.819	65.504	61.279	85.818	104.826	8.924	103.919	23.073
1	2885-2	A0250	Prot SerPl-mCnc	Protéines_Sérum_Electrophorèse_g/L	g/l	1176	59.705	7.609	24.735	47.535	84.605	90.445	137.543	7.131	91.838	32.455
2	2885-2	A7347	Prot SerPl-mCnc	Protéines_Plasma_g/L	g/l	12421	51.113	8.548	22.551	63.876	58.160	77.023	95.262	8.170	86.654	33.378
3	2885-2	B9417	Prot SerPl-mCnc	Protéines_Sérum_Colorimétrie_g/L	g/l	601	56.906	12.196	32.205	55.820	56.610	69.690	79.671	7.919	121.822	31.160
4	2885-2	C9874	Prot SerPl-mCnc	Protéines_Sérum_Electrophorèse 2_g/L	g/l	169	54.237	6.402	54.820	51.428	76.413	74.323	84.257	8.145	124.186	34.603
5	2885-2	D0058	Prot SerPl-mCnc	Protéines Après dialyse_Sérum/Plasma_g/L	g/l	51	64.920	4.699	52.023	71.595	61.444	78.434	76.351	4.502	73.379	39.551
6	2885-2	F2624	Prot SerPl-mCnc	Protéines Pédiatrique_Sérum/Plasma_g/L	g/l	3	58.934	11.768	45.364	40.882	54.139	59.366	84.880	11.952	77.996	5.854
7	2885-2	F5122	Prot SerPl-mCnc	Protéines Duplication A7347_Plasma_g/L	g/l	213	80.395	6.134	40.129	69.549	66.730	85.024	110.905	8.824	113.764	38.456
8	2888-6	A1694	Protéines [Masse/Volume] Urine - Numérique	Protéines_Urines 24h_g/L	g/l	193	2.343	4.262	0.063	0.089	0.257	1.620	52.679	0.162	1.275	0.000
9	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

If you prefer, an HTML table is also generated along with the CSV (same name, but with a .html extension).

3.2.2 Create configuration from statistical summary

Then, you can use the function create_config_from_stats() to pre-fill the configuration file with max_threshold and min_threshold. The thresholds computation is based on the Median Absolute Deviation (MAD) Methodology¹.

from eds_scikit.biology.utils.config import create_config_from_stats

config_name = "my_custom_config"

create_config_from_stats(
    concepts_sets=concepts_sets,
    config_name=config_name,
)
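For intuition, a MAD-based threshold can be sketched as the median plus or minus a multiple of the median absolute deviation. The helper below is hypothetical: the function name, the multiplier k=4.0 and the absence of any scaling constant are assumptions, and the formula used by create_config_from_stats() may differ.

```python
import numpy as np


def mad_thresholds(values, k=4.0):
    """Return (min_threshold, max_threshold) as median -/+ k * MAD.

    Hypothetical helper: k is an arbitrary illustrative multiplier.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    # MAD: median of the absolute deviations from the median.
    mad = np.median(np.abs(values - median))
    return median - k * mad, median + k * mad


values = [60.0, 62.0, 65.0, 67.0, 70.0, 150.0]
low, high = mad_thresholds(values)

# Values outside [low, high] would be flagged as outliers.
outliers = [v for v in values if not low <= v <= high]
print(low, high, outliers)
```

Because both the median and the MAD are robust to extreme values, the single measurement at 150 barely moves the thresholds, which is why this kind of rule is often preferred over mean and standard deviation for outlier flagging.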

3.2.3 Edit units manually

The transformed_unit column is pre-filled with the unit that corresponds to the most measurements. When a unit_source_value differs from the transformed_unit, it probably means that the concept's unit needs to be normalized.

To normalize the unit of a concept, manually fill in the Action column with Transform and the Coefficient column with the unit conversion factor.
If you consider the concept irrelevant, fill in the Action column with Delete; the measurements corresponding to the concept will be deleted.
If the unit_source_value matches the transformed_unit, you can leave the Action and Coefficient columns empty.
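If you would rather script these edits than fill them in by hand, the rules above can be sketched with pandas on a toy table. The Action, Coefficient, unit_source_value and transformed_unit columns come from this tutorial; the concept codes, the other column name and the table itself are made up for illustration, and in practice you would load and save your own configuration CSV.

```python
import pandas as pd

# Toy stand-in for a pre-filled configuration table (codes are made up).
config = pd.DataFrame(
    {
        "AnaBio_concept_code": ["X0001", "X0002", "X0003"],
        "unit_source_value": ["mg/l", "g/l", "g/l"],
        "transformed_unit": ["g/l", "g/l", "g/l"],
        "Action": [None, None, None],
        "Coefficient": [None, None, None],
    }
)

# Rows whose source unit differs from the target unit need a conversion.
needs_conversion = config["unit_source_value"] != config["transformed_unit"]
config.loc[needs_conversion, "Action"] = "Transform"
# Only one row needs converting here, so a single factor is enough (mg/l -> g/l).
config.loc[needs_conversion, "Coefficient"] = 0.001

# Mark a concept considered irrelevant so that its measurements are removed.
config.loc[config["AnaBio_concept_code"] == "X0003", "Action"] = "Delete"

print(config)
```

Rows that already use the target unit (X0002 here) keep empty Action and Coefficient values.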

3.2.4 Use your custom configuration

Once you have created your configuration (for instance under the name config_name="my_custom_config"), you can provide it to the relevant functions (see below).

You can also check the configuration file directly:

from eds_scikit.resources import registry
config = registry.get("data", "biology_config.my_custom_config")()

4. Clean the data

Now you can use the bioclean() function with your custom configuration or the default configuration to clean the measurements.

It will add a bioclean table to your data. For more details, have a look at the dedicated section.

from eds_scikit.biology import bioclean

bioclean(
    data,
    concepts_sets=concepts_sets,
    config_name=config_name, # use config_name="all_aphp" for APHP's default configuration
    start_date=start_date,
    end_date=end_date,
)

See below the columns created by the bioclean() function:

concepts_set	LOINC_concept_code	LOINC_concept_name	AnaBio_concept_code	AnaBio_concept_name	transformed_unit	transformed_value	max_threshold	min_threshold	outlier	value_source_value	unit_source_value
EntityA_Blood_Quantitative	000-0	EntityA #Bld	A0000	EntityA_Blood	x10*9/l	115	190	0	False	115 x10*9/l	x10*9/l
EntityA_Blood_Quantitative	000-1	EntityA_Blood_Vol	A0001	EntityA_Blood_g/l	x10*9/l	220	190	0	True	560 g/l	g/l
EntityB_Blood_Quantitative	001-0	EntityB_Blood	B0000	EntityB_Blood_artery	mmol	0.45	8.548	0.542	True	0.45 mmol	mmol
EntityB_Blood_Quantitative	001-0	EntityB_Blood	B0001	EntityB_Blood_vein	mmol	4.52	8.548	0.542	False	4.52 mmol	mmol
...	...	...	...	...	...	...	...	...	...	...	...
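A common next step is to keep only the measurements that are not flagged as outliers. The sketch below rebuilds a few columns of the sample rows above as a plain pandas DataFrame; it is an illustration only, and in your session the bioclean table may be a Koalas DataFrame attached to data rather than a local pandas object.

```python
import pandas as pd

# A few columns of the sample rows shown above, rebuilt by hand.
bioclean_sample = pd.DataFrame(
    {
        "concepts_set": [
            "EntityA_Blood_Quantitative",
            "EntityA_Blood_Quantitative",
            "EntityB_Blood_Quantitative",
            "EntityB_Blood_Quantitative",
        ],
        "transformed_value": [115.0, 220.0, 0.45, 4.52],
        "max_threshold": [190.0, 190.0, 8.548, 8.548],
        "min_threshold": [0.0, 0.0, 0.542, 0.542],
        "outlier": [False, True, True, False],
    }
)

# Sanity check: the flag matches "value outside [min_threshold, max_threshold]".
recomputed = ~bioclean_sample["transformed_value"].between(
    bioclean_sample["min_threshold"], bioclean_sample["max_threshold"]
)
flags_consistent = bool((recomputed == bioclean_sample["outlier"]).all())

# Keep only the measurements that were not flagged.
clean = bioclean_sample[~bioclean_sample["outlier"]]
print(clean[["concepts_set", "transformed_value"]])
```

Filtering on the outlier column rather than dropping rows upstream keeps the flagged measurements available if you later want to inspect them.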

5. Visualize the statistical summary of clean data

Finally, you can build and save two interactive dashboards and a summary table for each concepts-set. They describe various statistical properties of your clean data.

from eds_scikit.biology import plot_biology_summary

plot_biology_summary(data)


¹ FAIR Health. Application of the MAD (Median Absolute Deviation) methodology to exclude extreme data values in FAIR Health products. 2017. URL: https://s3.amazonaws.com/media2.fairhealth.org/resource/asset/FH%20Methodology%20-%20Median%20Absolute%20Deviation.pdf.