Tutorial

This tutorial takes you through the entire workflow of the Biology module.

%load_ext autoreload
%autoreload 2
import eds_scikit
import pandas as pd
spark, sc, sql = eds_scikit.improve_performances() # (1)
  1. See the welcome page for an explanation of this line

1. Load Data

First, you need to load your data. As detailed in the dedicated section, eds-scikit expects Pandas or Koalas DataFrames. It provides several connectors to facilitate data fetching: a Hive connector, a Postgres connector and a Pandas connector.

This tutorial uses the Hive connector.

from eds_scikit.io import HiveData

data = HiveData(
    database_name="cse_XXX",
    tables_to_load=[
        "care_site",
        "concept",
        "concept_relationship",
        "measurement",
        "visit_occurrence",
    ],
)
Number of unique patients: 100000

2. Define your concepts-sets

In order to work on the measurements of interest, you can extract a list of concepts-sets by:

  • Selecting default concepts-sets provided in the library.
  • Modifying the codes of a selected default concepts-set.
  • Creating a concepts-set from scratch.

This tutorial merges two default concepts-sets into a single one and adds a custom concepts-set.

from eds_scikit.biology import ConceptsSet


protein_blood = ConceptsSet("Protein_Blood_Quantitative")
protein_urine = ConceptsSet("Protein_Urine_Quantitative")
protein = ConceptsSet(
    name="Protein_Quantitative",
    concept_codes=protein_blood.concept_codes + protein_urine.concept_codes,
)

custom_entity = ConceptsSet(
    name="Custom_entity", concept_codes=["G6616", "I2013", "C2102"]
)

concepts_sets = [
    protein,
    custom_entity,
]

3. Define the configuration

The configuration file does three things:

  • Remove outliers
  • Remove unwanted codes
  • Normalize units

3.1 The default configuration

A default configuration is available when working on APHP's CDW. You can access it via:

from eds_scikit.resources import registry

biology_config = registry.get("data", "get_biology_config.all_aphp")()

3.2 Create your own configuration (OPTIONAL)

If this default configuration file does not meet your requirements, you can follow this tutorial to create your own configuration file.
As a reminder, a configuration file is a CSV table in which each row corresponds to a given standard concept_code and a given unit. Each row gives a minimum and a maximum threshold used to flag outliers, and a unit conversion coefficient used to normalize units if needed.
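To make the role of each configuration row concrete, here is a minimal sketch of how such a table could be applied to measurements with plain Pandas. It is not the library's implementation: the column names (`concept_code`, `value`, `coefficient`, ...) and all the numbers are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical configuration: one row per (concept_code, unit) pair.
config = pd.DataFrame(
    {
        "concept_code": ["A0249", "A0249"],
        "unit_source_value": ["g/l", "mg/l"],
        "min_threshold": [20.0, 20000.0],
        "max_threshold": [110.0, 110000.0],
        "coefficient": [1.0, 0.001],  # converts the source unit to g/l
    }
)

# Hypothetical measurements, expressed in their source unit.
measurements = pd.DataFrame(
    {
        "concept_code": ["A0249", "A0249", "A0249"],
        "unit_source_value": ["g/l", "mg/l", "g/l"],
        "value": [72.0, 68000.0, 250.0],
    }
)

# Attach the matching configuration row to each measurement.
merged = measurements.merge(
    config, on=["concept_code", "unit_source_value"], how="left"
)

# Flag values outside [min_threshold, max_threshold] as outliers.
merged["outlier"] = ~merged["value"].between(
    merged["min_threshold"], merged["max_threshold"]
)

# Normalize every value to the target unit.
merged["normalized_value"] = merged["value"] * merged["coefficient"]

print(merged[["value", "unit_source_value", "outlier", "normalized_value"]])
```

In this sketch the thresholds are expressed in the source unit, so outliers are flagged before conversion. Whether eds-scikit applies the thresholds before or after normalization is not stated here.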

3.2.1 Plot statistical summary

The first step is to compute the statistical summary of each concepts-set with the function plot_biology_summary(stats_only=True).

from eds_scikit.biology import plot_biology_summary

start_date = "2017-01-01"
end_date = "2022-01-01"

plot_biology_summary(
    data,
    concepts_sets=concepts_sets,
    start_date=start_date,
    end_date=end_date,
    stats_only=True,
)

By default, the data will be saved in the Biology_summary folder.

Each ConceptsSet will have its own folder. Here, we used stats_only=True, so:

  • No graphical dashboard will be generated
  • Data will not be stratified by care site

Let us display the results for the protein-related ConceptsSet:

pd.read_csv("./Biology_summary/Protein_Quantitative/stats_summary.csv")
  | LOINC_concept_code | AnaBio_concept_code | LOINC_concept_name | AnaBio_concept_name | unit_source_value | count | mean | std | min | 25% | 50% | 75% | max | MAD | max_threshold | min_threshold
0 | 2885-2 | A0249 | Prot SerPl-mCnc | Protéines_Sérum_g/L | g/l | 6021 | 77.286 | 8.321 | 24.819 | 65.504 | 61.279 | 85.818 | 104.826 | 8.924 | 103.919 | 23.073
1 | 2885-2 | A0250 | Prot SerPl-mCnc | Protéines_Sérum_Electrophorèse_g/L | g/l | 1176 | 59.705 | 7.609 | 24.735 | 47.535 | 84.605 | 90.445 | 137.543 | 7.131 | 91.838 | 32.455
2 | 2885-2 | A7347 | Prot SerPl-mCnc | Protéines_Plasma_g/L | g/l | 12421 | 51.113 | 8.548 | 22.551 | 63.876 | 58.160 | 77.023 | 95.262 | 8.170 | 86.654 | 33.378
3 | 2885-2 | B9417 | Prot SerPl-mCnc | Protéines_Sérum_Colorimétrie_g/L | g/l | 601 | 56.906 | 12.196 | 32.205 | 55.820 | 56.610 | 69.690 | 79.671 | 7.919 | 121.822 | 31.160
4 | 2885-2 | C9874 | Prot SerPl-mCnc | Protéines_Sérum_Electrophorèse 2_g/L | g/l | 169 | 54.237 | 6.402 | 54.820 | 51.428 | 76.413 | 74.323 | 84.257 | 8.145 | 124.186 | 34.603
5 | 2885-2 | D0058 | Prot SerPl-mCnc | Protéines Après dialyse_Sérum/Plasma_g/L | g/l | 51 | 64.920 | 4.699 | 52.023 | 71.595 | 61.444 | 78.434 | 76.351 | 4.502 | 73.379 | 39.551
6 | 2885-2 | F2624 | Prot SerPl-mCnc | Protéines Pédiatrique_Sérum/Plasma_g/L | g/l | 3 | 58.934 | 11.768 | 45.364 | 40.882 | 54.139 | 59.366 | 84.880 | 11.952 | 77.996 | 5.854
7 | 2885-2 | F5122 | Prot SerPl-mCnc | Protéines Duplication A7347_Plasma_g/L | g/l | 213 | 80.395 | 6.134 | 40.129 | 69.549 | 66.730 | 85.024 | 110.905 | 8.824 | 113.764 | 38.456
8 | 2888-6 | A1694 | Protéines [Masse/Volume] Urine - Numérique | Protéines_Urines 24h_g/L | g/l | 193 | 2.343 | 4.262 | 0.063 | 0.089 | 0.257 | 1.620 | 52.679 | 0.162 | 1.275 | 0.000
9 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...

If you prefer, an HTML table is also generated along with the CSV (same name, but with a .html extension).
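Before deriving thresholds from the summary, it can help to spot the concept codes backed by very few measurements, since their statistics are fragile. The sketch below does this on a small in-memory excerpt that reuses three rows and column names from the table above. The `MIN_COUNT` cut-off is an arbitrary assumption, not a library default.

```python
import pandas as pd

# Excerpt of the summary shown above (three rows, three columns).
summary = pd.DataFrame(
    {
        "AnaBio_concept_code": ["A0249", "D0058", "F2624"],
        "unit_source_value": ["g/l", "g/l", "g/l"],
        "count": [6021, 51, 3],
    }
)

# Arbitrary cut-off chosen for illustration only.
MIN_COUNT = 100

# Codes whose statistics rest on fewer than MIN_COUNT measurements.
sparse = summary[summary["count"] < MIN_COUNT]

print(sparse["AnaBio_concept_code"].tolist())
```

With the real file, the `summary` DataFrame would instead come from the `pd.read_csv(...)` call shown above.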

3.2.2 Create configuration from statistical summary

Then, you can use the function create_config_from_stats() to pre-fill the configuration file with max_threshold and min_threshold. The threshold computation is based on the Median Absolute Deviation (MAD) methodology [1].

from eds_scikit.biology.utils.config import create_config_from_stats

config_name = "my_custom_config"

create_config_from_stats(
    concepts_sets=concepts_sets,
    config_name=config_name,
)
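For intuition about MAD-based thresholds, the sketch below computes the median, then the median of the absolute deviations from it, and places the thresholds at a fixed number of MADs on either side of the median. The function name `mad_thresholds` and the factor `k=4.0` are assumptions made for this example; the exact formula used by create_config_from_stats() is not given in this tutorial and may differ.

```python
from statistics import median


def mad_thresholds(values, k=4.0):
    """Return (min_threshold, max_threshold) as median -/+ k * MAD.

    `k` is a hypothetical factor chosen for illustration.
    """
    med = median(values)
    # MAD: median of the absolute deviations from the median.
    mad = median([abs(v - med) for v in values])
    return med - k * mad, med + k * mad


# Hypothetical protein measurements in g/l, with one extreme value.
values = [60.0, 62.0, 64.0, 66.0, 68.0, 70.0, 250.0]

low, high = mad_thresholds(values)
outliers = [v for v in values if v < low or v > high]

print(low, high, outliers)
```

Because both the median and the MAD are insensitive to a few extreme values, the single measurement at 250.0 barely moves the thresholds, which is the main reason to prefer them over the mean and standard deviation for outlier flagging.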

3.2.3 Edit units manually

The transformed_unit column is pre-filled with the unit that covers the most measurements. When a unit_source_value differs from the transformed_unit, it probably means that the concept's unit needs to be normalized.

  • To normalize the unit of a concept, manually fill in the Action column with Transform and the Coefficient column with the unit conversion factor.
  • If you consider the concept irrelevant, you can fill in the Action column with Delete and it will delete the measurements corresponding to the concept.
  • If the unit_source_value matches the transformed_unit you can leave the Action and the Coefficient columns empty.
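The three rules above can also be applied programmatically. The sketch below works on a hypothetical in-memory excerpt of a configuration table: the column names Action, Coefficient, unit_source_value and transformed_unit come from this section, while the codes `X0001` and `X0002`, the mg/l unit and the 0.001 factor are made up for the example. How the real file is located and saved is not covered here.

```python
import pandas as pd

# Hypothetical excerpt of a pre-filled configuration table.
config = pd.DataFrame(
    {
        "AnaBio_concept_code": ["A0249", "X0001", "X0002"],
        "unit_source_value": ["g/l", "mg/l", "g/l"],
        "transformed_unit": ["g/l", "g/l", "g/l"],
        "Action": [None, None, None],
        "Coefficient": [float("nan")] * 3,
    }
)

# Rule 1: the source unit differs from the target unit, so convert it.
needs_conversion = config["unit_source_value"] != config["transformed_unit"]
config.loc[needs_conversion, "Action"] = "Transform"
config.loc[needs_conversion, "Coefficient"] = 0.001  # mg/l -> g/l

# Rule 2: a concept judged irrelevant is marked for deletion.
irrelevant = config["AnaBio_concept_code"] == "X0002"
config.loc[irrelevant, "Action"] = "Delete"

# Rule 3: rows whose units already match are left untouched.
print(config)
```

A single coefficient is used here only because the example has one mismatching unit. In a real table, each (source unit, target unit) pair needs its own conversion factor.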

3.2.4 Use your custom configuration

Once you have created your configuration (for instance under the name config_name="my_custom_config"), you can provide it to the relevant functions (see below).

You can also check the configuration file directly:

from eds_scikit.resources import registry
config = registry.get("data", "biology_config.my_custom_config")()

4. Clean the data

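Now you can use the bioclean() function with your custom configuration or the default configuration to apply the operations defined by the configuration: removing outliers, removing unwanted codes and normalizing units.

It will add a bioclean table to your data. For more details, have a look at the dedicated section.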

from eds_scikit.biology import bioclean

bioclean(
    data,
    concepts_sets=concepts_sets,
    config_name=config_name, # use config_name="all_aphp" for APHP's default configuration
    start_date=start_date,
    end_date=end_date,
)

5. Visualize the statistical summary of clean data

Finally, you can build and save two interactive dashboards and a summary table for each concepts-set. They describe various statistical properties of your clean data.

from eds_scikit.biology import plot_biology_summary

plot_biology_summary(data)

  1. FAIR Health. Application of the MAD (Median Absolute Deviation) methodology to exclude extreme data values in FAIR Health products. 2017. URL: https://s3.amazonaws.com/media2.fairhealth.org/resource/asset/FH%20Methodology%20-%20Median%20Absolute%20Deviation.pdf