Cleaning

The pipeline is structured in 3 stages: extracting concepts-sets, normalizing units, and detecting outliers.

Definitions

The BioClean module focuses on two OMOP terms:

  • measurement is a record obtained through the standardized testing or examination of a person or a person's sample. It corresponds to a row in the Measurement table.
  • concept is a semantic notion that uniquely identifies a clinical event. It can group several measurements.

A third term was created to ease the use of the two above:

  • concepts-set is a generic concept that has been deemed appropriate for most biological analyses. It is a group of several biological concepts representing the same biological entity.

Example:
Suppose laboratory X measures the creatinine of Mister A and Mister B in mg/dL, and laboratory Y measures the creatinine of Mister C in µmol/L. In this context, the dataset will contain:

  • 3 measurements (one for each conducted test)
  • 2 concepts (one concept for the creatinine tested in mg/dL and another one for the creatinine tested in µmol/L)
  • 1 concepts-set (it groups the 2 concepts because they are the same biological entity)
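
The counts in this example can be reproduced with a minimal sketch in plain Python. The field names and concept codes below are hypothetical and only illustrate how the three terms relate; they are not the OMOP schema or the library's data model.

```python
# Hypothetical rows mimicking the creatinine example above.
# Field names and concept codes are illustrative only.
measurements = [
    {"person": "A", "lab": "X", "concept": "creatinine_mg_dl", "value": 1.0, "unit": "mg/dL"},
    {"person": "B", "lab": "X", "concept": "creatinine_mg_dl", "value": 1.2, "unit": "mg/dL"},
    {"person": "C", "lab": "Y", "concept": "creatinine_umol_l", "value": 80.0, "unit": "µmol/L"},
]

# A concepts-set groups the concepts that denote the same biological entity.
concepts_sets = {"creatinine": {"creatinine_mg_dl", "creatinine_umol_l"}}

# Distinct concepts actually present in the measurements.
concepts = {m["concept"] for m in measurements}

n_measurements = len(measurements)
n_concepts = len(concepts)
# Count the concepts-sets that cover at least one observed concept.
n_concepts_sets = sum(1 for codes in concepts_sets.values() if codes & concepts)

print(n_measurements, n_concepts, n_concepts_sets)  # 3 2 1
```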

1. Input

The BioClean table is based on three tables provided by the data scientist in OMOP format:

The Concepts-set table contains the meta-concepts of interest for the user.

2. Extract concepts-sets

In order to work on the measurements of interest, the user can extract a list of concepts-sets by:

  • Selecting default concepts-sets provided in the library, which represent common biological entities.
  • Editing a default concepts-set, if needed, by modifying its codes.
  • Creating a concepts-set from scratch.
from eds_scikit.biology import ConceptsSet

hemoglobin = ConceptsSet("Hemoglobin_Blood_Quantitative") # (1)
hemoglobin.concept_codes.append("C87545") # (2)
my_custom_concepts_set = ConceptsSet(
    name="Custom_entity",
    concept_codes=["A2458", "B87985"],
) # (3)

my_concepts_sets = [my_custom_concepts_set, hemoglobin]
  1. Select a default concepts-set by passing its name
  2. Edit the default concepts-set by appending a concept code
  3. Create a new concepts-set from scratch

Disclaimer

The list of default concepts-sets is still a work in progress. We update it regularly, and you are welcome to contribute. See our contributing guidelines.

3. Normalize units

The bioclean function converts all the measurements of a given concepts-set to the same unit. This feature relies on a CSV configuration file listing the conversion coefficients of the default concepts-sets.

Manually set

For the moment, there is no automatic unit conversion: if you create your own configuration, the Coefficient column has to be set manually.
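
To make the role of the conversion coefficients concrete, here is a minimal sketch of coefficient-based unit normalization. It is not the library's implementation: the CSV layout, the column names, and the `load_coefficients` and `normalize` helpers are assumptions made for illustration. The creatinine coefficient (1 mg/dL ≈ 88.4 µmol/L) follows from its molar mass.

```python
import csv
import io

# Hypothetical configuration; the real file's layout and column names may differ.
CONFIG_CSV = """concepts_set,unit,target_unit,coefficient
creatinine,mg/dL,µmol/L,88.4
creatinine,µmol/L,µmol/L,1.0
"""


def load_coefficients(text):
    """Map (concepts_set, unit) to (target_unit, coefficient)."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        (row["concepts_set"], row["unit"]): (row["target_unit"], float(row["coefficient"]))
        for row in reader
    }


def normalize(measurement, coefficients):
    """Return a copy of the measurement expressed in the target unit."""
    key = (measurement["concepts_set"], measurement["unit"])
    target_unit, coefficient = coefficients[key]
    return {**measurement, "value": measurement["value"] * coefficient, "unit": target_unit}


coefficients = load_coefficients(CONFIG_CSV)
measurements = [
    {"concepts_set": "creatinine", "value": 1.0, "unit": "mg/dL"},
    {"concepts_set": "creatinine", "value": 80.0, "unit": "µmol/L"},
]
normalized = [normalize(m, coefficients) for m in measurements]
```

After this step every creatinine value is expressed in µmol/L, so measurements coming from different laboratories become directly comparable.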

4. Detect outliers

This step detects outliers using the Median Absolute Deviation (MAD) methodology [1]. This statistical method computes the max_threshold and min_threshold columns.
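
As an illustration of the general idea, the sketch below computes MAD-based thresholds in plain Python. It shows the generic rule (median plus or minus a multiple of the MAD), not necessarily the exact formula used by the library; the `mad_thresholds` helper and the cut-off `k=4.0` are assumptions chosen for this example.

```python
from statistics import median


def mad_thresholds(values, k=4.0):
    """Return (min_threshold, max_threshold) as median -/+ k * MAD.

    k is a hypothetical cut-off chosen for this sketch.
    """
    center = median(values)
    # MAD: median of the absolute deviations from the median.
    mad = median([abs(v - center) for v in values])
    return center - k * mad, center + k * mad


# Illustrative values for one concepts-set, all in the same unit.
values = [70.0, 75.0, 78.0, 80.0, 82.0, 85.0, 90.0, 900.0]
min_threshold, max_threshold = mad_thresholds(values)

# Values outside [min_threshold, max_threshold] are flagged as outliers.
outliers = [v for v in values if v < min_threshold or v > max_threshold]
```

Because both the center and the spread are medians, a single extreme value such as 900.0 barely shifts the thresholds, which is why the MAD is preferred here over a mean and standard deviation.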

Statistics

The default configuration file is based on statistical summaries of the AP-HP Data Warehouse and is therefore best suited to it.

If needed, you can create your own configuration file by using the statistical summaries of your data. For more details, please see the tutorial.


  1. FAIR Health. Application of the MAD (Median Absolute Deviation) Methodology to Exclude Extreme Data Values in FAIR Health Products. 2017. URL: https://s3.amazonaws.com/media2.fairhealth.org/resource/asset/FH%20Methodology%20-%20Median%20Absolute%20Deviation.pdf