Tutorial
This tutorial takes you through the entire workflow of the Biology module.
%reload_ext autoreload
%autoreload 2
import eds_scikit
import pandas as pd
spark, sc, sql = eds_scikit.improve_performances()
1. Load Data
First, you need to load your data. As detailed in the dedicated section, eds-scikit expects to work with Pandas or Koalas DataFrames. We provide various connectors to facilitate data fetching, namely a Hive connector, a Postgres connector and a Pandas connector.
This tutorial uses the Hive connector.
from eds_scikit.io import HiveData
data = HiveData(
    database_name="cse_XXX",
    tables_to_load=[
        "care_site",
        "concept",
        "concept_relationship",
        "measurement",
        "visit_occurrence",
    ],
)
2. Define your concepts-sets
In order to work on the measurements of interest, you can extract a list of concepts-sets by:
- Selecting default concepts-sets provided in the library.
- Modifying the codes of a selected default concepts-set.
- Creating a concepts-set from scratch.
This tutorial merges two default concepts-sets into a single one and adds a custom concepts-set.
from eds_scikit.biology import ConceptsSet
protein_blood = ConceptsSet("Protein_Blood_Quantitative")
protein_urine = ConceptsSet("Protein_Urine_Quantitative")
protein = ConceptsSet(
    name="Protein_Quantitative",
    concept_codes=protein_blood.concept_codes + protein_urine.concept_codes,
)
custom_entity = ConceptsSet(
    name="Custom_entity", concept_codes=["G6616", "I2013", "C2102"]
)
concepts_sets = [
    protein,
    custom_entity,
]
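Concatenating two code lists, as done for Protein_Quantitative above, can repeat a code when both sets share it. The following is a minimal sketch with plain Python lists of a merge that drops duplicates while keeping the original order. The helper merge_codes and the codes are made up for illustration and are not part of eds-scikit.

```python
def merge_codes(*code_lists):
    """Concatenate code lists, dropping duplicates but keeping first-seen order."""
    merged = [code for codes in code_lists for code in codes]
    # dict.fromkeys keeps insertion order, so the first occurrence wins.
    return list(dict.fromkeys(merged))


# Hypothetical codes, for illustration only.
blood_codes = ["A0001", "A0002", "B0003"]
urine_codes = ["B0003", "C0004"]

merged_codes = merge_codes(blood_codes, urine_codes)
print(merged_codes)
```

The merged list could then be passed as concept_codes when building a ConceptsSet.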
3. Define the configuration
The configuration file does three things:
- Remove outliers
- Remove unwanted codes
- Normalize units
3.1 The default configuration
A default configuration is available when working on APHP's CDW. You can access it via:
from eds_scikit.resources import registry
biology_config = registry.get("data", "get_biology_config.all_aphp")()
3.2 Create your own configuration (OPTIONAL)
If this default configuration file does not meet your requirements, you can follow this tutorial to create your own configuration file.
As a reminder, a configuration file is a CSV table where each row corresponds to a given standard concept_code and a given unit. For each row, it gives a maximum threshold and a minimum threshold to flag outliers, and a unit conversion coefficient to normalize units if needed.
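To make the role of each row concrete, here is a minimal sketch in plain Python of how one (concept_code, unit) entry could be used to flag an outlier and normalize a value. The codes, thresholds, field names and the helper apply_config are assumptions for illustration, and comparing the raw value (rather than the normalized one) to the thresholds is also an assumption. The library's own implementation may differ.

```python
# Hypothetical configuration: one entry per (concept_code, unit) pair.
config = {
    ("X0001", "g/L"): {"min_threshold": 0.0, "max_threshold": 200.0, "coefficient": 1.0},
    ("X0001", "g/dL"): {"min_threshold": 0.0, "max_threshold": 20.0, "coefficient": 10.0},
}

# Hypothetical measurements.
measurements = [
    {"concept_code": "X0001", "unit": "g/L", "value": 70.0},
    {"concept_code": "X0001", "unit": "g/dL", "value": 6.5},
    {"concept_code": "X0001", "unit": "g/L", "value": 950.0},
]


def apply_config(rows, config):
    """Flag outliers and express every value in the target unit (here g/L)."""
    out = []
    for row in rows:
        rule = config[(row["concept_code"], row["unit"])]
        in_range = rule["min_threshold"] <= row["value"] <= rule["max_threshold"]
        out.append(
            {
                **row,
                "is_outlier": not in_range,
                "normalized_value": row["value"] * rule["coefficient"],
            }
        )
    return out


flagged = apply_config(measurements, config)
for row in flagged:
    print(row)
```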
3.2.1 Plot statistical summary
The first step is to compute the statistical summary of each concepts-set with the function plot_biology_summary(stats_only=True).
from eds_scikit.biology import plot_biology_summary
start_date = "2017-01-01"
end_date = "2022-01-01"
plot_biology_summary(
    data,
    concepts_sets=concepts_sets,
    start_date=start_date,
    end_date=end_date,
    stats_only=True,
)
By default, the data will be saved in the Biology_summary folder. Each ConceptsSet will have its own folder.
Here, we used stats_only=True, so:
- No graphical dashboard will be generated
- Data will not be stratified by care site
Let us display the results for the protein-related ConceptsSet:
pd.read_csv("./Biology_summary/Protein_Quantitative/stats_summary.csv")
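The kind of per-concept, per-unit statistics that such a summary holds can be sketched in plain Python. The data, the grouping keys and the chosen statistics below are assumptions for illustration; the actual columns of stats_summary.csv may differ.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical measurements: (concept_code, unit, value).
measurements = [
    ("X0001", "g/L", 68.0),
    ("X0001", "g/L", 72.0),
    ("X0001", "g/L", 70.0),
    ("X0001", "g/dL", 6.5),
]

# Group values by (concept_code, unit).
groups = defaultdict(list)
for code, unit, value in measurements:
    groups[(code, unit)].append(value)

# One summary row per group.
summary = {
    key: {
        "count": len(values),
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
    }
    for key, values in groups.items()
}

for key, stats in summary.items():
    print(key, stats)
```

Seeing the same concept under two units with very different medians is the typical signal that a unit conversion is needed.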
3.2.2 Create configuration from statistical summary
Then, you can use the function create_config_from_stats() to pre-fill the configuration file with max_threshold and min_threshold. The threshold computation is based on the Median Absolute Deviation (MAD) methodology [@madmethodology].
from eds_scikit.biology.utils.config import create_config_from_stats
config_name = "my_custom_config"
create_config_from_stats(
    concepts_sets=concepts_sets,
    config_name=config_name,
)
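To give an intuition of MAD-based thresholds, here is a minimal sketch in plain Python. It places the thresholds at the median plus or minus a multiple of the MAD. The helper mad_thresholds, the multiplier of 3 and the absence of a scaling constant are assumptions for illustration; the exact formula used by create_config_from_stats() may differ.

```python
from statistics import median


def mad_thresholds(values, n_mads=3.0):
    """Return (min_threshold, max_threshold) as median -/+ n_mads * MAD."""
    med = median(values)
    # MAD: median of the absolute deviations from the median.
    mad = median(abs(v - med) for v in values)
    return med - n_mads * mad, med + n_mads * mad


# Hypothetical values with one obvious outlier.
values = [68.0, 70.0, 71.0, 69.0, 72.0, 950.0]

low, high = mad_thresholds(values)
outliers = [v for v in values if not low <= v <= high]
print(low, high, outliers)
```

Because both the median and the MAD are robust statistics, the extreme value barely shifts the thresholds, which is why this approach is commonly preferred over mean and standard deviation for outlier detection.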
3.2.3 Edit units manually
The transformed_unit column is pre-filled with the unit that corresponds to the most measurements. When a unit_source_value differs from the transformed_unit, it probably means that the concept's unit needs to be normalized.
- To normalize the unit of a concept, manually fill in the Action column with Transform and the Coefficient column with the unit conversion factor.
- If you consider the concept irrelevant, fill in the Action column with Delete; the measurements corresponding to the concept will be deleted.
- If the unit_source_value matches the transformed_unit, you can leave the Action and Coefficient columns empty.
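The effect of the three cases above can be sketched in plain Python. The codes, the values and the helper apply_actions are hypothetical, and the way the library actually applies the Action and Coefficient columns is assumed here, not taken from its implementation.

```python
# Hypothetical configuration rows keyed by (concept_code, unit_source_value).
config = {
    ("X0001", "g/L"): {"Action": None, "Coefficient": None},          # already in the target unit
    ("X0001", "g/dL"): {"Action": "Transform", "Coefficient": 10.0},  # g/dL -> g/L
    ("X0002", "g/L"): {"Action": "Delete", "Coefficient": None},      # irrelevant concept
}


def apply_actions(rows, config):
    """Drop rows marked Delete and rescale rows marked Transform."""
    cleaned = []
    for code, unit, value in rows:
        rule = config[(code, unit)]
        if rule["Action"] == "Delete":
            continue
        if rule["Action"] == "Transform":
            value = value * rule["Coefficient"]
        cleaned.append((code, value))
    return cleaned


# Hypothetical measurements: (concept_code, unit_source_value, value).
rows = [
    ("X0001", "g/L", 70.0),
    ("X0001", "g/dL", 7.0),
    ("X0002", "g/L", 5.0),
]

cleaned = apply_actions(rows, config)
print(cleaned)
```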
3.2.4 Use your custom configuration
Once you have created your configuration (for instance under the name config_name="my_custom_config"), you can provide it to the relevant functions (see below).
You can also check the configuration file directly:
from eds_scikit.resources import registry
config = registry.get("data", "biology_config.my_custom_config")()
4. Clean the data
Now you can use the bioclean() function with your custom configuration or the default configuration to apply the three steps listed in section 3: remove outliers, remove unwanted codes and normalize units.
It will add a bioclean table to your data. For more details, have a look at the dedicated section.
from eds_scikit.biology import bioclean
bioclean(
    data,
    concepts_sets=concepts_sets,
    config_name=config_name,  # use config_name="all_aphp" for APHP's default configuration
    start_date=start_date,
    end_date=end_date,
)
5. Visualize the statistical summary of clean data
Finally, you can build and save two interactive dashboards and a summary table for each concepts-set. They describe various statistical properties of your clean data.
from eds_scikit.biology import plot_biology_summary
plot_biology_summary(data)