Detailed use
This tutorial demonstrates the workflow to prepare the measurement table.
Big volume
The measurement table can be large. Remember to set an appropriate Spark configuration before loading the data.
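As a rough illustration of what setting a Spark configuration can look like, the sketch below applies a dictionary of settings to a session builder. The property names are standard Spark properties, but the values, the `spark_conf` dictionary and the `apply_conf` helper are hypothetical placeholders, not recommendations from this library; adapt them to your cluster.

```python
# Hypothetical settings: the keys are standard Spark properties,
# the values are placeholders to adapt to your own cluster.
spark_conf = {
    "spark.executor.memory": "8g",
    "spark.sql.shuffle.partitions": "200",
    "spark.driver.maxResultSize": "4g",
}


def apply_conf(builder, conf):
    """Apply each key/value pair to a SparkSession-like builder and return it."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Usage, assuming pyspark is installed (left commented so the sketch runs without it):
# from pyspark.sql import SparkSession
# spark = apply_conf(SparkSession.builder.appName("biology"), spark_conf).getOrCreate()
```

The helper only relies on the builder exposing a chainable `config(key, value)` method, which is how `SparkSession.builder` behaves.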
Mapping measurement table to ANABIO codes
Defining Concept-Set
Here we work with the pre-defined Glucose concept set. See quick-use for an example of how to create a custom concept set.
from eds_scikit.biology import prepare_measurement_table, ConceptsSet
glucose_blood = ConceptsSet("Glucose_Blood")
Preparing measurement table
First, we prepare the measurements with convert_units=False, as we do not yet know which units the table contains.
from eds_scikit.biology import prepare_measurement_table
measurement = prepare_measurement_table(
    data,
    start_date="2022-01-01",
    end_date="2022-05-01",
    concept_sets=[glucose_blood],
    convert_units=False,  # units are not known yet
    get_all_terminologies=False,
)
Statistical summary
A statistical summary by codes allows us to gain insight into value distributions and detect possible heterogeneous units.
from eds_scikit.biology import measurement_values_summary
stats_summary = measurement_values_summary(
    measurement,
    category_cols=["concept_set", "GLIMS_ANABIO_concept_code"],
    value_column="value_as_number",
    unit_column="unit_source_value",
)
stats_summary
concept_set | ANABIO_concept_code | no_units | unit_source_value | range_low_anomaly_count | range_high_anomaly_count | measurement_count | value_as_number_count | value_as_number_mean | value_as_number_std | value_as_number_min | value_as_number_25% | value_as_number_50% | value_as_number_75% | value_as_number_max |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Glucose_Blood | XXXXX | 100 | mmol/l | 15 | 5 | 1000 | 1000 | 5 | 2 | 0 | 2 | 5 | 8 | 9 |
Glucose_Blood | YYYYY | 50 | mg/ml | 20 | 10 | 5000 | 5000 | 25 | 10 | 0 | 20 | 25 | 37 | 45 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
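To make the idea of spotting heterogeneous units concrete, here is a minimal standard-library sketch on toy data. It is not how `measurement_values_summary` is implemented; the rows, the `summarize` and `units_per_concept_set` helpers, and all values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Toy rows: (concept_set, code, unit, value). All values are made up.
rows = [
    ("Glucose_Blood", "XXXXX", "mmol/l", 4.0),
    ("Glucose_Blood", "XXXXX", "mmol/l", 5.0),
    ("Glucose_Blood", "XXXXX", "mmol/l", 6.0),
    ("Glucose_Blood", "YYYYY", "mg/ml", 0.8),
    ("Glucose_Blood", "YYYYY", "mg/ml", 0.9),
    ("Glucose_Blood", "YYYYY", "mg/ml", 1.0),
]


def summarize(rows):
    """Group values by (concept_set, code, unit) and compute basic statistics."""
    groups = defaultdict(list)
    for concept_set, code, unit, value in rows:
        groups[(concept_set, code, unit)].append(value)
    return {
        key: {
            "count": len(values),
            "mean": mean(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }
        for key, values in groups.items()
    }


def units_per_concept_set(rows):
    """Collect the set of distinct units observed in each concept set."""
    units = defaultdict(set)
    for concept_set, _code, unit, _value in rows:
        units[concept_set].add(unit)
    return dict(units)


summary = summarize(rows)
units = units_per_concept_set(rows)

# A concept set with more than one unit needs a conversion step.
heterogeneous = sorted(cs for cs, seen in units.items() if len(seen) > 1)
print(heterogeneous)
```

In this toy data the two codes report different units, so the concept set is flagged and its per-code statistics cannot be compared directly.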
Units correction
To map all units to a common base unit, we can use the add_conversion and add_target_unit methods of the ConceptsSet class.
glucose_blood.add_conversion("mol", "g", 180)
glucose_blood.add_target_unit("mmol/l")
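The arithmetic behind such a conversion can be sketched by hand. The example below assumes that the factor 180 means that 1 mol of glucose corresponds to 180 g, and converts a value from mg/ml to mmol/l. The function name is hypothetical and this is only the underlying calculation, not the library's conversion code.

```python
# Assumption: the factor passed to add_conversion("mol", "g", 180) means 1 mol = 180 g.
GLUCOSE_G_PER_MOL = 180


def mg_per_ml_to_mmol_per_l(value_mg_per_ml):
    """Convert a glucose concentration from mg/ml to mmol/l."""
    grams_per_litre = value_mg_per_ml  # 1 mg/ml equals 1 g/l
    mol_per_litre = grams_per_litre / GLUCOSE_G_PER_MOL
    return mol_per_litre * 1000  # mol -> mmol


# 0.9 mg/ml is 0.9 g/l, i.e. about 5 mmol/l
print(round(mg_per_ml_to_mmol_per_l(0.9), 6))
```

Once every code is expressed in the same target unit, the per-code statistics become directly comparable.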
After re-running prepare_measurement_table with convert_units=True, we can check the summary table again, this time on the normalized value and unit columns.
stats_summary = measurement_values_summary(
    measurement,
    category_cols=["concept_set", "GLIMS_ANABIO_concept_code"],
    value_column="value_as_number_normalized",
    unit_column="unit_source_value_normalized",
)
stats_summary
concept_set | ANABIO_concept_code | no_units | unit_source_value | range_low_anomaly_count | range_high_anomaly_count | measurement_count | value_as_number_count | value_as_number_mean | value_as_number_std | value_as_number_min | value_as_number_25% | value_as_number_50% | value_as_number_75% | value_as_number_max |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Glucose_Blood | XXXXX | 100 | mmol/l | 15 | 5 | 1000 | 1000 | 5 | 2 | 0 | 2 | 5 | 8 | 9 |
Glucose_Blood | YYYYY | 50 | mmol/l | 20 | 10 | 5000 | 5000 | 5 | 2 | 0 | 4 | 5 | 7 | 9 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Plot summary
Once all units are homogeneous, we can generate a more detailed dashboard for investigating the biology data.
from eds_scikit.biology import plot_biology_summary
measurement = measurement.merge(
    data.visit_occurrence[["care_site_id", "visit_occurrence_id"]],
    on="visit_occurrence_id",
)
measurement = measurement.merge(
    data.care_site[["care_site_id", "care_site_short_name"]],
    on="care_site_id",
)
plot_biology_summary(measurement, value_column="value_as_number_normalized")