Documentation: https://aphp.github.io/edsteva/latest/
Source Code: https://github.com/aphp/edsteva
Getting Started
EDS-TeVa provides a set of tools to characterize the temporal variability of data induced by the dynamics of the clinical IT system.
Context
Real-world data is subject to important temporal drifts that may be caused by a variety of factors [1]. In particular, data availability fluctuates with the deployment of clinical software and its clinical use. The dynamics of software deployment and adoption are not trivial, as they depend on the care site and on the category of data considered.
Installation
Requirements
EDS-TeVa stands on the shoulders of Spark 2.4, which runs on Java 8 and Python ~3.7.1. It is therefore essential to:
- Install a version of Python compatible with ~3.7.1 (that is, at least 3.7.1 and below 3.8).
- Install OpenJDK 8, an open-source reference implementation of Java 8. For more details, check the installation guide for your operating system.
You can install EDS-TeVa through pip. We recommend pinning the library version in your projects, or using a strict package manager like Poetry:
pip install edsteva==0.2.8
Working example: administrative records relative to visits
Let's consider a basic category of data: administrative records relative to visits. A visit is characterized by a care site, a length of stay, a stay type (full hospitalisation, emergency, consultation, etc.) and other characteristics. In this example, the objective is to estimate the availability of visit records with respect to time, care site and stay type.
1. Load your data
As detailed in the dedicated section, EDS-TeVa expects to work with Pandas or Koalas DataFrames. We provide various connectors to facilitate data fetching, namely a Hive connector, a Postgres connector and a LocalData connector.
# Option 1: Hive connector
from edsteva.io import HiveData
db_name = "my_db"
tables_to_load = [
    "visit_occurrence",
    "visit_detail",
    "care_site",
    "fact_relationship",
]
data = HiveData(db_name, tables_to_load=tables_to_load)
data.visit_occurrence  # loaded tables are exposed as attributes
# Option 2: Postgres connector
from edsteva.io import PostgresData
db_name = "my_db"
schema = "my_schema"
user = "my_username"
data = PostgresData(db_name, schema=schema, user=user)
data.visit_occurrence
# Option 3: LocalData connector
import os
from edsteva.io import LocalData
folder = os.path.abspath(MY_FOLDER_PATH)  # MY_FOLDER_PATH: path to the folder that holds your tables
data = LocalData(folder)
data.visit_occurrence
2. Choose a Probe or create a new Probe
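Probe
A Probe is a Python class designed to compute a completeness predictor $c(t)$ that characterizes the availability of a target variable over time $t$.
In this example, $c(t)$ predicts the availability of administrative records relative to visits. It is defined for each care site and stay type as the number of visits $n_{visit}(t)$ per month $t$, normalized by the maximum number of records per month $n_{max} = \max_{t}(n_{visit}(t))$ computed over the entire study period:
$$c(t) = \frac{n_{visit}(t)}{n_{max}}$$
If the maximum number of records per month $n_{max}$ is equal to 0, we consider that the completeness predictor $c(t)$ is also equal to 0.
The VisitProbe is already available by default in the library.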
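To make the normalization concrete, here is a minimal plain-Python sketch. It assumes the predictor is the monthly visit count divided by its maximum over the study period; the function name `completeness` and the monthly counts are hypothetical and are not part of the EDS-TeVa API.

```python
from typing import Dict


def completeness(n_visit: Dict[str, int]) -> Dict[str, float]:
    """Normalize monthly visit counts by their maximum over the study period."""
    n_max = max(n_visit.values(), default=0)
    if n_max == 0:
        # Convention assumed here: no record at all means c(t) = 0 for every month.
        return {month: 0.0 for month in n_visit}
    return {month: count / n_max for month, count in n_visit.items()}


# Hypothetical monthly counts for one care site and one stay type.
monthly_visits = {"2019-01": 0, "2019-02": 50, "2019-03": 200, "2019-04": 180}
c = completeness(monthly_visits)
print(c)
```

The month with the most visits gets a predictor of 1, and every other month is expressed as a fraction of it.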
2.1 Compute your Probe
The compute() method takes a Data object as input and stores the computed completeness predictor $c(t)$ in the predictor attribute of a Probe:
from edsteva.probes import VisitProbe
probe_path = "my_path/visit.pkl"
visit = VisitProbe()
visit.compute(
    data,
    care_site_levels=["Hospital", "Pole", "UF"],
    stay_types={
        "All": ".*",
        "Urg_Hospit": "urgence|hospitalisés",  # regular expression matched against the stay type
    },
    care_site_specialties=None,
    stay_sources=None,
    length_of_stays=None,
    provenance_sources=None,
    age_ranges=None,
)
visit.save(path=probe_path)  # save the computed Probe to probe_path
visit.predictor.head()
Saved to /my_path/visit.pkl
care_site_level | care_site_id | care_site_short_name | stay_type | date | n_visit | c |
---|---|---|---|---|---|---|
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'Urg_Hospit' | 2019-05-01 | 233.0 | 0.841 |
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'All' | 2021-04-01 | 393.0 | 0.640 |
Pôle/DMU | 8312027648 | Care site 2 | 'Urg_Hospit' | 2017-03-01 | 204.0 | 0.497 |
Pôle/DMU | 8312027648 | Care site 2 | 'All' | 2018-08-01 | 22.0 | 0.274 |
Hôpital | 8312022130 | Care site 3 | 'Urg_Hospit' | 2022-02-01 | 9746.0 | 0.769 |
2.2 Filter your Probe
In this example, we are interested in three hospitals. We consequently filter data before any further analysis.
from edsteva.probes import VisitProbe
care_site_short_name = ["Hôpital-1", "Hôpital-2", "Hôpital-3"]
filtered_visit = VisitProbe()
filtered_visit.load(path=probe_path)
filtered_visit.filter_care_site(care_site_short_names=care_site_short_name)
2.3 Visualize your Probe
Interactive dashboard
Interactive dashboards can be used to visualize the average completeness predictor $c(t)$ of the selected care sites and stay types.
from edsteva.viz.dashboards import probe_dashboard
probe_dashboard(
probe=filtered_visit,
)
Static plot
If you need a static plot for a report, a paper or anything else, you can use the probe_plot() function. It returns the top plot of the dashboard without the interactive filters, so you have to specify the filters in the inputs of the function.
from edsteva.viz.plots import probe_plot
plot_path = "my_path/visit.html"
stay_type = "All"
probe_plot(
probe=filtered_visit,
care_site_level="Hospital",
stay_type=stay_type,
save_path=plot_path,  # if save_path is specified, the plot is saved to that path
)
3. Choose a Model or create a new Model
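A Model is a Python class designed to fit a function $f_\theta(t)$ to each completeness predictor $c(t)$ of a Probe.
In this example, the model fits a step function $f_{t_0, c_0}(t) = c_0 \, \mathbb{1}_{t \geq t_0}(t)$ to the completeness predictor $c(t)$, with coefficients $\theta = (t_0, c_0)$:
- the characteristic time $t_0$ estimates the time after which the data is available.
- the characteristic value $c_0$ estimates the stabilized routine completeness.
It also computes the following metric: the mean squared error between the step function and $c(t)$ after $t_0$, which estimates the stability of the data after $t_0$.
This step function Model is available in the library.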
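As an illustration, a step function and its error after the characteristic time can be sketched in plain Python. This is a sketch under stated assumptions: months are represented by integer indices, the error is the mean of the squared residuals over the months at or after `t_0`, and the helper names `step_function` and `error_after_t0` are hypothetical rather than part of the library.

```python
from typing import List


def step_function(t: int, t_0: int, c_0: float) -> float:
    """Value of the step function: 0 before t_0, c_0 from t_0 onwards."""
    return c_0 if t >= t_0 else 0.0


def error_after_t0(c: List[float], t_0: int, c_0: float) -> float:
    """Mean squared deviation between c(t) and the step function for t >= t_0."""
    residuals = [
        (step_function(t, t_0, c_0) - c_t) ** 2
        for t, c_t in enumerate(c)
        if t >= t_0
    ]
    return sum(residuals) / len(residuals) if residuals else 0.0


# Hypothetical completeness predictor over six consecutive months.
c = [0.0, 0.1, 0.0, 0.8, 0.6, 0.7]
error = error_after_t0(c, t_0=3, c_0=0.7)
print(step_function(2, t_0=3, c_0=0.7), step_function(4, t_0=3, c_0=0.7), error)
```

A small error means that, once the data becomes available, the predictor stays close to the plateau value.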
3.1 Fit your Model
The fit method takes a Probe as input. It estimates the coefficients, for example by minimizing a quadratic loss function, and computes the metrics. Finally, it stores the estimated coefficients and the computed metrics in the estimates attribute of the Model.
from edsteva.models.step_function import StepFunction
model_path = "my_path/fitted_visit.pkl"
step_function_model = StepFunction()
step_function_model.fit(probe=filtered_visit)
step_function_model.save(model_path)
step_function_model.estimates.head()
Saved to /my_path/fitted_visit.pkl
care_site_level | care_site_id | stay_type | t_0 | c_0 | error |
---|---|---|---|---|---|
Pôle/DMU | 8312056386 | 'Urg_Hospit' | 2019-05-01 | 0.397 | 0.040 |
Pôle/DMU | 8312056386 | 'All' | 2017-04-01 | 0.583 | 0.028 |
Pôle/DMU | 8312027648 | 'Urg_Hospit' | 2021-03-01 | 0.677 | 0.022 |
Pôle/DMU | 8312027648 | 'All' | 2018-08-01 | 0.764 | 0.014 |
Pôle/DMU | 8312022130 | 'Urg_Hospit' | 2022-02-01 | 0.652 | 0.027 |
3.2 Visualize your fitted Probe
Interactive dashboard
Interactive dashboards can be used to visualize the average completeness predictor $c(t)$ along with the fitted Model, for the selected care sites and stay types.
from edsteva.viz.dashboards import probe_dashboard
probe_dashboard(
probe=filtered_visit,
fitted_model=step_function_model,
)
Static plot
If you need a static plot for a report, a paper or anything else, you can use the probe_plot() function. It returns the top plot of the dashboard without the interactive filters, so you have to specify the filters in the inputs of the function.
from edsteva.viz.plots import probe_plot
plot_path = "my_path/fitted_visit.html"
stay_type = "All"
probe_plot(
probe=filtered_visit,
fitted_model=step_function_model,
care_site_level="Hospital",
stay_type=stay_type,
save_path=plot_path,  # if save_path is specified, the plot is saved to that path
)
4. Set the thresholds to fix the deployment bias
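Now that we have estimated $t_0$, $c_0$ and the error for each care site and each stay type, we can set a threshold on each estimate in order to select only the care sites where visits are available over the period of interest.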
4.1 Visualize estimates distributions
Visualizing the density plots and the medians of the estimates can help you set the threshold values.
from edsteva.viz.plots import estimates_densities_plot
estimates_densities_plot(
probe=filtered_visit,
fitted_model=step_function_model,
)
4.2 Set the thresholds
The estimates dashboard provides a representation of the overall deviation from the Model on the top, and interactive sliders on the bottom that allow you to vary the thresholds. The idea is to set the thresholds that keep the most care sites while maintaining an acceptable overall deviation.
from edsteva.viz.dashboards import estimates_dashboard
estimates_dashboard(
probe=filtered_visit,
fitted_model=step_function_model,
)
The threshold dashboard is available here.
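The trade-off between keeping care sites and tightening a threshold can also be explored without a dashboard. The sketch below is plain Python over hypothetical estimates (the values and the helper `select_care_sites` are illustrative assumptions, not library output or API); it prints which care sites survive as the minimum completeness threshold increases.

```python
from typing import Dict, List

# Hypothetical estimates for four care sites (illustrative values only).
estimates: List[Dict] = [
    {"care_site_id": 1, "t_0": "2017-04-01", "c_0": 0.58, "error": 0.028},
    {"care_site_id": 2, "t_0": "2018-08-01", "c_0": 0.76, "error": 0.014},
    {"care_site_id": 3, "t_0": "2021-03-01", "c_0": 0.68, "error": 0.022},
    {"care_site_id": 4, "t_0": "2019-05-01", "c_0": 0.40, "error": 0.040},
]


def select_care_sites(rows, t_0_max, c_0_min, error_max):
    """Return the ids of the care sites that satisfy all three thresholds."""
    return [
        row["care_site_id"]
        for row in rows
        # ISO dates (YYYY-MM-DD) compare correctly as strings.
        if row["t_0"] <= t_0_max and row["c_0"] >= c_0_min and row["error"] <= error_max
    ]


for c_0_min in (0.3, 0.5, 0.7):
    kept = select_care_sites(estimates, "2020-01-01", c_0_min, 0.05)
    print(c_0_min, kept)
```

Each stricter threshold removes care sites, which is the compromise the sliders let you explore interactively.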
4.3 Fix the deployment bias
Once you set the thresholds, you can extract for each stay type the care sites for which data availability is estimated to be stable over the entire study period.
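t_0_max = "2020-01-01"  # latest acceptable characteristic time t_0
c_0_min = 0.6  # minimum acceptable characteristic value c_0
error_max = 0.05  # maximum acceptable error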
estimates = step_function_model.estimates
selected_care_site = estimates[
(estimates["t_0"] <= t_0_max)
& (estimates["c_0"] >= c_0_min)
& (estimates["error"] <= error_max)
]
print(selected_care_site["care_site_id"].unique())
[8312056386, 8457691845, 8745619784, 8314578956, 8314548764, 8542137845]
In this example, the threshold values are illustrative and should be adapted to your own study.
Limitations
EDS-TeVa provides modelling tools to characterize the temporal variability of your data; it does not intend to provide direct methods to fix the deployment bias. As an open-source library, EDS-TeVa is also here to host a discussion in order to facilitate collective methodological convergence on flexible solutions. The default methods proposed in this example are intended to be reviewed and challenged by the user community.
Make it your own
The working example above describes the canonical usage workflow. However, you will probably need different Probes, Models, Visualizations and methods to set the thresholds for your projects. The components already available in the library are listed below, but if they do not meet your requirements, you are encouraged to create your own.
Contribution
If you have implemented your own component, or even if you have just thought of a new one, do not hesitate to share it with the community by following the contribution guidelines. Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.
Available components
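The VisitProbe computes $c(t)$, the availability of administrative data related to visits, for each care site according to time:
$$c(t) = \frac{n_{visit}(t)}{n_{max}}$$
Where $n_{visit}(t)$ is the number of visits, $n_{max} = \max_{t}(n_{visit}(t))$ and $t$ is the month.
If the maximum number of records per month $n_{max}$ is equal to 0, we consider that the completeness predictor $c(t)$ is also equal to 0.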
from edsteva.probes import VisitProbe
visit = VisitProbe()
visit.compute(
data,
stay_types={
"Urg": "urgence",
"Hospit": "hospitalisés",
"Urg_Hospit": "urgence|hospitalisés",
},
)
visit.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | date | n_visit | c |
---|---|---|---|---|---|---|
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'Urg' | 2019-05-01 | 233.0 | 0.841 |
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'Urg' | 2021-04-01 | 393.0 | 0.640 |
Pôle/DMU | 8312027648 | Care site 2 | 'Hospit' | 2017-03-01 | 204.0 | 0.497 |
Pôle/DMU | 8312027648 | Care site 2 | 'Urg' | 2018-08-01 | 22.0 | 0.274 |
Hôpital | 8312022130 | Care site 3 | 'Urg_Hospit' | 2022-02-01 | 9746.0 | 0.769 |
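The NoteProbe computes $c(t)$, the availability of clinical documents, for each care site according to time.
The per_visit_default algorithm computes the share of visits that have at least one document:
$$c(t) = \frac{n_{\text{visit with note}}(t)}{n_{visit}(t)}$$
Where $n_{visit}(t)$ is the number of visits, $n_{\text{visit with note}}(t)$ is the number of visits with at least one document and $t$ is the month.
If the number of visits $n_{visit}(t)$ is equal to 0, we consider that the completeness predictor $c(t)$ is also equal to 0.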
from edsteva.probes import NoteProbe
note = NoteProbe(completeness_predictor="per_visit_default")
note.compute(
data,
stay_types={
"Urg": "urgence",
"Hospit": "hospitalisés",
"Urg_Hospit": "urgence|hospitalisés",
},
note_types={
"All": ".*",
"CRH": "crh",
"Ordonnance": "ordo",
"CR Passage Urgences": "urge",
},
)
note.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | note_type | date | n_visit | n_visit_with_note | c |
---|---|---|---|---|---|---|---|---|
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'Urg' | 'All' | 2019-05-01 | 233.0 | 196.0 | 0.841 |
Unité Fonctionnelle (UF) | 8653815660 | Care site 1 | 'Hospit' | 'CRH' | 2017-04-01 | 393.0 | 252.0 | 0.640 |
Pôle/DMU | 8312027648 | Care site 2 | 'Hospit' | 'CRH' | 2021-03-01 | 204.0 | 101.0 | 0.497 |
Pôle/DMU | 8312056379 | Care site 2 | 'Urg' | 'Ordonnance' | 2018-08-01 | 22.0 | 6.0 | 0.274 |
Hôpital | 8312022130 | Care site 3 | 'Urg_Hospit' | 'CR Passage Urgences' | 2022-02-01 | 9746.0 | 7495.0 | 0.769 |
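The per-visit ratio can be checked by hand against rows of the table above. The following plain-Python sketch assumes that the predictor is the number of visits with at least one note divided by the number of visits, with 0 returned when there is no visit; `per_visit_completeness` is a hypothetical helper, not a library function.

```python
def per_visit_completeness(n_visit: float, n_visit_with_note: float) -> float:
    """Share of visits that have at least one note, or 0 when there is no visit."""
    if n_visit == 0:
        return 0.0
    return n_visit_with_note / n_visit


# (n_visit, n_visit_with_note) pairs taken from the first and fourth rows of the
# table above, plus an empty month to show the zero-visit convention.
rows = [(233.0, 196.0), (22.0, 6.0), (0.0, 0.0)]
for n_visit, n_with_note in rows:
    print(n_visit, n_with_note, round(per_visit_completeness(n_visit, n_with_note), 3))
```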
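The per_note_default algorithm computes the number of notes per month, normalized by its maximum over the study period:
$$c(t) = \frac{n_{note}(t)}{n_{max}}$$
Where $n_{note}(t)$ is the number of clinical documents, $n_{max} = \max_{t}(n_{note}(t))$ and $t$ is the month.
If the maximum number of recorded notes per month $n_{max}$ is equal to 0, we consider that the completeness predictor $c(t)$ is also equal to 0.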
from edsteva.probes import NoteProbe
note = NoteProbe(completeness_predictor="per_note_default")
note.compute(
data,
stay_types={
"Urg": "urgence",
"Hospit": "hospitalisés",
"Urg_Hospit": "urgence|hospitalisés",
},
note_types={
"All": ".*",
"CRH": "crh",
"Ordonnance": "ordo",
"CR Passage Urgences": "urge",
},
)
note.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | note_type | date | n_note | c |
---|---|---|---|---|---|---|---|
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'Urg' | 'All' | 2019-05-01 | 233.0 | 0.841 |
Unité Fonctionnelle (UF) | 8653815660 | Care site 1 | 'Hospit' | 'CRH' | 2017-04-01 | 393.0 | 0.640 |
Pôle/DMU | 8312027648 | Care site 2 | 'Hospit' | 'CRH' | 2021-03-01 | 204.0 | 0.497 |
Pôle/DMU | 8312056379 | Care site 2 | 'Urg' | 'Ordonnance' | 2018-08-01 | 22.0 | 0.274 |
Hôpital | 8312022130 | Care site 3 | 'Urg_Hospit' | 'CR Passage Urgences' | 2022-02-01 | 9746.0 | 0.769 |
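The ConditionProbe computes $c(t)$, the availability of claim data (diagnoses), for each care site according to time.
The per_visit_default algorithm computes the share of visits that have at least one recorded diagnosis:
$$c(t) = \frac{n_{\text{visit with condition}}(t)}{n_{visit}(t)}$$
Where $n_{visit}(t)$ is the number of visits, $n_{\text{visit with condition}}(t)$ is the number of visits with at least one recorded diagnosis and $t$ is the month.
If the number of visits $n_{visit}(t)$ is equal to 0, we consider that the completeness predictor $c(t)$ is also equal to 0.
Care site level
AREM claim data are only available at hospital level.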
from edsteva.probes import ConditionProbe
condition = ConditionProbe(completeness_predictor="per_visit_default")
condition.compute(
data,
stay_types={
"Hospit": "hospitalisés",
},
diag_types={
"All": ".*",
"DP/DR": "DP|DR",
},
condition_types={
"All": ".*",
"Pulmonary_embolism": "I26",
},
source_systems=["AREM", "ORBIS"],
)
condition.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | diag_type | condition_type | source_systems | date | n_visit | n_visit_with_condition | c |
---|---|---|---|---|---|---|---|---|---|---|
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'All' | 'Pulmonary_embolism' | AREM | 2019-05-01 | 233.0 | 196.0 | 0.841 |
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'DP/DR' | 'Pulmonary_embolism' | AREM | 2021-04-01 | 393.0 | 252.0 | 0.640 |
Hôpital | 8312027648 | Care site 2 | 'Hospit' | 'All' | 'Pulmonary_embolism' | AREM | 2017-03-01 | 204.0 | 101.0 | 0.497 |
Unité Fonctionnelle (UF) | 8312027648 | Care site 2 | 'Hospit' | 'All' | 'All' | ORBIS | 2018-08-01 | 22.0 | 6.0 | 0.274 |
Pôle/DMU | 8312022130 | Care site 3 | 'Hospit' | 'DP/DR' | 'Pulmonary_embolism' | ORBIS | 2022-02-01 | 9746.0 | 7495.0 | 0.769 |
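The per_condition_default algorithm computes the number of recorded diagnoses per month, normalized by its maximum over the study period:
$$c(t) = \frac{n_{condition}(t)}{n_{max}}$$
Where $n_{condition}(t)$ is the number of recorded diagnoses, $n_{max} = \max_{t}(n_{condition}(t))$ and $t$ is the month.
If the maximum number of recorded diagnoses per month $n_{max}$ is equal to 0, we consider that the completeness predictor $c(t)$ is also equal to 0.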
from edsteva.probes import ConditionProbe
condition = ConditionProbe(completeness_predictor="per_condition_default")
condition.compute(
data,
stay_types={
"All": ".*",
"Hospit": "hospitalisés",
},
diag_types={
"All": ".*",
"DP/DR": "DP|DR",
},
condition_types={
"All": ".*",
"Pulmonary_embolism": "I26",
},
source_systems=["AREM", "ORBIS"],
)
condition.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | diag_type | condition_type | source_systems | date | n_condition | c |
---|---|---|---|---|---|---|---|---|---|
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'All' | 'Pulmonary_embolism' | AREM | 2019-05-01 | 233.0 | 0.841 |
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'DP/DR' | 'Pulmonary_embolism' | AREM | 2021-04-01 | 393.0 | 0.640 |
Hôpital | 8312027648 | Care site 2 | 'Hospit' | 'All' | 'Pulmonary_embolism' | AREM | 2017-03-01 | 204.0 | 0.497 |
Unité Fonctionnelle (UF) | 8312027648 | Care site 2 | 'Hospit' | 'All' | 'All' | ORBIS | 2018-08-01 | 22.0 | 0.274 |
Pôle/DMU | 8312022130 | Care site 3 | 'Hospit' | 'DP/DR' | 'Pulmonary_embolism' | ORBIS | 2022-02-01 | 9746.0 | 0.769 |
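The BiologyProbe computes $c(t)$, the availability of laboratory data, for each care site according to time.
The per_visit_default algorithm computes the share of visits that have at least one biological measurement:
$$c(t) = \frac{n_{\text{visit with measurement}}(t)}{n_{visit}(t)}$$
Where $n_{visit}(t)$ is the number of visits, $n_{\text{visit with measurement}}(t)$ is the number of visits with at least one biological measurement and $t$ is the month.
If the number of visits $n_{visit}(t)$ is equal to 0, we consider that the completeness predictor $c(t)$ is also equal to 0.
Care site level
Laboratory data are only available at hospital level.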
from edsteva.probes import BiologyProbe
biology = BiologyProbe(completeness_predictor="per_visit_default")
biology.compute(
data,
stay_types={
"Hospit": "hospitalisés",
},
concepts_sets={
"Créatinine": "E3180|G1974|J1002|A7813|A0094|G1975|J1172|G7834|F9409|F9410|C0697|H4038|F2621",
"Leucocytes": "A0174|K3232|H6740|E4358|C9784|C8824|E6953",
},
)
biology.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | concepts_sets | date | n_visit | n_visit_with_measurement | c |
---|---|---|---|---|---|---|---|---|
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'Créatinine' | 2019-05-01 | 233.0 | 196.0 | 0.841 |
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'Leucocytes' | 2021-04-01 | 393.0 | 252.0 | 0.640 |
Hôpital | 8312027648 | Care site 2 | 'Hospit' | 'Créatinine' | 2017-03-01 | 204.0 | 101.0 | 0.497 |
Hôpital | 8312027648 | Care site 2 | 'Hospit' | 'Leucocytes' | 2018-08-01 | 22.0 | 6.0 | 0.274 |
Hôpital | 8312022130 | Care site 3 | 'Hospit' | 'Leucocytes' | 2022-02-01 | 9746.0 | 7495.0 | 0.769 |
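The per_measurement_default algorithm computes the number of recorded biological measurements per month, normalized by its maximum over the study period:
$$c(t) = \frac{n_{measurement}(t)}{n_{max}}$$
Where $n_{measurement}(t)$ is the number of recorded biological measurements, $n_{max} = \max_{t}(n_{measurement}(t))$ and $t$ is the month.
If the maximum number of recorded biological measurements per month $n_{max}$ is equal to 0, we consider that the completeness predictor $c(t)$ is also equal to 0.
Care site level
Laboratory data are only available at hospital level.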
from edsteva.probes import BiologyProbe
biology = BiologyProbe(completeness_predictor="per_measurement_default")
biology.compute(
data,
stay_types={
"Hospit": "hospitalisés",
},
concepts_sets={
"Créatinine": "E3180|G1974|J1002|A7813|A0094|G1975|J1172|G7834|F9409|F9410|C0697|H4038|F2621",
"Leucocytes": "A0174|K3232|H6740|E4358|C9784|C8824|E6953",
},
)
biology.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | concepts_sets | date | n_measurement | c |
---|---|---|---|---|---|---|---|
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'Créatinine' | 2019-05-01 | 233.0 | 0.841 |
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'Leucocytes' | 2021-04-01 | 393.0 | 0.640 |
Hôpital | 8312027648 | Care site 2 | 'Hospit' | 'Créatinine' | 2017-03-01 | 204.0 | 0.497 |
Unité Fonctionnelle (UF) | 8312027648 | Care site 2 | 'Hospit' | 'Leucocytes' | 2018-08-01 | 22.0 | 0.274 |
Pôle/DMU | 8312022130 | Care site 3 | 'Hospit' | 'Leucocytes' | 2022-02-01 | 9746.0 | 0.769 |
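The StepFunction fits a step function $f_{t_0, c_0}(t)$ with coefficients $\theta = (t_0, c_0)$ on a Probe:
$$f_{t_0, c_0}(t) = c_0 \, \mathbb{1}_{t \geq t_0}(t)$$
- the characteristic time $t_0$ estimates the time after which the data is available.
- the characteristic value $c_0$ estimates the stabilized routine completeness.
The default metric computed is the mean squared error after $t_0$, that is, the mean of $(f_{t_0, c_0}(t) - c(t))^2$ over the months $t \geq t_0$. This error estimates the stability of the data after $t_0$.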
Custom metric
You can define your own metric if this one doesn't meet your requirements.
The available algorithms used to fit the step function are listed below:
Custom algo
You can define your own algorithm if they don't meet your requirements.
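This algorithm computes the estimated coefficients $\hat{t}_0$ and $\hat{c}_0$ by minimizing the loss function $\mathcal{L}(t_0, c_0)$:
$$(\hat{t}_0, \hat{c}_0) = \underset{t_0, c_0}{\operatorname{argmin}} \, \mathcal{L}(t_0, c_0)$$
Default loss function
The loss function is by default the $L_2$ distance between the completeness predictor $c(t)$ and the step function $f_{t_0, c_0}(t)$:
$$\mathcal{L}(t_0, c_0) = \sum_{t} \left| f_{t_0, c_0}(t) - c(t) \right|^2$$
Optimal estimates
For complexity purposes, this algorithm has been implemented with a dependency relation between $c_0$ and $t_0$ derived from the optimal estimates using the default loss function.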
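The idea of fitting by loss minimization can be sketched in plain Python. This is not the library's implementation: it assumes a squared-error loss, integer month indices, and it relies on the fact that, for a fixed $t_0$, the value of $c_0$ that minimizes a squared-error loss is the mean of $c(t)$ over $t \geq t_0$. The function name `fit_step_function` is hypothetical.

```python
from typing import List, Tuple


def fit_step_function(c: List[float]) -> Tuple[int, float]:
    """Brute-force least-squares fit of a step function to c(t)."""
    best_t_0, best_c_0, best_loss = 0, 0.0, float("inf")
    for t_0 in range(len(c)):
        after = c[t_0:]
        # For a fixed t_0, the squared-error-optimal plateau is the mean after t_0.
        c_0 = sum(after) / len(after)
        # Before t_0 the step function is 0, so the residual is c(t) itself.
        loss = sum(x ** 2 for x in c[:t_0]) + sum((x - c_0) ** 2 for x in after)
        if loss < best_loss:
            best_t_0, best_c_0, best_loss = t_0, c_0, loss
    return best_t_0, best_c_0


# Hypothetical completeness predictor over six consecutive months.
c = [0.0, 0.1, 0.0, 0.8, 0.6, 0.7]
t_0, c_0 = fit_step_function(c)
print(t_0, round(c_0, 3))
```

Tying $c_0$ to $t_0$ in this way reduces the search to a single loop over candidate values of $t_0$.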
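In this algorithm, $\hat{c}_0$ is directly estimated as the $x$-th quantile of the completeness predictor $c(t)$, where $x$ is a number between 0 and 1. Then, $\hat{t}_0$ is the first time $c(t)$ reaches $\hat{c}_0$.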
Default quantile
The default quantile is
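A quantile-based fit can be sketched as follows. The sketch assumes that $c_0$ is taken as a quantile of the completeness predictor and that $t_0$ is the first month at which the predictor reaches it; it uses a nearest-rank quantile and integer month indices. The name `fit_step_quantile` and the value `x=0.8` are illustrative choices for this example only.

```python
import math
from typing import List, Tuple


def fit_step_quantile(c: List[float], x: float) -> Tuple[int, float]:
    """Estimate c_0 as the x-th quantile of c(t) and t_0 as the first index reaching it."""
    ordered = sorted(c)
    # Nearest-rank quantile; other interpolation rules give slightly different values.
    rank = max(1, math.ceil(x * len(ordered)))
    c_0 = ordered[rank - 1]
    # c_0 is one of the observed values, so at least one index satisfies the condition.
    t_0 = next(t for t, value in enumerate(c) if value >= c_0)
    return t_0, c_0


# Hypothetical completeness predictor over six consecutive months.
c = [0.0, 0.1, 0.0, 0.8, 0.6, 0.7]
t_0, c_0 = fit_step_quantile(c, x=0.8)
print(t_0, c_0)
```

Compared with loss minimization, this approach needs only a sort and a single pass, at the cost of depending on the chosen quantile.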
from edsteva.models.step_function import StepFunction
step_function_model = StepFunction()
step_function_model.fit(probe)
step_function_model.estimates.head()
care_site_level | care_site_id | stay_type | t_0 | c_0 | error |
---|---|---|---|---|---|
Unité Fonctionnelle (UF) | 8312056386 | 'Urg' | 2019-05-01 | 0.397 | 0.040 |
Unité Fonctionnelle (UF) | 8312056386 | 'All' | 2017-04-01 | 0.583 | 0.028 |
Pôle/DMU | 8312027648 | 'Hospit' | 2021-03-01 | 0.677 | 0.022 |
Pôle/DMU | 8312027648 | 'All' | 2018-08-01 | 0.764 | 0.014 |
Hôpital | 8312022130 | 'Hospit' | 2022-02-01 | 0.652 | 0.027 |
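The RectangleFunction fits a rectangle function $f_{t_0, c_0, t_1}(t)$ with coefficients $\theta = (t_0, c_0, t_1)$ on a Probe:
$$f_{t_0, c_0, t_1}(t) = c_0 \, \mathbb{1}_{t_0 \leq t \leq t_1}(t)$$
- the characteristic time $t_0$ estimates the time after which the data is available.
- the characteristic time $t_1$ estimates the time after which the data is not available anymore.
- the characteristic value $c_0$ estimates the completeness between $t_0$ and $t_1$.
The default metric computed is the mean squared error between $t_0$ and $t_1$, that is, the mean of $(f_{t_0, c_0, t_1}(t) - c(t))^2$ over the months $t_0 \leq t \leq t_1$. This error estimates the stability of the data between $t_0$ and $t_1$.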
Custom metric
You can define your own metric if this one doesn't meet your requirements.
The available algorithms used to fit the rectangle function are listed below:
Custom algo
You can define your own algorithm if they don't meet your requirements.
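This algorithm computes the estimated coefficients $\hat{t}_0$, $\hat{c}_0$ and $\hat{t}_1$ by minimizing the loss function $\mathcal{L}(t_0, c_0, t_1)$:
$$(\hat{t}_0, \hat{c}_0, \hat{t}_1) = \underset{t_0, c_0, t_1}{\operatorname{argmin}} \, \mathcal{L}(t_0, c_0, t_1)$$
Default loss function
The loss function is by default the $L_2$ distance between the completeness predictor $c(t)$ and the rectangle function $f_{t_0, c_0, t_1}(t)$:
$$\mathcal{L}(t_0, c_0, t_1) = \sum_{t} \left| f_{t_0, c_0, t_1}(t) - c(t) \right|^2$$
Optimal estimates
For complexity purposes, this algorithm has been implemented with a dependency relation between $c_0$ and $(t_0, t_1)$ derived from the optimal estimates using the default loss function.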
from edsteva.models.rectangle_function import RectangleFunction
rectangle_function_model = RectangleFunction()
rectangle_function_model.fit(probe)
rectangle_function_model.estimates.head()
care_site_level | care_site_id | stay_type | t_0 | c_0 | t_1 | error |
---|---|---|---|---|---|---|
Unité Fonctionnelle (UF) | 8312056386 | 'Urg' | 2019-05-01 | 0.397 | 2020-05-01 | 0.040 |
Unité Fonctionnelle (UF) | 8312056386 | 'All' | 2017-04-01 | 0.583 | 2013-04-01 | 0.028 |
Pôle/DMU | 8312027648 | 'Hospit' | 2021-03-01 | 0.677 | 2022-03-01 | 0.022 |
Pôle/DMU | 8312027648 | 'All' | 2018-08-01 | 0.764 | 2019-08-01 | 0.014 |
Hôpital | 8312022130 | 'Hospit' | 2022-02-01 | 0.652 | 2022-08-01 | 0.027 |
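For intuition, a rectangle function can be fitted by brute force in plain Python. This is a sketch and not the library's algorithm: it assumes a squared-error loss, integer month indices, a plateau that includes both $t_0$ and $t_1$, and a plateau value equal to the mean of $c(t)$ inside the window. The name `fit_rectangle_function` is hypothetical.

```python
from typing import List, Tuple


def fit_rectangle_function(c: List[float]) -> Tuple[int, float, int]:
    """Brute-force least-squares fit of a rectangle: c_0 on [t_0, t_1], 0 elsewhere."""
    best = (0, 0.0, 0)
    best_loss = float("inf")
    for t_0 in range(len(c)):
        for t_1 in range(t_0, len(c)):
            inside = c[t_0 : t_1 + 1]
            # For a fixed window, the squared-error-optimal plateau is the window mean.
            c_0 = sum(inside) / len(inside)
            outside = c[:t_0] + c[t_1 + 1 :]
            loss = sum(x ** 2 for x in outside) + sum((x - c_0) ** 2 for x in inside)
            if loss < best_loss:
                best, best_loss = (t_0, c_0, t_1), loss
    return best


# Hypothetical predictor: data appears at month 2 and disappears after month 4.
c = [0.0, 0.1, 0.7, 0.9, 0.8, 0.1, 0.0]
t_0, c_0, t_1 = fit_rectangle_function(c)
print(t_0, round(c_0, 3), t_1)
```

The double loop is quadratic in the number of months, which is acceptable for a few years of monthly data but illustrates why a closed-form relation between the coefficients is useful.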
The library provides interactive dashboards that let you set any combination of care sites, stay types and other columns if included in the Probe. You can only export a dashboard in HTML format.
The probe_dashboard() function returns:
- On the top, the aggregated variable is the average completeness predictor $c(t)$ over time $t$, with the prediction if the fitted Model is specified.
- On the bottom, the interactive filters are all the columns included in the Probe (such as time, care site, number of visits, etc.).
from edsteva.viz.dashboards import probe_dashboard
probe_dashboard(
probe=probe,
fitted_model=step_function_model,
care_site_level=care_site_level,
)
The normalized_probe_dashboard() function returns a representation of the overall deviation from the Model:
- On the top, the aggregated variable is a normalized completeness predictor $\frac{c(t)}{c_0}$ over normalized time $t - t_0$.
- On the bottom, the interactive filters are all the columns included in the Probe (such as time, care site, number of visits, etc.), together with all the coefficients and metrics included in the Model.
from edsteva.viz.dashboards import normalized_probe_dashboard
normalized_probe_dashboard(
probe=probe,
fitted_model=step_function_model,
care_site_level=care_site_level,
)
An example is available here.
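The normalization behind this view can be illustrated with a short plain-Python sketch. It assumes that the completeness predictor is divided by the estimated $c_0$ and that time is shifted by the estimated $t_0$, so that care sites with different deployment dates and plateau levels can be overlaid; the helper `normalize` and the integer month indices are assumptions made for this example.

```python
from typing import List, Tuple


def normalize(c: List[float], t_0: int, c_0: float) -> List[Tuple[int, float]]:
    """Shift time by t_0 and divide the completeness predictor by c_0."""
    if c_0 == 0:
        # Guard against division by zero when no plateau was estimated.
        return [(t - t_0, 0.0) for t in range(len(c))]
    return [(t - t_0, value / c_0) for t, value in enumerate(c)]


# Hypothetical predictor and estimates for one care site.
c = [0.0, 0.1, 0.0, 0.8, 0.6, 0.7]
normalized = normalize(c, t_0=3, c_0=0.8)
print(normalized)
```

After normalization, a care site that follows the Model closely stays near 0 before month 0 and near 1 afterwards.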
The library provides static plots that you can export in png or svg. As they are less interactive, you have to specify the filters in the inputs of the functions.
The probe_plot() function returns the top plot of probe_dashboard(): the average completeness predictor $c(t)$ over time $t$, with the prediction if the fitted Model is specified.
from edsteva.viz.plots import probe_plot
probe_plot(
probe=probe,
fitted_model=step_function_model,
care_site_level=care_site_level,
stay_type=stay_type,
save_path=plot_path,
)
The normalized_probe_plot() function returns the top plot of normalized_probe_dashboard(). Consequently, you have to specify the filters in the inputs of the function.
from edsteva.viz.plots import normalized_probe_plot
normalized_probe_plot(
probe=probe,
fitted_model=step_function_model,
t_min=-15,
t_max=15,
save_path=plot_path,
)
The estimates_densities_plot() function returns the density plot and the median of each estimate. It can help you set the thresholds.
from edsteva.viz.plots import estimates_densities_plot
estimates_densities_plot(
fitted_model=step_function_model,
)
[1] Samuel G. Finlayson, Adarsh Subbaswamy, Karandeep Singh, John Bowers, Annabel Kupke, Jonathan Zittrain, Isaac S. Kohane, and Suchi Saria. The clinician and dataset shift in artificial intelligence. The New England Journal of Medicine, 385(3):283, 2021.