Documentation: https://aphp.github.io/edsteva/latest/
Source Code: https://github.com/aphp/edsteva
Getting Started
EDS-TeVa provides a set of tools to characterize the temporal variability of data induced by the dynamics of the clinical IT system.
Context
Real-world data is subject to important temporal drifts that may be caused by a variety of factors [1]. In particular, data availability fluctuates with the deployment of clinical software and its clinical use. The dynamics of software deployment and adoption are not trivial, as they depend on the care site and on the category of data considered.
Installation
Requirements
EDS-TeVa stands on the shoulders of Spark 2.4, which runs on Java 8 and Python ~3.7.1. It is therefore essential to:
- Install a version of Python \(\geq 3.7.1\) and \(< 3.8\).
- Install OpenJDK 8, an open-source reference implementation of Java 8, with the following command lines:
$ sudo apt-get update
$ sudo apt-get install openjdk-8-jdk
For more details, check this installation guide
$ brew tap AdoptOpenJDK/openjdk
$ brew install --cask adoptopenjdk8
For more details, check this installation guide
Follow this installation guide
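The Python constraint listed in the requirements can be checked before installing. The snippet below is a minimal sketch; the helper name `is_supported_python` is hypothetical and not part of EDS-TeVa.

```python
import sys


def is_supported_python(version_info):
    """Return True when the version lies in [3.7.1, 3.8), the range stated above."""
    return (3, 7, 1) <= tuple(version_info[:3]) < (3, 8)


# Check the interpreter that is currently running.
print(is_supported_python(sys.version_info))
```

If this prints `False`, switch to a Python 3.7 environment before installing the library.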
You can install EDS-TeVa through pip:
$ pip install edsteva
Successfully installed edsteva
We recommend pinning the library version in your projects, or using a strict package manager like Poetry.
pip install edsteva==0.2.8
Working example: administrative records relative to visits
Let's consider a basic category of data: administrative records relative to visits. A visit is characterized by a care site, a length of stay, a stay type (full hospitalisation, emergency, consultation, etc.) and other characteristics. In this example, the objective is to estimate the availability of visit records with respect to time, care site and stay type.
1. Load your data
As detailed in the dedicated section, EDS-TeVa expects to work with Pandas or Koalas DataFrames. We provide various connectors to facilitate data fetching, namely a Hive connector, a Postgres connector and a LocalData connector.
from edsteva.io import HiveData
db_name = "my_db"
tables_to_load = [
"visit_occurrence",
"visit_detail",
"care_site",
"fact_relationship",
]
data = HiveData(db_name, tables_to_load=tables_to_load)
data.visit_occurrence # (1)
- (1) With this connector, visit_occurrence will be a Koalas DataFrame.
from edsteva.io import PostgresData
db_name = "my_db"
schema = "my_schema"
user = "my_username"
data = PostgresData(db_name, schema=schema, user=user) # (1)
data.visit_occurrence # (2)
- (1) This connector expects a .pgpass file storing the connection parameters.
- (2) With this connector, visit_occurrence will be a Pandas DataFrame.
import os
from edsteva.io import LocalData
folder = os.path.abspath(MY_FOLDER_PATH)
data = LocalData(folder) # (1)
data.visit_occurrence # (2)
- (1) This connector expects a folder with one file per table to load.
- (2) With this connector, visit_occurrence will be a Pandas DataFrame.
2. Choose a Probe or create a new Probe
Probe
A Probe is a python class designed to compute a completeness predictor \(c(t)\) that characterizes data availability of a target variable over time \(t\).
In this example, \(c(t)\) predicts the availability of administrative records relative to visits. It is defined for each characteristic (care site, stay type, age range, length of stay, etc.) as the number of visits \(n_{visit}(t)\) per month \(t\), normalized by the maximum number of records per month \(n_{max} = \max_{t}(n_{visit}(t))\) computed over the entire study period:
$$ c(t) = \frac{n_{visit}(t)}{n_{max}} $$
If the maximum number of records per month \(n_{max}\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.
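The definition above, including the \(n_{max} = 0\) convention, can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the library's implementation: the function name `completeness_predictor` and the monthly counts are made up.

```python
def completeness_predictor(n_visit_by_month):
    """Compute c(t) = n_visit(t) / n_max for each month t."""
    n_max = max(n_visit_by_month.values(), default=0)
    if n_max == 0:
        # Convention from the text: c(t) is 0 when n_max is 0.
        return {month: 0.0 for month in n_visit_by_month}
    return {month: n / n_max for month, n in n_visit_by_month.items()}


# Hypothetical monthly visit counts for a single care site and stay type.
monthly_visits = {"2019-01": 50, "2019-02": 100, "2019-03": 75}
print(completeness_predictor(monthly_visits))
# {'2019-01': 0.5, '2019-02': 1.0, '2019-03': 0.75}
```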
The VisitProbe is already available by default in the library.
2.1 Compute your Probe
The compute() method takes a Data object as input and stores the computed completeness predictor \(c(t)\) in the predictor attribute of a Probe:
from edsteva.probes import VisitProbe
probe_path = "my_path/visit.pkl"
visit = VisitProbe()
visit.compute(
data,
care_site_levels=["Hospital", "Pole", "UF"], # (1)
stay_types={
"All": ".*",
"Urg_Hospit": "urgence|hospitalisés", # (2)
},
care_site_specialties=None, # (3)
stay_sources=None, # (4)
length_of_stays=None, # (5)
provenance_sources=None, # (6)
age_ranges=None, # (7)
)
visit.save(path=probe_path) # (8)
visit.predictor.head()
- (1) The care sites are organized into levels (cf. AP-HP's reference structure). Here, as an example, we are only interested in those three levels.
- (2) The stay_types argument expects a Python dictionary with labels as keys and regular expressions as values.
- (3) In this example we want to ignore the care site specialty (e.g., Cardiology, Pediatrics).
- (4) In this example we want to ignore the stay source (e.g., MCO, SSR, PSY).
- (5) In this example we want to ignore the length of stay (e.g., \(\geq\) 7 days, \(\leq\) 2 days).
- (6) In this example we want to ignore the provenance source (e.g., service d'urgence, d'une unité de soins de courte durée).
- (7) In this example we want to ignore the age range (e.g., 0-18 years, 18-25 years, 25-30 years).
- (8) Saving the Probe after computation saves you from having to compute it again. You can then reload it with VisitProbe.load(path=probe_path).
Saved to /my_path/visit.pkl
care_site_level | care_site_id | care_site_short_name | stay_type | date | n_visit | c |
---|---|---|---|---|---|---|
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'Urg_Hospit' | 2019-05-01 | 233.0 | 0.841 |
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'All' | 2021-04-01 | 393.0 | 0.640 |
Pôle/DMU | 8312027648 | Care site 2 | 'Urg_Hospit' | 2017-03-01 | 204.0 | 0.497 |
Pôle/DMU | 8312027648 | Care site 2 | 'All' | 2018-08-01 | 22.0 | 0.274 |
Hôpital | 8312022130 | Care site 3 | 'Urg_Hospit' | 2022-02-01 | 9746.0 | 0.769 |
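To make the stay_types dictionary used above more concrete, the sketch below shows how labels mapped to regular expressions can select raw stay type values. It assumes matching is done with a regex search, which may differ from what the library does internally. The helper `matching_labels` and the raw values are hypothetical.

```python
import re

# Same dictionary as in the compute() call above: label -> regular expression.
stay_types = {
    "All": ".*",
    "Urg_Hospit": "urgence|hospitalisés",
}


def matching_labels(raw_stay_type, stay_types):
    """Return the labels whose regex is found in the raw stay type value."""
    return [
        label
        for label, pattern in stay_types.items()
        if re.search(pattern, raw_stay_type)
    ]


# Hypothetical raw values as they might appear in the visit table.
print(matching_labels("urgence", stay_types))               # ['All', 'Urg_Hospit']
print(matching_labels("consultation externe", stay_types))  # ['All']
```

A single visit can therefore contribute to several labels, which is why the "All" label appears alongside more specific ones in the predictor table.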
2.2 Filter your Probe
In this example, we are interested in three hospitals. We consequently filter data before any further analysis.
from edsteva.probes import VisitProbe
care_site_short_name = ["Hôpital-1", "Hôpital-2", "Hôpital-3"]
filtered_visit = VisitProbe()
filtered_visit.load(path=probe_path)
filtered_visit.filter_care_site(care_site_short_names=care_site_short_name) # (1)
- (1) A dedicated method filters care sites. It also includes all upper-level and lower-level care sites related to the selected care sites.
2.3 Visualize your Probe
Interactive dashboard
Interactive dashboards can be used to visualize the average completeness predictor \(c(t)\) of the selected care sites and stay types.
from edsteva.viz.dashboards import probe_dashboard
probe_dashboard(
probe=filtered_visit,
)
Static plot
If you need a static plot for a report, a paper or anything else, you can use the probe_plot() function. It returns the top plot of the dashboard without the interactive filters. Consequently, you have to specify the filters in the inputs of the function.
from edsteva.viz.plots import probe_plot
plot_path = "my_path/visit.html"
stay_type = "All"
probe_plot(
probe=filtered_visit,
care_site_level="Hospital",
stay_type=stay_type,
save_path=plot_path, # (1)
)
- (1) If a save_path is specified, the plot is saved to that path.
3. Choose a Model or create a new Model
A Model is a Python class designed to fit a function \(f_\Theta(t)\) to each completeness predictor \(c(t)\) of a Probe. The fit process estimates the coefficients \(\Theta\) and computes metrics that characterize the temporal variability of data availability.
In this example, the model fits a step function \(f_{t_0, c_0}(t)\) to the completeness predictor \(c(t)\) with coefficients \(\Theta = (t_0, c_0)\):
$$ f_{t_0, c_0}(t) = c_0 \ \mathbb{1}_{t \geq t_0}(t) $$
- the characteristic time \(t_0\) estimates the time after which the data is available.
- the characteristic value \(c_0\) estimates the stabilized routine completeness.
It also computes an \(error\) metric, the mean squared error between \(c(t)\) and the fitted step function after \(t_0\), which estimates the stability of the data after \(t_0\).
This step function Model is available in the library.
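As an illustration of the two coefficients and the error metric described above, here is a minimal sketch in plain Python. It assumes months are represented by integer indices and that the error is a plain mean of squared residuals for \(t \geq t_0\). The library's exact normalization may differ, and the function names are hypothetical.

```python
def step_function(t, t_0, c_0):
    """Evaluate f_{t_0, c_0}(t): 0 before t_0, c_0 from t_0 onwards."""
    return c_0 if t >= t_0 else 0.0


def error_after_t0(c, t_0, c_0):
    """Mean squared residual between c(t) and the step function for t >= t_0."""
    squared = [(c[t] - step_function(t, t_0, c_0)) ** 2 for t in range(t_0, len(c))]
    return sum(squared) / len(squared) if squared else 0.0


# Hypothetical completeness predictor over five consecutive months.
c = [0.0, 0.1, 0.8, 0.9, 0.7]
print(error_after_t0(c, t_0=2, c_0=0.8))
```

A small error means that \(c(t)\) stays close to \(c_0\) once the data has become available.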
3.1 Fit your Model
The fit method takes a Probe as input. It estimates the coefficients, for example by minimizing a quadratic loss function, and computes the metrics. Finally, it stores the estimated coefficients and the computed metrics in the estimates attribute of the Model.
from edsteva.models.step_function import StepFunction
model_path = "my_path/fitted_visit.pkl"
step_function_model = StepFunction()
step_function_model.fit(probe=filtered_visit)
step_function_model.save(model_path) # (1)
step_function_model.estimates.head()
- (1) Saving the Model after fitting saves you from having to fit it again. You can then reload it with StepFunction.load(path=model_path).
Saved to /my_path/fitted_visit.pkl
care_site_level | care_site_id | stay_type | t_0 | c_0 | error |
---|---|---|---|---|---|
Pôle/DMU | 8312056386 | 'Urg_Hospit' | 2019-05-01 | 0.397 | 0.040 |
Pôle/DMU | 8312056386 | 'All' | 2017-04-01 | 0.583 | 0.028 |
Pôle/DMU | 8312027648 | 'Urg_Hospit' | 2021-03-01 | 0.677 | 0.022 |
Pôle/DMU | 8312027648 | 'All' | 2018-08-01 | 0.764 | 0.014 |
Pôle/DMU | 8312022130 | 'Urg_Hospit' | 2022-02-01 | 0.652 | 0.027 |
3.2 Visualize your fitted Probe
Interactive dashboard
Interactive dashboards can be used to visualize the average completeness predictor \(c(t)\) along with the fitted step function of the selected care sites and stay types.
from edsteva.viz.dashboards import probe_dashboard
probe_dashboard(
probe=filtered_visit,
fitted_model=step_function_model,
)
Static plot
If you need a static plot for a report, a paper or anything else, you can use the probe_plot() function. It returns the top plot of the dashboard without the interactive filters. Consequently, you have to specify the filters in the inputs of the function.
from edsteva.viz.plots import probe_plot
plot_path = "my_path/fitted_visit.html"
stay_type = "All"
probe_plot(
probe=filtered_visit,
fitted_model=step_function_model,
care_site_level="Hospital",
stay_type=stay_type,
save_path=plot_path, # (1)
)
- (1) If a save_path is specified, the plot is saved to that path.
4. Set the thresholds to fix the deployment bias
Now that we have estimated \(t_0\), \(c_0\) and \(error\) for each care site and each stay type, we can set a threshold for each estimate in order to select only the care sites where the visits are available over the period of interest.
4.1 Visualize estimates distributions
Visualizing the density plots and the medians of the estimates can help you set the threshold values.
from edsteva.viz.plots import estimates_densities_plot
estimates_densities_plot(
probe=filtered_visit,
fitted_model=step_function_model,
)
4.2 Set the thresholds
The estimates dashboard provides a representation of the overall deviation from the Model on the top and interactive sliders on the bottom that allow you to vary the thresholds. The idea is to set the thresholds that keep the most care sites while having an acceptable overall deviation.
from edsteva.viz.dashboards import estimates_dashboard
estimates_dashboard(
probe=filtered_visit,
fitted_model=step_function_model,
)
The threshold dashboard is available here.
4.3 Fix the deployment bias
Once you set the thresholds, you can extract for each stay type the care sites for which data availability is estimated to be stable over the entire study period.
t_0_max = "2020-01-01" # (1)
c_0_min = 0.6 # (2)
error_max = 0.05 # (3)
estimates = step_function_model.estimates
selected_care_site = estimates[
(estimates["t_0"] <= t_0_max)
& (estimates["c_0"] >= c_0_min)
& (estimates["error"] <= error_max)
]
print(selected_care_site["care_site_id"].unique())
- (1) In this example the study period starts on January 1, 2020.
- (2) The characteristic value \(c_0\) estimates the stabilized routine completeness. As we want the selected care sites to have a good completeness after \(t_0\), one can for example set the threshold around the median (cf. distribution) to keep half of the care sites with the highest completeness after \(t_0\).
- (3) \(error\) estimates the stability of the data after \(t_0\). As we want the selected care sites to be stable after \(t_0\), one can set the threshold around the median (cf. distribution) to keep half of the care sites with the lowest error after \(t_0\).
[8312056386, 8457691845, 8745619784, 8314578956, 8314548764, 8542137845]
In this example, the \(c_0\) and \(error\) thresholds have been set around the median (cf. distribution). However, this choice is arbitrary and you have to find the appropriate method for your study with the help of the estimates dashboard.
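The median-based choice described above can be sketched without any dependency. The estimates below are invented for illustration, and the helper `select_care_sites` is hypothetical. Dates are ISO strings, so they compare correctly as text.

```python
from statistics import median


def select_care_sites(estimates, t_0_max):
    """Keep care sites available before t_0_max, with c_0 and error at least as good as the median."""
    c_0_min = median(row["c_0"] for row in estimates)
    error_max = median(row["error"] for row in estimates)
    return [
        row["care_site_id"]
        for row in estimates
        if row["t_0"] <= t_0_max
        and row["c_0"] >= c_0_min
        and row["error"] <= error_max
    ]


# Hypothetical estimates, one row per care site.
estimates = [
    {"care_site_id": 1, "t_0": "2017-04-01", "c_0": 0.58, "error": 0.03},
    {"care_site_id": 2, "t_0": "2018-08-01", "c_0": 0.76, "error": 0.01},
    {"care_site_id": 3, "t_0": "2019-05-01", "c_0": 0.40, "error": 0.04},
    {"care_site_id": 4, "t_0": "2021-03-01", "c_0": 0.68, "error": 0.02},
    {"care_site_id": 5, "t_0": "2019-11-01", "c_0": 0.65, "error": 0.02},
]
print(select_care_sites(estimates, t_0_max="2020-01-01"))  # [2, 5]
```

Because the three conditions are combined, fewer than half of the care sites usually remain, even though each threshold alone keeps about half of them.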
Limitations
EDS-TeVa provides modelling tools to characterize the temporal variability of your data. It does not intend to provide direct methods to fix the deployment bias. As an open-source library, EDS-TeVa is also here to host a discussion in order to facilitate collective methodological convergence on flexible solutions. The default methods proposed in this example are intended to be reviewed and challenged by the user community.
Make it your own
The working example above describes the canonical usage workflow. However, you will probably need different Probes, Models, Visualizations and methods to set the thresholds for your projects. The components already available in the library are listed below, but if they don't meet your requirements, you are encouraged to create your own.
Contribution
If you managed to implement your own component, or even if you just thought about a new component, do not hesitate to share it with the community by following the contribution guidelines. Contributions are welcome and greatly appreciated! Every little bit helps, and credit will always be given.
Available components
The VisitProbe computes \(c_{visit}(t)\), the availability of administrative stays:
$$ c_{visit}(t) = \frac{n_{visit}(t)}{n_{max}} $$
Where \(n_{visit}(t)\) is the number of administrative stays, \(t\) is the month and \(n_{max} = \max_{t}(n_{visit}(t))\).
If the maximum number of records per month \(n_{max}\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.
from edsteva.probes import VisitProbe
visit = VisitProbe()
visit.compute(
data,
stay_types={
"Urg": "urgence",
"Hospit": "hospitalisés",
"Urg_Hospit": "urgence|hospitalisés",
},
)
visit.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | date | n_visit | c |
---|---|---|---|---|---|---|
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'Urg' | 2019-05-01 | 233.0 | 0.841 |
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'Urg' | 2021-04-01 | 393.0 | 0.640 |
Pôle/DMU | 8312027648 | Care site 2 | 'Hospit' | 2017-03-01 | 204.0 | 0.497 |
Pôle/DMU | 8312027648 | Care site 2 | 'Urg' | 2018-08-01 | 22.0 | 0.274 |
Hôpital | 8312022130 | Care site 3 | 'Urg_Hospit' | 2022-02-01 | 9746.0 | 0.769 |
The NoteProbe computes \(c_{note}(t)\), the availability of clinical documents:
The per_visit_default algorithm computes \(c(t)\), the availability of clinical documents linked to patients' administrative stays:
$$ c(t) = \frac{n_{with\,doc}(t)}{n_{visit}(t)} $$
Where \(n_{visit}(t)\) is the number of administrative stays, \(n_{with\,doc}\) the number of visits having at least one document and \(t\) is the month.
If the number of visits \(n_{visit}(t)\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.
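The per-visit definition, together with its zero-visit convention, can be illustrated as follows. This is a sketch of the ratio only, with a hypothetical function name. The counts 233 and 196 are taken from the first row of the example table below.

```python
def per_visit_completeness(n_visit, n_visit_with_doc):
    """Share of visits in a month that have at least one document."""
    if n_visit == 0:
        # Convention from the text: c(t) is 0 when there is no visit.
        return 0.0
    return n_visit_with_doc / n_visit


print(round(per_visit_completeness(233, 196), 3))  # 0.841
print(per_visit_completeness(0, 0))                # 0.0
```

Unlike the per-note variant, this ratio does not depend on a maximum computed over the whole study period, so a month's value is unaffected by other months.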
from edsteva.probes import NoteProbe
note = NoteProbe(completeness_predictor="per_visit_default")
note.compute(
data,
stay_types={
"Urg": "urgence",
"Hospit": "hospitalisés",
"Urg_Hospit": "urgence|hospitalisés",
},
note_types={
"All": ".*",
"CRH": "crh",
"Ordonnance": "ordo",
"CR Passage Urgences": "urge",
},
)
note.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | note_type | date | n_visit | n_visit_with_note | c |
---|---|---|---|---|---|---|---|---|
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'Urg' | 'All' | 2019-05-01 | 233.0 | 196.0 | 0.841 |
Unité Fonctionnelle (UF) | 8653815660 | Care site 1 | 'Hospit' | 'CRH' | 2017-04-01 | 393.0 | 252.0 | 0.640 |
Pôle/DMU | 8312027648 | Care site 2 | 'Hospit' | 'CRH' | 2021-03-01 | 204.0 | 101.0 | 0.497 |
Pôle/DMU | 8312056379 | Care site 2 | 'Urg' | 'Ordonnance' | 2018-08-01 | 22.0 | 6.0 | 0.274 |
Hôpital | 8312022130 | Care site 3 | 'Urg_Hospit' | 'CR Passage Urgences' | 2022-02-01 | 9746.0 | 7495.0 | 0.769 |
The per_note_default algorithm computes \(c(t)\), the availability of clinical documents, as follows:
$$ c(t) = \frac{n_{note}(t)}{n_{max}} $$
Where \(n_{note}(t)\) is the number of clinical documents, \(t\) is the month and \(n_{max} = \max_{t}(n_{note}(t))\).
If the maximum number of recorded notes per month \(n_{max}\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.
from edsteva.probes import NoteProbe
note = NoteProbe(completeness_predictor="per_note_default")
note.compute(
data,
stay_types={
"Urg": "urgence",
"Hospit": "hospitalisés",
"Urg_Hospit": "urgence|hospitalisés",
},
note_types={
"All": ".*",
"CRH": "crh",
"Ordonnance": "ordo",
"CR Passage Urgences": "urge",
},
)
note.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | note_type | date | n_note | c |
---|---|---|---|---|---|---|---|
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'Urg' | 'All' | 2019-05-01 | 233.0 | 0.841 |
Unité Fonctionnelle (UF) | 8653815660 | Care site 1 | 'Hospit' | 'CRH' | 2017-04-01 | 393.0 | 0.640 |
Pôle/DMU | 8312027648 | Care site 2 | 'Hospit' | 'CRH' | 2021-03-01 | 204.0 | 0.497 |
Pôle/DMU | 8312056379 | Care site 2 | 'Urg' | 'Ordonnance' | 2018-08-01 | 22.0 | 0.274 |
Hôpital | 8312022130 | Care site 3 | 'Urg_Hospit' | 'CR Passage Urgences' | 2022-02-01 | 9746.0 | 0.769 |
The ConditionProbe computes \(c_{condition}(t)\), the availability of claim data:
The per_visit_default algorithm computes \(c(t)\), the availability of claim data linked to patients' administrative stays:
$$ c(t) = \frac{n_{with\,condition}(t)}{n_{visit}(t)} $$
Where \(n_{visit}(t)\) is the number of administrative stays, \(n_{with\,condition}\) the number of stays having at least one claim code (e.g. ICD-10) recorded and \(t\) is the month.
If the number of visits \(n_{visit}(t)\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.
Care site level
AREM claim data are only available at hospital level.
from edsteva.probes import ConditionProbe
condition = ConditionProbe(completeness_predictor="per_visit_default")
condition.compute(
data,
stay_types={
"Hospit": "hospitalisés",
},
diag_types={
"All": ".*",
"DP/DR": "DP|DR",
},
condition_types={
"All": ".*",
"Pulmonary_embolism": "I26",
},
source_systems=["AREM", "ORBIS"],
)
condition.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | diag_type | condition_type | source_systems | date | n_visit | n_visit_with_condition | c |
---|---|---|---|---|---|---|---|---|---|---|
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'All' | 'Pulmonary_embolism' | AREM | 2019-05-01 | 233.0 | 196.0 | 0.841 |
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'DP/DR' | 'Pulmonary_embolism' | AREM | 2021-04-01 | 393.0 | 252.0 | 0.640 |
Hôpital | 8312027648 | Care site 2 | 'Hospit' | 'All' | 'Pulmonary_embolism' | AREM | 2017-03-01 | 204.0 | 101.0 | 0.497 |
Unité Fonctionnelle (UF) | 8312027648 | Care site 2 | 'Hospit' | 'All' | 'All' | ORBIS | 2018-08-01 | 22.0 | 6.0 | 0.274 |
Pôle/DMU | 8312022130 | Care site 3 | 'Hospit' | 'DP/DR' | 'Pulmonary_embolism' | ORBIS | 2022-02-01 | 9746.0 | 7495.0 | 0.769 |
The per_condition_default algorithm computes \(c(t)\), the availability of claim data, as follows:
$$ c(t) = \frac{n_{condition}(t)}{n_{max}} $$
Where \(n_{condition}(t)\) is the number of claim codes (e.g. ICD-10) recorded, \(t\) is the month and \(n_{max} = \max_{t}(n_{condition}(t))\).
If the maximum number of recorded diagnoses per month \(n_{max}\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.
from edsteva.probes import ConditionProbe
condition = ConditionProbe(completeness_predictor="per_condition_default")
condition.compute(
data,
stay_types={
"All": ".*",
"Hospit": "hospitalisés",
},
diag_types={
"All": ".*",
"DP/DR": "DP|DR",
},
condition_types={
"All": ".*",
"Pulmonary_embolism": "I26",
},
source_systems=["AREM", "ORBIS"],
)
condition.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | diag_type | condition_type | source_systems | date | n_condition | c |
---|---|---|---|---|---|---|---|---|---|
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'All' | 'Pulmonary_embolism' | AREM | 2019-05-01 | 233.0 | 0.841 |
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'DP/DR' | 'Pulmonary_embolism' | AREM | 2021-04-01 | 393.0 | 0.640 |
Hôpital | 8312027648 | Care site 2 | 'Hospit' | 'All' | 'Pulmonary_embolism' | AREM | 2017-03-01 | 204.0 | 0.497 |
Unité Fonctionnelle (UF) | 8312027648 | Care site 2 | 'Hospit' | 'All' | 'All' | ORBIS | 2018-08-01 | 22.0 | 0.274 |
Pôle/DMU | 8312022130 | Care site 3 | 'Hospit' | 'DP/DR' | 'Pulmonary_embolism' | ORBIS | 2022-02-01 | 9746.0 | 0.769 |
The BiologyProbe computes \(c(t)\), the availability of laboratory data:
The per_visit_default algorithm computes \(c(t)\), the availability of laboratory data linked to patients' administrative stays:
$$ c(t) = \frac{n_{with\,biology}(t)}{n_{visit}(t)} $$
Where \(n_{visit}(t)\) is the number of administrative stays, \(n_{with\,biology}\) the number of stays having at least one biological measurement recorded and \(t\) is the month.
If the number of visits \(n_{visit}(t)\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.
Care site level
Laboratory data are only available at hospital level.
from edsteva.probes import BiologyProbe
biology = BiologyProbe(completeness_predictor="per_visit_default")
biology.compute(
data,
stay_types={
"Hospit": "hospitalisés",
},
concepts_sets={
"Créatinine": "E3180|G1974|J1002|A7813|A0094|G1975|J1172|G7834|F9409|F9410|C0697|H4038|F2621",
"Leucocytes": "A0174|K3232|H6740|E4358|C9784|C8824|E6953",
},
)
biology.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | concepts_sets | date | n_visit | n_visit_with_measurement | c |
---|---|---|---|---|---|---|---|---|
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'Créatinine' | 2019-05-01 | 233.0 | 196.0 | 0.841 |
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'Leucocytes' | 2021-04-01 | 393.0 | 252.0 | 0.640 |
Hôpital | 8312027648 | Care site 2 | 'Hospit' | 'Créatinine' | 2017-03-01 | 204.0 | 101.0 | 0.497 |
Hôpital | 8312027648 | Care site 2 | 'Hospit' | 'Leucocytes' | 2018-08-01 | 22.0 | 6.0 | 0.274 |
Hôpital | 8312022130 | Care site 3 | 'Hospit' | 'Leucocytes' | 2022-02-01 | 9746.0 | 7495.0 | 0.769 |
The per_measurement_default algorithm computes \(c(t)\), the availability of biological measurements:
$$ c(t) = \frac{n_{biology}(t)}{n_{max}} $$
Where \(n_{biology}(t)\) is the number of biological measurements, \(t\) is the month and \(n_{max} = \max_{t}(n_{biology}(t))\).
If the maximum number of recorded biological measurements per month \(n_{max}\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.
Care site level
Laboratory data are only available at hospital level.
from edsteva.probes import BiologyProbe
biology = BiologyProbe(completeness_predictor="per_measurement_default")
biology.compute(
data,
stay_types={
"Hospit": "hospitalisés",
},
concepts_sets={
"Créatinine": "E3180|G1974|J1002|A7813|A0094|G1975|J1172|G7834|F9409|F9410|C0697|H4038|F2621",
"Leucocytes": "A0174|K3232|H6740|E4358|C9784|C8824|E6953",
},
)
biology.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | concepts_sets | date | n_measurement | c |
---|---|---|---|---|---|---|---|
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'Créatinine' | 2019-05-01 | 233.0 | 0.841 |
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'Leucocytes' | 2021-04-01 | 393.0 | 0.640 |
Hôpital | 8312027648 | Care site 2 | 'Hospit' | 'Créatinine' | 2017-03-01 | 204.0 | 0.497 |
Unité Fonctionnelle (UF) | 8312027648 | Care site 2 | 'Hospit' | 'Leucocytes' | 2018-08-01 | 22.0 | 0.274 |
Pôle/DMU | 8312022130 | Care site 3 | 'Hospit' | 'Leucocytes' | 2022-02-01 | 9746.0 | 0.769 |
The StepFunction fits a step function \(f_{t_0, c_0}(t)\) with coefficients \(\Theta = (t_0, c_0)\) on a completeness predictor \(c(t)\):
$$ f_{t_0, c_0}(t) = c_0 \ \mathbb{1}_{t \geq t_0}(t) $$
- the characteristic time \(t_0\) estimates the time after which the data is available.
- the characteristic value \(c_0\) estimates the stabilized routine completeness.
The default metric computed is \(error\), the mean squared error after \(t_0\). It estimates the stability of the data after \(t_0\).
Custom metric
You can define your own metric if this one doesn't meet your requirements.
The available algorithms used to fit the step function are listed below:
Custom algo
You can define your own algorithm if these don't meet your requirements.
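The loss-minimization algorithm computes the estimated coefficients \(\hat{t_0}\) and \(\hat{c_0}\) by minimizing the loss function \(\mathcal{L}(t_0, c_0)\):
$$ (\hat{t_0}, \hat{c_0}) = \underset{t_0, c_0}{\operatorname{argmin}} \ \mathcal{L}(t_0, c_0) $$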
Default loss function \(\mathcal{l}\)
The loss function is \(l_2\) by default: $$ \mathcal{l}(c(t), f_{t_0, c_0}(t)) = |c(t) - f_{t_0, c_0}(t)|^2 $$
Optimal estimates
To reduce computational complexity, this algorithm has been implemented with a dependency relation between \(c_0\) and \(t_0\) derived from the optimal estimates using the \(l_2\) loss function. For more information, see the source code.
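One way to picture such a dependency relation: for a fixed \(t_0\), the value of \(c_0\) that minimizes the \(l_2\) loss is the mean of \(c(t)\) for \(t \geq t_0\), so only \(t_0\) needs to be searched. The sketch below follows that idea under the assumptions that months are integer indices and that \(\mathcal{L}\) is the mean of the pointwise losses. It is not the library's code, and `fit_step_function` is a hypothetical name.

```python
def fit_step_function(c):
    """Return (t_0, c_0, loss) minimizing the mean l2 loss of a step function."""
    best = None
    for t_0 in range(len(c)):
        after = c[t_0:]
        # For a fixed t_0, the l2-optimal c_0 is the mean of c(t) for t >= t_0.
        c_0 = sum(after) / len(after)
        # Before t_0 the step function is 0, after t_0 it is c_0.
        loss = (
            sum(x ** 2 for x in c[:t_0]) + sum((x - c_0) ** 2 for x in after)
        ) / len(c)
        if best is None or loss < best[2]:
            best = (t_0, c_0, loss)
    return best


# Hypothetical completeness predictor over six consecutive months.
t_0, c_0, loss = fit_step_function([0.0, 0.0, 0.1, 0.8, 0.9, 0.7])
print(t_0, round(c_0, 3), round(loss, 4))  # 3 0.8 0.005
```

Searching over \(t_0\) alone turns a two-parameter optimization into a single pass over the candidate months.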
In the quantile algorithm, \(\hat{c_0}\) is directly estimated as the \(x^{th}\) quantile of the completeness predictor \(c(t)\), where \(x\) is a number between 0 and 1. Then, \(\hat{t_0}\) is the first time \(c(t)\) reaches \(\hat{c_0}\).
Default quantile \(x\)
The default quantile is \(x = 0.8\).
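The quantile-based estimate can be sketched in plain Python as follows. The quantile here uses linear interpolation between sorted values, which is an assumption (the library may use another convention), and months are represented by integer indices. Both function names are hypothetical.

```python
def quantile(values, x):
    """x-th quantile (0 <= x <= 1) with linear interpolation between sorted values."""
    ordered = sorted(values)
    position = x * (len(ordered) - 1)
    low = int(position)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (position - low) * (ordered[high] - ordered[low])


def fit_quantile(c, x=0.8):
    """Estimate c_0 as the x-th quantile of c(t) and t_0 as the first month reaching it."""
    c_0 = quantile(c, x)
    t_0 = next((t for t, value in enumerate(c) if value >= c_0), None)
    return t_0, c_0


# Hypothetical completeness predictor over five consecutive months.
t_0, c_0 = fit_quantile([0.0, 0.1, 0.8, 0.9, 0.7])
print(t_0, round(c_0, 2))  # 3 0.82
```

Compared with loss minimization, this estimate needs no search, but it depends directly on the chosen quantile \(x\).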
from edsteva.models.step_function import StepFunction
step_function_model = StepFunction()
step_function_model.fit(probe)
step_function_model.estimates.head()
care_site_level | care_site_id | stay_type | t_0 | c_0 | error |
---|---|---|---|---|---|
Unité Fonctionnelle (UF) | 8312056386 | 'Urg' | 2019-05-01 | 0.397 | 0.040 |
Unité Fonctionnelle (UF) | 8312056386 | 'All' | 2017-04-01 | 0.583 | 0.028 |
Pôle/DMU | 8312027648 | 'Hospit' | 2021-03-01 | 0.677 | 0.022 |
Pôle/DMU | 8312027648 | 'All' | 2018-08-01 | 0.764 | 0.014 |
Hôpital | 8312022130 | 'Hospit' | 2022-02-01 | 0.652 | 0.027 |
The RectangleFunction fits a rectangle function \(f_{t_0, c_0, t_1}(t)\) with coefficients \(\Theta = (t_0, c_0, t_1)\) on a completeness predictor \(c(t)\):
$$ f_{t_0, c_0, t_1}(t) = c_0 \ \mathbb{1}_{t_0 \leq t \leq t_1}(t) $$
- the characteristic time \(t_0\) estimates the time after which the data is available.
- the characteristic time \(t_1\) estimates the time after which the data is not available anymore.
- the characteristic value \(c_0\) estimates the completeness between \(t_0\) and \(t_1\).
The default metric computed is \(error\), the mean squared error between \(t_0\) and \(t_1\). It estimates the stability of the data between \(t_0\) and \(t_1\).
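The rectangle function and its error metric can be illustrated in the same spirit as the step function. The sketch assumes integer month indices, inclusive bounds at \(t_0\) and \(t_1\), and a plain mean of squared residuals. These are assumptions rather than the library's exact conventions, and the function names are hypothetical.

```python
def rectangle_function(t, t_0, c_0, t_1):
    """Evaluate f_{t_0, c_0, t_1}(t): c_0 between t_0 and t_1 (inclusive), 0 elsewhere."""
    return c_0 if t_0 <= t <= t_1 else 0.0


def error_between(c, t_0, c_0, t_1):
    """Mean squared residual between c(t) and the rectangle function on [t_0, t_1]."""
    last = min(t_1, len(c) - 1)
    squared = [
        (c[t] - rectangle_function(t, t_0, c_0, t_1)) ** 2
        for t in range(t_0, last + 1)
    ]
    return sum(squared) / len(squared) if squared else 0.0


# Hypothetical predictor: data available from month 1 to month 3 only.
c = [0.0, 0.6, 0.7, 0.5, 0.0]
print(error_between(c, t_0=1, c_0=0.6, t_1=3))
```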
Custom metric
You can define your own metric if this one doesn't meet your requirements.
The available algorithms used to fit the rectangle function are listed below:
Custom algo
You can define your own algorithm if these don't meet your requirements.
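The loss-minimization algorithm computes the estimated coefficients \(\hat{t_0}\), \(\hat{c_0}\) and \(\hat{t_1}\) by minimizing the loss function \(\mathcal{L}(t_0, c_0, t_1)\):
$$ (\hat{t_0}, \hat{c_0}, \hat{t_1}) = \underset{t_0, c_0, t_1}{\operatorname{argmin}} \ \mathcal{L}(t_0, c_0, t_1) $$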
Default loss function \(\mathcal{l}\)
The loss function is \(l_2\) by default: $$ \mathcal{l}(c(t), f_{t_0, c_0, t_1}(t)) = |c(t) - f_{t_0, c_0, t_1}(t)|^2 $$
Optimal estimates
To reduce computational complexity, this algorithm has been implemented with a dependency relation between \(c_0\) and \(t_0\) derived from the optimal estimates using the \(l_2\) loss function. For more information, see the source code.
from edsteva.models.rectangle_function import RectangleFunction
rectangle_function_model = RectangleFunction()
rectangle_function_model.fit(probe)
rectangle_function_model.estimates.head()
care_site_level | care_site_id | stay_type | t_0 | c_0 | t_1 | error |
---|---|---|---|---|---|---|
Unité Fonctionnelle (UF) | 8312056386 | 'Urg' | 2019-05-01 | 0.397 | 2020-05-01 | 0.040 |
Unité Fonctionnelle (UF) | 8312056386 | 'All' | 2017-04-01 | 0.583 | 2013-04-01 | 0.028 |
Pôle/DMU | 8312027648 | 'Hospit' | 2021-03-01 | 0.677 | 2022-03-01 | 0.022 |
Pôle/DMU | 8312027648 | 'All' | 2018-08-01 | 0.764 | 2019-08-01 | 0.014 |
Hôpital | 8312022130 | 'Hospit' | 2022-02-01 | 0.652 | 2022-08-01 | 0.027 |
The library provides interactive dashboards that let you set any combination of care sites, stay types and other columns if included in the Probe. Dashboards can only be exported in HTML format.
The probe_dashboard() returns:
- On the top, the aggregated variable is the average completeness predictor \(c(t)\) over time \(t\) with the prediction \(\hat{c}(t)\) if the fitted Model is specified.
- On the bottom, the interactive filters are all the columns included in the Probe (such as time, care site, number of visits, etc.).
from edsteva.viz.dashboards import probe_dashboard
probe_dashboard(
probe=probe,
fitted_model=step_function_model,
care_site_level=care_site_level,
)
The normalized_probe_dashboard() returns a representation of the overall deviation from the Model:
- On the top, the aggregated variable is a normalized completeness predictor \(\frac{c(t)}{c_0}\) over normalized time \(t - t_0\).
- On the bottom, the interactive filters are all the columns included in the Probe (such as time, care site, number of visits, etc.) together with all the coefficients and metrics included in the Model.
from edsteva.viz.dashboards import normalized_probe_dashboard
normalized_probe_dashboard(
probe=probe,
fitted_model=step_function_model,
care_site_level=care_site_level,
)
An example is available here.
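The normalization shown in this dashboard, \(\frac{c(t)}{c_0}\) over \(t - t_0\), can be sketched as follows. It assumes integer month indices, and the function name `normalize_predictor` is hypothetical. It only illustrates the change of variables, not how the dashboard is built.

```python
def normalize_predictor(c, t_0, c_0):
    """Return (t - t_0, c(t) / c_0) pairs for a single fitted predictor."""
    if c_0 == 0:
        raise ValueError("c_0 must be non-zero to normalize the predictor")
    return [(t - t_0, value / c_0) for t, value in enumerate(c)]


# Hypothetical predictor with fitted coefficients t_0 = 2 and c_0 = 0.8.
for shifted_time, normalized_value in normalize_predictor([0.0, 0.4, 0.8, 0.6], 2, 0.8):
    print(shifted_time, round(normalized_value, 2))
```

After this transformation, every fitted care site is expected to jump to about 1 at normalized time 0, so predictors from different care sites can be overlaid and compared with the Model.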
The library provides static plots that you can export in png or svg. Because they are not interactive, you have to specify the filters in the inputs of the functions.
The probe_plot() returns the top plot of the probe_dashboard(): the average completeness predictor \(c(t)\) over time \(t\), with the prediction \(\hat{c}(t)\) if the fitted Model is specified.
from edsteva.viz.plots import probe_plot
probe_plot(
probe=probe,
fitted_model=step_function_model,
care_site_level=care_site_level,
stay_type=stay_type,
save_path=plot_path,
)
The normalized_probe_plot() returns the top plot of the normalized_probe_dashboard(): the normalized completeness predictor \(\frac{c(t)}{c_0}\) over normalized time \(t - t_0\). Consequently, you have to specify the filters in the inputs of the function.
from edsteva.viz.plots import normalized_probe_plot
normalized_probe_plot(
probe=probe,
fitted_model=step_function_model,
t_min=-15,
t_max=15,
save_path=plot_path,
)
The estimates_densities_plot() returns the density plot and the median of each estimate. It can help you set the thresholds.
from edsteva.viz.plots import estimates_densities_plot
estimates_densities_plot(
fitted_model=step_function_model,
)
[1] Samuel G. Finlayson, Adarsh Subbaswamy, Karandeep Singh, John Bowers, Annabel Kupke, Jonathan Zittrain, Isaac S. Kohane, and Suchi Saria. The clinician and dataset shift in artificial intelligence. The New England Journal of Medicine, 385(3):283, 2021.