EDS-TeVa Documentation


Documentation: https://aphp.github.io/edsteva/latest/

Source Code: https://github.com/aphp/edsteva


Getting Started

EDS-TeVa provides a set of tools to characterize the temporal variability of data induced by the dynamics of the clinical IT system.

Context

Real-world data is subject to important temporal drifts that may be caused by a variety of factors [1]. In particular, data availability fluctuates with the deployment of clinical software and its clinical use. The dynamics of software deployment and adoption are not trivial, as they depend on the care site and on the category of data considered.

Installation

Requirements

EDS-TeVa stands on the shoulders of Spark 2.4, which runs on Java 8 and Python ~3.7.1. It is therefore essential to:

  • Install a version of Python \(\geq 3.7.1\) and \(< 3.8\).
  • Install OpenJDK 8, an open-source reference implementation of Java 8, with the following command lines:

    On Debian/Ubuntu:

    $ sudo apt-get update
    $ sudo apt-get install openjdk-8-jdk

    For more details, check this installation guide

    On macOS (Homebrew):

    $ brew tap AdoptOpenJDK/openjdk
    $ brew install --cask adoptopenjdk8

    For more details, check this installation guide

    Follow this installation guide

You can install EDS-TeVa through pip:

$ pip install edsteva
Successfully installed edsteva

We recommend pinning the library version in your projects, or using a strict package manager like Poetry.

pip install edsteva==0.2.8

Working example: administrative records relative to visits

Let's consider a basic category of data: administrative records relative to visits. A visit is characterized by a care site, a length of stay, a stay type (full hospitalisation, emergency, consultation, etc.) and other characteristics. In this example, the objective is to estimate the availability of visit records with respect to time, care site and stay type.

1. Load your data

As detailed in the dedicated section, EDS-TeVa expects to work with Pandas or Koalas DataFrames. We provide various connectors to facilitate data fetching, namely a Hive connector, a Postgres connector and a LocalData connector.

from edsteva.io import HiveData

db_name = "my_db"
tables_to_load = [
    "visit_occurrence",
    "visit_detail",
    "care_site",
    "fact_relationship",
]
data = HiveData(db_name, tables_to_load=tables_to_load)
data.visit_occurrence  # (1)
  1. With this connector, visit_occurrence will be a Koalas DataFrame

from edsteva.io import PostgresData

db_name = "my_db"
schema = "my_schema"
user = "my_username"
data = PostgresData(db_name, schema=schema, user=user)  # (1)
data.visit_occurrence  # (2)
  1. This connector expects a .pgpass file storing the connection parameters
  2. With this connector, visit_occurrence will be a Pandas DataFrame

import os
from edsteva.io import LocalData

folder = os.path.abspath(MY_FOLDER_PATH)

data = LocalData(folder)  # (1)
data.visit_occurrence  # (2)
  1. This connector expects a folder with a file per table to load.
  2. With this connector, visit_occurrence will be a Pandas DataFrame

2. Choose a Probe or create a new Probe

Probe

A Probe is a Python class designed to compute a completeness predictor \(c(t)\) that characterizes the data availability of a target variable over time \(t\).

In this example, \(c(t)\) predicts the availability of administrative records relative to visits. It is defined for each characteristic (care site, stay type, age range, length of stay, etc.) as the number of visits \(n_{visit}(t)\) per month \(t\), normalized by the maximum number of records per month \(n_{max} = \max_{t}(n_{visit}(t))\) computed over the entire study period:

\[ c(t) = \frac{n_{visit}(t)}{n_{max}} \]

If the maximum number of records per month \(n_{max}\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.
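
As a minimal sketch of this definition (not the library's implementation), the predictor can be computed from monthly counts in plain Python. The `completeness` helper and the month keys below are hypothetical names chosen for illustration:

```python
def completeness(n_visit_per_month):
    """Return {month: c(t)} with c(t) = n_visit(t) / n_max."""
    # n_max: highest monthly count over the whole study period
    n_max = max(n_visit_per_month.values(), default=0)
    if n_max == 0:
        # Convention from the text: c(t) = 0 when n_max = 0
        return {month: 0.0 for month in n_visit_per_month}
    return {month: n / n_max for month, n in n_visit_per_month.items()}


c = completeness({"2020-01": 2, "2020-02": 4, "2020-03": 1})
print(c)  # {'2020-01': 0.5, '2020-02': 1.0, '2020-03': 0.25}
```

Because \(n_{max}\) is taken over the entire study period, \(c(t)\) always lies between 0 and 1 and reaches 1 in the busiest month.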

The VisitProbe is already available by default in the library.

2.1 Compute your Probe

The compute() method takes a Data object as input and stores the computed completeness predictor \(c(t)\) in the predictor attribute of a Probe:

from edsteva.probes import VisitProbe

probe_path = "my_path/visit.pkl"

visit = VisitProbe()
visit.compute(
    data,
    care_site_levels=["Hospital", "Pole", "UF"],  # (1)
    stay_types={
        "All": ".*",
        "Urg_Hospit": "urgence|hospitalisés",  # (2)
    },
    care_site_specialties=None,  # (3)
    stay_sources=None,  # (4)
    length_of_stays=None,  # (5)
    provenance_sources=None,  # (6)
    age_ranges=None,  # (7)
)
visit.save(path=probe_path)  # (8)
visit.predictor.head()
  1. The care sites are articulated into levels (cf. AP-HP's reference structure). Here, as an example, we are only interested in those three levels.
  2. The stay_types argument expects a Python dictionary with labels as keys and regular expressions as values.
  3. In this example we want to ignore the care site specialty (e.g., Cardiology, Pediatrics).
  4. In this example we want to ignore the stay source (e.g., MCO, SSR, PSY).
  5. In this example we want to ignore the length of stay (e.g., \(\geq\) 7 days, \(\leq\) 2 days).
  6. In this example we want to ignore the provenance source (e.g., service d'urgence, d'une unité de soins de courte durée).
  7. In this example we want to ignore the age range (e.g., 0-18 years, 18-25 years, 25-30 years).
  8. Saving the Probe after computation saves you from having to compute it again. Afterwards, you can simply use VisitProbe.load(path=probe_path).

Saved to /my_path/visit.pkl

care_site_level | care_site_id | care_site_short_name | stay_type | date | n_visit | c
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'Urg_Hospit' | 2019-05-01 | 233.0 | 0.841
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'All' | 2021-04-01 | 393.0 | 0.640
Pôle/DMU | 8312027648 | Care site 2 | 'Urg_Hospit' | 2017-03-01 | 204.0 | 0.497
Pôle/DMU | 8312027648 | Care site 2 | 'All' | 2018-08-01 | 22.0 | 0.274
Hôpital | 8312022130 | Care site 3 | 'Urg_Hospit' | 2022-02-01 | 9746.0 | 0.769
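
To illustrate how a label-to-regex dictionary such as stay_types can group raw values, here is a minimal sketch. It assumes the patterns are applied with `re.search` to a raw stay-type string; the raw values and the `labels_for` helper are hypothetical, and the library's actual matching rules may differ:

```python
import re

stay_types = {
    "All": ".*",
    "Urg_Hospit": "urgence|hospitalisés",
}


def labels_for(raw_stay_type, stay_types):
    """Return every label whose regex matches the raw stay-type value."""
    return [
        label
        for label, pattern in stay_types.items()
        if re.search(pattern, raw_stay_type)
    ]


print(labels_for("urgence", stay_types))  # ['All', 'Urg_Hospit']
print(labels_for("consultation externe", stay_types))  # ['All']
```

Under this assumption a single visit can contribute to several labels, which is why the same care site appears once per stay_type in the table above.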

2.2 Filter your Probe

In this example, we are interested in three hospitals. We consequently filter data before any further analysis.

from edsteva.probes import VisitProbe

care_site_short_name = ["Hôpital-1", "Hôpital-2", "Hôpital-3"]

filtered_visit = VisitProbe()
filtered_visit.load(path=probe_path)
filtered_visit.filter_care_site(care_site_short_names=care_site_short_name)  # (1)
  1. To filter care sites, there is a dedicated method that also includes all upper- and lower-level care sites related to the selected care sites.
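
The idea of keeping the upper and lower levels of a selected care site can be sketched as a walk through a care-site hierarchy. This is only an illustration under assumptions: the `parent_of` mapping, the site names and both helpers are hypothetical, the hierarchy is assumed to be a tree, and the library's own filtering logic may differ:

```python
def ancestors(site, parent_of):
    """Return the chain of upper-level care sites above `site`."""
    chain = []
    while site in parent_of:
        site = parent_of[site]
        chain.append(site)
    return chain


def related_care_sites(selected, parent_of):
    """Return the selected sites plus all their upper and lower levels."""
    selected = set(selected)
    related = set(selected)
    all_sites = set(parent_of) | set(parent_of.values())
    for site in all_sites:
        upper = ancestors(site, parent_of)
        if site in selected:
            related.update(upper)  # upper levels of a selected site
        elif selected.intersection(upper):
            related.add(site)  # lower levels of a selected site
    return related


# Hypothetical hierarchy: UF -> Pole -> Hospital
parent_of = {
    "UF-1": "Pole-A",
    "UF-2": "Pole-A",
    "UF-3": "Pole-B",
    "Pole-A": "Hopital-1",
    "Pole-B": "Hopital-1",
}
print(sorted(related_care_sites(["Pole-A"], parent_of)))
# ['Hopital-1', 'Pole-A', 'UF-1', 'UF-2']
```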

2.3 Visualize your Probe

Interactive dashboard

Interactive dashboards can be used to visualize the average completeness predictor \(c(t)\) of the selected care sites and stay types.

from edsteva.viz.dashboards import probe_dashboard

probe_dashboard(
    probe=filtered_visit,
)
An interactive dashboard is available here.

Static plot

If you need a static plot for a report, a paper or anything else, you can use the probe_plot() function. It returns the top plot of the dashboard without the interactive filters. Consequently, you have to specify the filters in the inputs of the function.

from edsteva.viz.plots import probe_plot

plot_path = "my_path/visit.html"
stay_type = "All"

probe_plot(
    probe=filtered_visit,
    care_site_level="Hospital",
    stay_type=stay_type,
    save_path=plot_path,  # (1)
)
  1. If a save_path is specified, it'll save your plot in the specified path.

3. Choose a Model or create a new Model

Model

A Model is a Python class designed to fit a function \(f_\Theta(t)\) to each completeness predictor \(c(t)\) of a Probe. The fit process estimates the coefficients \(\Theta\) along with metrics that characterize the temporal variability of data availability.

In this example, the model fits a step function \(f_{t_0, c_0}(t)\) to the completeness predictor \(c(t)\) with coefficients \(\Theta = (t_0, c_0)\):

\[ f_{t_0, c_0}(t) = c_0 \ \mathbb{1}_{t \geq t_0}(t) \]
  • the characteristic time \(t_0\) estimates the time after which the data is available.
  • the characteristic value \(c_0\) estimates the stabilized routine completeness.

It also computes the following \(error\) metric that estimates the stability of the data after \(t_0\):

\[ \begin{aligned} error & = \frac{\sum_{t_0 \leq t \leq t_{max}} \epsilon(t)^2}{t_{max} - t_0} \\ \epsilon(t) & = f_{t_0, c_0}(t) - c(t) \end{aligned} \]
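
The step function and the \(error\) metric can be evaluated by hand as follows. This is a minimal sketch rather than the library's code: months are represented by integer indices \(t = 0, \dots, t_{max}\), the helper names are hypothetical, and returning NaN when \(t_0 = t_{max}\) is an assumption made to avoid dividing by zero:

```python
import math


def step_function(t, t_0, c_0):
    """f(t) = c_0 if t >= t_0, else 0."""
    return c_0 if t >= t_0 else 0.0


def stability_error(c, t_0, c_0):
    """Sum of squared residuals after t_0, divided by (t_max - t_0)."""
    t_max = len(c) - 1
    if t_max == t_0:
        return math.nan  # assumption: the metric is undefined here
    squared = sum(
        (step_function(t, t_0, c_0) - c[t]) ** 2 for t in range(t_0, t_max + 1)
    )
    return squared / (t_max - t_0)


c = [0.0, 0.0, 0.5, 0.75, 0.25]  # c(t) for t = 0..4
print(stability_error(c, t_0=2, c_0=0.5))  # 0.0625
```

Here the residuals after \(t_0 = 2\) are 0, -0.25 and 0.25, so the sum of squares is 0.125 and the error is 0.125 / (4 - 2) = 0.0625.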

This step function Model is available in the library.

3.1 Fit your Model

The fit() method takes a Probe as input. It estimates the coefficients, for example by minimizing a quadratic loss function, and computes the metrics. Finally, it stores the estimated coefficients and the computed metrics in the estimates attribute of the Model.

from edsteva.models.step_function import StepFunction

model_path = "my_path/fitted_visit.pkl"

step_function_model = StepFunction()
step_function_model.fit(probe=filtered_visit)
step_function_model.save(model_path)  # (1)
step_function_model.estimates.head()
  1. Saving the Model after fitting saves you from having to fit it again. Afterwards, you can simply use StepFunction.load(path=model_path).

Saved to /my_path/fitted_visit.pkl

care_site_level | care_site_id | stay_type | t_0 | c_0 | error
Pôle/DMU | 8312056386 | 'Urg_Hospit' | 2019-05-01 | 0.397 | 0.040
Pôle/DMU | 8312056386 | 'All' | 2017-04-01 | 0.583 | 0.028
Pôle/DMU | 8312027648 | 'Urg_Hospit' | 2021-03-01 | 0.677 | 0.022
Pôle/DMU | 8312027648 | 'All' | 2018-08-01 | 0.764 | 0.014
Pôle/DMU | 8312022130 | 'Urg_Hospit' | 2022-02-01 | 0.652 | 0.027

3.2 Visualize your fitted Probe

Interactive dashboard

Interactive dashboards can be used to visualize the average completeness predictor \(c(t)\) along with the fitted step function of the selected care sites and stay types.

from edsteva.viz.dashboards import probe_dashboard

probe_dashboard(
    probe=filtered_visit,
    fitted_model=step_function_model,
)
An interactive dashboard is available here.

Static plot

If you need a static plot for a report, a paper or anything else, you can use the probe_plot() function. It returns the top plot of the dashboard without the interactive filters. Consequently, you have to specify the filters in the inputs of the function.

from edsteva.viz.plots import probe_plot

plot_path = "my_path/fitted_visit.html"
stay_type = "All"

probe_plot(
    probe=filtered_visit,
    fitted_model=step_function_model,
    care_site_level="Hospital",
    stay_type=stay_type,
    save_path=plot_path,  # (1)
)
1. If a save_path is specified, it'll save your plot in the specified path.

4. Set the thresholds to fix the deployment bias

Now that we have estimated \(t_0\), \(c_0\) and \(error\) for each care site and each stay type, we can set a threshold for each estimate in order to select only the care sites where the visits are available over the period of interest.

4.1 Visualize estimates distributions

Visualizing the density plots and the medians of the estimates can help you set the threshold values.

from edsteva.viz.plots import estimates_densities_plot

estimates_densities_plot(
    probe=filtered_visit,
    fitted_model=step_function_model,
)

4.2 Set the thresholds

The estimates dashboard provides a representation of the overall deviation from the Model on the top and interactive sliders on the bottom that allow you to vary the thresholds. The idea is to set the thresholds that keep the most care sites while having an acceptable overall deviation.

from edsteva.viz.dashboards import estimates_dashboard

estimates_dashboard(
    probe=filtered_visit,
    fitted_model=step_function_model,
)

The threshold dashboard is available here.

4.3 Fix the deployment bias

Once you set the thresholds, you can extract for each stay type the care sites for which data availability is estimated to be stable over the entire study period.

t_0_max = "2020-01-01"  # (1)
c_0_min = 0.6  # (2)
error_max = 0.05  # (3)

estimates = step_function_model.estimates
selected_care_site = estimates[
    (estimates["t_0"] <= t_0_max)
    & (estimates["c_0"] >= c_0_min)
    & (estimates["error"] <= error_max)
]
print(selected_care_site["care_site_id"].unique())
  1. In this example the study period starts on January 1, 2020.
  2. The characteristic value \(c_0\) estimates the stabilized routine completeness. As we want the selected care sites to have a good completeness after \(t_0\), one can for example set the threshold around the median (cf. distribution) to keep half of the care sites with the highest completeness after \(t_0\).
  3. \(error\) estimates the stability of the data after \(t_0\). As we want the selected care sites to be stable after \(t_0\), one can set the threshold around the median (cf. distribution) to keep half of the care sites with the lowest error after \(t_0\).
[8312056386, 8457691845, 8745619784, 8314578956, 8314548764, 8542137845]

In this example, the \(c_0\) and \(error\) thresholds have been set around the median (cf. distribution). However, this choice is arbitrary, and you have to find the appropriate method for your study with the help of the estimates dashboard.
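
As an illustration of the median rule, the thresholds can be derived from the estimates themselves. The sketch below uses plain Python and made-up rows (the care site ids "A" to "E" and their values are hypothetical, not results from the library):

```python
from statistics import median

# Hypothetical estimates, one row per care site
estimates = [
    {"care_site_id": "A", "c_0": 0.40, "error": 0.040},
    {"care_site_id": "B", "c_0": 0.58, "error": 0.028},
    {"care_site_id": "C", "c_0": 0.68, "error": 0.022},
    {"care_site_id": "D", "c_0": 0.76, "error": 0.014},
    {"care_site_id": "E", "c_0": 0.65, "error": 0.027},
]

c_0_min = median(row["c_0"] for row in estimates)  # 0.65
error_max = median(row["error"] for row in estimates)  # 0.027

selected = [
    row["care_site_id"]
    for row in estimates
    if row["c_0"] >= c_0_min and row["error"] <= error_max
]
print(selected)  # ['C', 'D', 'E']
```

Each threshold alone keeps roughly half of the care sites, but combining them can keep more or fewer than half depending on how \(c_0\) and \(error\) are correlated.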

Limitations

EDS-TeVa provides modelling tools to characterize the temporal variability of your data; it does not intend to provide direct methods to fix the deployment bias. As an open-source library, EDS-TeVa is also here to host a discussion in order to facilitate collective methodological convergence on flexible solutions. The default methods proposed in this example are intended to be reviewed and challenged by the user community.

Make it your own

The working example above describes the canonical usage workflow. However, you will probably need different Probes, Models, Visualizations and methods to set the thresholds for your projects. The components already available in the library are listed below, but if they don't meet your requirements, you are encouraged to create your own.

Contribution

If you managed to implement your own component, or even if you just thought about a new component, do not hesitate to share it with the community by following the contribution guidelines. Contributions are welcome and greatly appreciated! Every little bit helps, and credit will always be given.

Available components

The VisitProbe computes \(c_{visit}(t)\), the availability of administrative stays:

\[ c(t) = \frac{n_{visit}(t)}{n_{max}} \]

Where \(n_{visit}(t)\) is the number of administrative stays, \(t\) is the month and \(n_{max} = \max_{t}(n_{visit}(t))\).

If the maximum number of records per month \(n_{max}\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.

from edsteva.probes import VisitProbe

visit = VisitProbe()
visit.compute(
    data,
    stay_types={
        "Urg": "urgence",
        "Hospit": "hospitalisés",
        "Urg_Hospit": "urgence|hospitalisés",
    },
)
visit.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | date | n_visit | c
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'Urg' | 2019-05-01 | 233.0 | 0.841
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'Urg' | 2021-04-01 | 393.0 | 0.640
Pôle/DMU | 8312027648 | Care site 2 | 'Hospit' | 2017-03-01 | 204.0 | 0.497
Pôle/DMU | 8312027648 | Care site 2 | 'Urg' | 2018-08-01 | 22.0 | 0.274
Hôpital | 8312022130 | Care site 3 | 'Urg_Hospit' | 2022-02-01 | 9746.0 | 0.769

The NoteProbe computes \(c_{note}(t)\), the availability of clinical documents:

The per_visit_default algorithm computes \(c(t)\), the availability of clinical documents linked to patients' administrative stays:

\[ c(t) = \frac{n_{with\,doc}(t)}{n_{visit}(t)} \]

Where \(n_{visit}(t)\) is the number of administrative stays, \(n_{with\,doc}(t)\) is the number of visits having at least one document and \(t\) is the month.

If the number of visits \(n_{visit}(t)\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.
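
The per-visit ratio and its zero-visit convention can be sketched in plain Python. The `per_visit_completeness` helper and the monthly counts are hypothetical; this is an illustration of the formula, not the library's implementation:

```python
def per_visit_completeness(n_visit, n_with_doc):
    """Return {month: n_with_doc(t) / n_visit(t)}, or 0 when n_visit(t) = 0."""
    return {
        month: (n_with_doc.get(month, 0) / n if n else 0.0)
        for month, n in n_visit.items()
    }


n_visit = {"2020-01": 4, "2020-02": 0}
n_with_doc = {"2020-01": 3}
print(per_visit_completeness(n_visit, n_with_doc))
# {'2020-01': 0.75, '2020-02': 0.0}
```

Unlike the VisitProbe predictor, this ratio is normalized month by month, so it does not depend on the busiest month of the study period.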

from edsteva.probes import NoteProbe

note = NoteProbe(completeness_predictor="per_visit_default")
note.compute(
    data,
    stay_types={
        "Urg": "urgence",
        "Hospit": "hospitalisés",
        "Urg_Hospit": "urgence|hospitalisés",
    },
    note_types={
        "All": ".*",
        "CRH": "crh",
        "Ordonnance": "ordo",
        "CR Passage Urgences": "urge",
    },
)
note.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | note_type | date | n_visit | n_visit_with_note | c
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'Urg' | 'All' | 2019-05-01 | 233.0 | 196.0 | 0.841
Unité Fonctionnelle (UF) | 8653815660 | Care site 1 | 'Hospit' | 'CRH' | 2017-04-01 | 393.0 | 252.0 | 0.640
Pôle/DMU | 8312027648 | Care site 2 | 'Hospit' | 'CRH' | 2021-03-01 | 204.0 | 101.0 | 0.497
Pôle/DMU | 8312056379 | Care site 2 | 'Urg' | 'Ordonnance' | 2018-08-01 | 22.0 | 6.0 | 0.274
Hôpital | 8312022130 | Care site 3 | 'Urg_Hospit' | 'CR Passage Urgences' | 2022-02-01 | 9746.0 | 7495.0 | 0.769

The per_note_default algorithm computes \(c(t)\), the availability of clinical documents, as follows:

\[ c(t) = \frac{n_{note}(t)}{n_{max}} \]

Where \(n_{note}(t)\) is the number of clinical documents, \(t\) is the month and \(n_{max} = \max_{t}(n_{note}(t))\).

If the maximum number of recorded notes per month \(n_{max}\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.

from edsteva.probes import NoteProbe

note = NoteProbe(completeness_predictor="per_note_default")
note.compute(
    data,
    stay_types={
        "Urg": "urgence",
        "Hospit": "hospitalisés",
        "Urg_Hospit": "urgence|hospitalisés",
    },
    note_types={
        "All": ".*",
        "CRH": "crh",
        "Ordonnance": "ordo",
        "CR Passage Urgences": "urge",
    },
)
note.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | note_type | date | n_note | c
Unité Fonctionnelle (UF) | 8312056386 | Care site 1 | 'Urg' | 'All' | 2019-05-01 | 233.0 | 0.841
Unité Fonctionnelle (UF) | 8653815660 | Care site 1 | 'Hospit' | 'CRH' | 2017-04-01 | 393.0 | 0.640
Pôle/DMU | 8312027648 | Care site 2 | 'Hospit' | 'CRH' | 2021-03-01 | 204.0 | 0.497
Pôle/DMU | 8312056379 | Care site 2 | 'Urg' | 'Ordonnance' | 2018-08-01 | 22.0 | 0.274
Hôpital | 8312022130 | Care site 3 | 'Urg_Hospit' | 'CR Passage Urgences' | 2022-02-01 | 9746.0 | 0.769

The ConditionProbe computes \(c_{condition}(t)\), the availability of claim data:

The per_visit_default algorithm computes \(c(t)\), the availability of claim data linked to patients' administrative stays:

\[ c(t) = \frac{n_{with\,condition}(t)}{n_{visit}(t)} \]

Where \(n_{visit}(t)\) is the number of administrative stays, \(n_{with\,condition}(t)\) is the number of stays having at least one claim code (e.g. ICD-10) recorded and \(t\) is the month.

If the number of visits \(n_{visit}(t)\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.

Care site level

AREM claim data are only available at hospital level.

from edsteva.probes import ConditionProbe

condition = ConditionProbe(completeness_predictor="per_visit_default")
condition.compute(
    data,
    stay_types={
        "Hospit": "hospitalisés",
    },
    diag_types={
        "All": ".*",
        "DP/DR": "DP|DR",
    },
    condition_types={
        "All": ".*",
        "Pulmonary_embolism": "I26",
    },
    source_systems=["AREM", "ORBIS"],
)
condition.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | diag_type | condition_type | source_systems | date | n_visit | n_visit_with_condition | c
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'All' | 'Pulmonary_embolism' | AREM | 2019-05-01 | 233.0 | 196.0 | 0.841
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'DP/DR' | 'Pulmonary_embolism' | AREM | 2021-04-01 | 393.0 | 252.0 | 0.640
Hôpital | 8312027648 | Care site 2 | 'Hospit' | 'All' | 'Pulmonary_embolism' | AREM | 2017-03-01 | 204.0 | 101.0 | 0.497
Unité Fonctionnelle (UF) | 8312027648 | Care site 2 | 'Hospit' | 'All' | 'All' | ORBIS | 2018-08-01 | 22.0 | 6.0 | 0.274
Pôle/DMU | 8312022130 | Care site 3 | 'Hospit' | 'DP/DR' | 'Pulmonary_embolism' | ORBIS | 2022-02-01 | 9746.0 | 7495.0 | 0.769

The per_condition_default algorithm computes \(c(t)\), the availability of claim data, as follows:

\[ c(t) = \frac{n_{condition}(t)}{n_{max}} \]

Where \(n_{condition}(t)\) is the number of claim codes (e.g. ICD-10) recorded, \(t\) is the month and \(n_{max} = \max_{t}(n_{condition}(t))\).

If the maximum number of recorded diagnoses per month \(n_{max}\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.

from edsteva.probes import ConditionProbe

condition = ConditionProbe(completeness_predictor="per_condition_default")
condition.compute(
    data,
    stay_types={
        "All": ".*",
        "Hospit": "hospitalisés",
    },
    diag_types={
        "All": ".*",
        "DP/DR": "DP|DR",
    },
    condition_types={
        "All": ".*",
        "Pulmonary_embolism": "I26",
    },
    source_systems=["AREM", "ORBIS"],
)
condition.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | diag_type | condition_type | source_systems | date | n_condition | c
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'All' | 'Pulmonary_embolism' | AREM | 2019-05-01 | 233.0 | 0.841
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'DP/DR' | 'Pulmonary_embolism' | AREM | 2021-04-01 | 393.0 | 0.640
Hôpital | 8312027648 | Care site 2 | 'Hospit' | 'All' | 'Pulmonary_embolism' | AREM | 2017-03-01 | 204.0 | 0.497
Unité Fonctionnelle (UF) | 8312027648 | Care site 2 | 'Hospit' | 'All' | 'All' | ORBIS | 2018-08-01 | 22.0 | 0.274
Pôle/DMU | 8312022130 | Care site 3 | 'Hospit' | 'DP/DR' | 'Pulmonary_embolism' | ORBIS | 2022-02-01 | 9746.0 | 0.769

The BiologyProbe computes \(c(t)\), the availability of laboratory data:

The per_visit_default algorithm computes \(c(t)\), the availability of laboratory data linked to patients' administrative stays:

\[ c(t) = \frac{n_{with\,biology}(t)}{n_{visit}(t)} \]

Where \(n_{visit}(t)\) is the number of administrative stays, \(n_{with\,biology}(t)\) is the number of stays having at least one biological measurement recorded and \(t\) is the month.

If the number of visits \(n_{visit}(t)\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.

Care site level

Laboratory data are only available at hospital level.

from edsteva.probes import BiologyProbe

biology = BiologyProbe(completeness_predictor="per_visit_default")
biology.compute(
    data,
    stay_types={
        "Hospit": "hospitalisés",
    },
    concepts_sets={
        "Créatinine": "E3180|G1974|J1002|A7813|A0094|G1975|J1172|G7834|F9409|F9410|C0697|H4038|F2621",
        "Leucocytes": "A0174|K3232|H6740|E4358|C9784|C8824|E6953",
    },
)
biology.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | concepts_sets | date | n_visit | n_visit_with_measurement | c
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'Créatinine' | 2019-05-01 | 233.0 | 196.0 | 0.841
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'Leucocytes' | 2021-04-01 | 393.0 | 252.0 | 0.640
Hôpital | 8312027648 | Care site 2 | 'Hospit' | 'Créatinine' | 2017-03-01 | 204.0 | 101.0 | 0.497
Hôpital | 8312027648 | Care site 2 | 'Hospit' | 'Leucocytes' | 2018-08-01 | 22.0 | 6.0 | 0.274
Hôpital | 8312022130 | Care site 3 | 'Hospit' | 'Leucocytes' | 2022-02-01 | 9746.0 | 7495.0 | 0.769

The per_measurement_default algorithm computes \(c(t)\), the availability of biological measurements:

\[ c(t) = \frac{n_{biology}(t)}{n_{max}} \]

Where \(n_{biology}(t)\) is the number of biological measurements, \(t\) is the month and \(n_{max} = \max_{t}(n_{biology}(t))\).

If the maximum number of recorded biological measurements per month \(n_{max}\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.

Care site level

Laboratory data are only available at hospital level.

from edsteva.probes import BiologyProbe

biology = BiologyProbe(completeness_predictor="per_measurement_default")
biology.compute(
    data,
    stay_types={
        "Hospit": "hospitalisés",
    },
    concepts_sets={
        "Créatinine": "E3180|G1974|J1002|A7813|A0094|G1975|J1172|G7834|F9409|F9410|C0697|H4038|F2621",
        "Leucocytes": "A0174|K3232|H6740|E4358|C9784|C8824|E6953",
    },
)
biology.predictor.head()
care_site_level | care_site_id | care_site_short_name | stay_type | concepts_sets | date | n_measurement | c
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'Créatinine' | 2019-05-01 | 233.0 | 0.841
Hôpital | 8312057527 | Care site 1 | 'Hospit' | 'Leucocytes' | 2021-04-01 | 393.0 | 0.640
Hôpital | 8312027648 | Care site 2 | 'Hospit' | 'Créatinine' | 2017-03-01 | 204.0 | 0.497
Unité Fonctionnelle (UF) | 8312027648 | Care site 2 | 'Hospit' | 'Leucocytes' | 2018-08-01 | 22.0 | 0.274
Pôle/DMU | 8312022130 | Care site 3 | 'Hospit' | 'Leucocytes' | 2022-02-01 | 9746.0 | 0.769

The StepFunction fits a step function \(f_{t_0, c_0}(t)\) with coefficients \(\Theta = (t_0, c_0)\) on a completeness predictor \(c(t)\):

\[ \begin{aligned} f_{t_0, c_0}(t) & = c_0 \ \mathbb{1}_{t \geq t_0}(t) \\ c(t) & = f_{t_0, c_0}(t) + \epsilon(t) \end{aligned} \]
  • the characteristic time \(t_0\) estimates the time after which the data is available.
  • the characteristic value \(c_0\) estimates the stabilized routine completeness.

The default metric computed is the mean squared error after \(t_0\):

\[ error = \frac{\sum_{t_0 \leq t \leq t_{max}} \epsilon(t)^2}{t_{max} - t_0} \]
  • \(error\) estimates the stability of the data after \(t_0\).

Custom metric

You can define your own metric if this one doesn't meet your requirements.

The available algorithms used to fit the step function are listed below:

Custom algo

You can define your own algorithm if these don't meet your requirements.

This algorithm computes the estimated coefficients \(\hat{t_0}\) and \(\hat{c_0}\) by minimizing the loss function \(\mathcal{L}(t_0, c_0)\):

\[ \begin{aligned} \mathcal{L}(t_0, c_0) & = \frac{\sum_{t = t_{min}}^{t_{max}} \mathcal{l}(c(t), f_{t_0, c_0}(t))}{t_{max} - t_{min}} \\ (\hat{t_0}, \hat{c_0}) & = \underset{t_0, c_0}{\mathrm{argmin}}(\mathcal{L}(t_0, c_0)) \\ \end{aligned} \]

Default loss function \(\mathcal{l}\)

The loss function is \(l_2\) by default:

\[ \mathcal{l}(c(t), f_{t_0, c_0}(t)) = |c(t) - f_{t_0, c_0}(t)|^2 \]

Optimal estimates

To reduce computational complexity, this algorithm has been implemented with a dependency relation between \(c_0\) and \(t_0\) derived from the optimal estimates using the \(l_2\) loss function. For more information, you can have a look at the source code.
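
One way to picture such a dependency: under the \(l_2\) loss, for a fixed \(t_0\) the plateau that minimizes the loss is the mean of \(c(t)\) over \(t \geq t_0\), so only \(t_0\) needs to be searched. The brute-force sketch below relies on that fact. It is not the library's implementation: months are integer indices with \(t_{min} = 0\), the helper name is hypothetical, and the guard on the denominator for a single-point series is an assumption:

```python
def fit_step_function(c):
    """Brute-force l2 fit on a non-empty list c(t), t = 0..t_max.

    Returns (t_0, c_0, loss).
    """
    n = len(c)
    best = None
    for t_0 in range(n):
        after = c[t_0:]
        c_0 = sum(after) / len(after)  # l2-optimal plateau for this t_0
        residuals = [c[t] - (c_0 if t >= t_0 else 0.0) for t in range(n)]
        # Denominator t_max - t_min; guarded for a single-point series
        loss = sum(r * r for r in residuals) / max(n - 1, 1)
        if best is None or loss < best[2]:
            best = (t_0, c_0, loss)
    return best


print(fit_step_function([0.0, 0.0, 0.5, 0.5, 0.5]))  # (2, 0.5, 0.0)
```

Searching only over \(t_0\) reduces a two-parameter optimization to a one-dimensional scan, which is the kind of saving the note above alludes to.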

In this algorithm, \(\hat{c_0}\) is directly estimated as the \(x^{th}\) quantile of the completeness predictor \(c(t)\), where \(x\) is a number between 0 and 1. Then, \(\hat{t_0}\) is the first time \(c(t)\) reaches \(\hat{c_0}\).

\[ \begin{aligned} \hat{c_0} & = x^{th} \text{ quantile of } c(t) \\ \hat{t_0} & = \min \{ t \mid c(t) \geq \hat{c_0} \} \end{aligned} \]

Default quantile \(x\)

The default quantile is \(x = 0.8\).
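
The two steps of this quantile-based estimate can be sketched as follows. The sketch assumes a nearest-rank quantile, which always returns an observed value of \(c(t)\); the library may use another quantile convention (for example linear interpolation), and the helper names are hypothetical:

```python
import math


def nearest_rank_quantile(values, x):
    """Nearest-rank quantile: the smallest value with at least a share x below or equal."""
    ordered = sorted(values)
    index = min(max(math.ceil(x * len(ordered)) - 1, 0), len(ordered) - 1)
    return ordered[index]


def fit_quantile(c, x=0.8):
    """Return (t_0, c_0) for a non-empty list c(t), t = 0..t_max."""
    c_0 = nearest_rank_quantile(c, x)
    # First month at which c(t) reaches c_0 (always exists, since c_0 is observed)
    t_0 = next(t for t, value in enumerate(c) if value >= c_0)
    return t_0, c_0


c = [0.0, 0.1, 0.6, 0.8, 0.7, 0.8, 0.9, 0.8, 0.8, 0.8]
print(fit_quantile(c))  # (3, 0.8)
```

Compared with loss minimization, this estimate needs a single sort and a single scan, at the cost of depending on the chosen quantile \(x\).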

from edsteva.models.step_function import StepFunction

step_function_model = StepFunction()
step_function_model.fit(probe)
step_function_model.estimates.head()
care_site_level | care_site_id | stay_type | t_0 | c_0 | error
Unité Fonctionnelle (UF) | 8312056386 | 'Urg' | 2019-05-01 | 0.397 | 0.040
Unité Fonctionnelle (UF) | 8312056386 | 'All' | 2017-04-01 | 0.583 | 0.028
Pôle/DMU | 8312027648 | 'Hospit' | 2021-03-01 | 0.677 | 0.022
Pôle/DMU | 8312027648 | 'All' | 2018-08-01 | 0.764 | 0.014
Hôpital | 8312022130 | 'Hospit' | 2022-02-01 | 0.652 | 0.027

The RectangleFunction fits a rectangle function \(f_{t_0, c_0, t_1}(t)\) with coefficients \(\Theta = (t_0, c_0, t_1)\) on a completeness predictor \(c(t)\):

\[ \begin{aligned} f_{t_0, c_0, t_1}(t) & = c_0 \ \mathbb{1}_{t_0 \leq t \leq t_1}(t) \\ c(t) & = f_{t_0, c_0, t_1}(t) + \epsilon(t) \end{aligned} \]
  • the characteristic time \(t_0\) estimates the time after which the data is available.
  • the characteristic time \(t_1\) estimates the time after which the data is not available anymore.
  • the characteristic value \(c_0\) estimates the completeness between \(t_0\) and \(t_1\).

The default metric computed is the mean squared error between \(t_0\) and \(t_1\):

\[ error = \frac{\sum_{t_0 \leq t \leq t_1} \epsilon(t)^2}{t_1 - t_0} \]
  • \(error\) estimates the stability of the data between \(t_0\) and \(t_1\).

Custom metric

You can define your own metric if this one doesn't meet your requirements.

The available algorithms used to fit the rectangle function are listed below:

Custom algo

You can define your own algorithm if these don't meet your requirements.

This algorithm computes the estimated coefficients \(\hat{t_0}\), \(\hat{c_0}\) and \(\hat{t_1}\) by minimizing the loss function \(\mathcal{L}(t_0, c_0, t_1)\):

\[ \begin{aligned} \mathcal{L}(t_0, c_0, t_1) & = \frac{\sum_{t = t_{min}}^{t_{max}} \mathcal{l}(c(t), f_{t_0, c_0, t_1}(t))}{t_{max} - t_{min}} \\ (\hat{t_0}, \hat{c_0}, \hat{t_1}) & = \underset{t_0, c_0, t_1}{\mathrm{argmin}}(\mathcal{L}(t_0, c_0, t_1)) \\ \end{aligned} \]

Default loss function \(\mathcal{l}\)

The loss function is \(l_2\) by default:

\[ \mathcal{l}(c(t), f_{t_0, c_0, t_1}(t)) = |c(t) - f_{t_0, c_0, t_1}(t)|^2 \]

Optimal estimates

To reduce computational complexity, this algorithm has been implemented with a dependency relation between \(c_0\) and \(t_0\) derived from the optimal estimates using the \(l_2\) loss function. For more information, you can have a look at the source code.
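
For the rectangle function, the same reasoning applies to a window: for fixed \(t_0\) and \(t_1\), the \(l_2\)-optimal plateau is the mean of \(c(t)\) over \(t_0 \leq t \leq t_1\). The exhaustive search below is only a sketch under that observation, with integer month indices, \(t_{min} = 0\) and a hypothetical helper name; it favours readability over speed and is not the library's implementation:

```python
def fit_rectangle_function(c):
    """Brute-force l2 fit on a non-empty list c(t), t = 0..t_max.

    Returns (t_0, c_0, t_1, loss).
    """
    n = len(c)
    best = None
    for t_0 in range(n):
        for t_1 in range(t_0, n):
            inside = c[t_0 : t_1 + 1]
            c_0 = sum(inside) / len(inside)  # l2-optimal plateau for this window
            loss = sum(
                (c[t] - (c_0 if t_0 <= t <= t_1 else 0.0)) ** 2 for t in range(n)
            ) / max(n - 1, 1)  # denominator t_max - t_min, guarded for n = 1
            if best is None or loss < best[3]:
                best = (t_0, c_0, t_1, loss)
    return best


print(fit_rectangle_function([0.0, 0.5, 0.5, 0.5, 0.0, 0.0]))
# (1, 0.5, 3, 0.0)
```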

from edsteva.models.rectangle_function import RectangleFunction

rectangle_function_model = RectangleFunction()
rectangle_function_model.fit(probe)
rectangle_function_model.estimates.head()
care_site_level | care_site_id | stay_type | t_0 | c_0 | t_1 | error
Unité Fonctionnelle (UF) | 8312056386 | 'Urg' | 2019-05-01 | 0.397 | 2020-05-01 | 0.040
Unité Fonctionnelle (UF) | 8312056386 | 'All' | 2017-04-01 | 0.583 | 2013-04-01 | 0.028
Pôle/DMU | 8312027648 | 'Hospit' | 2021-03-01 | 0.677 | 2022-03-01 | 0.022
Pôle/DMU | 8312027648 | 'All' | 2018-08-01 | 0.764 | 2019-08-01 | 0.014
Hôpital | 8312022130 | 'Hospit' | 2022-02-01 | 0.652 | 2022-08-01 | 0.027

The library provides interactive dashboards that let you select any combination of care sites, stay types and other columns if included in the Probe. Dashboards can only be exported in HTML format.

The probe_dashboard() returns:

  • On the top, the aggregated variable is the average completeness predictor \(c(t)\) over time \(t\) with the prediction \(\hat{c}(t)\) if the fitted Model is specified.
  • On the bottom, the interactive filters are all the columns included in the Probe (such as time, care site, number of visits, etc.).

from edsteva.viz.dashboards import probe_dashboard

probe_dashboard(
    probe=probe,
    fitted_model=step_function_model,
    care_site_level=care_site_level,
)
An example is available here.

The normalized_probe_dashboard() returns a representation of the overall deviation from the Model:

  • On the top, the aggregated variable is a normalized completeness predictor \(\frac{c(t)}{c_0}\) over normalized time \(t - t_0\).
  • On the bottom, the interactive filters are all the columns included in the Probe (such as time, care site, number of visits, etc.) with all the Model coefficients and metrics included in the Model.

from edsteva.viz.dashboards import normalized_probe_dashboard

normalized_probe_dashboard(
    probe=probe,
    fitted_model=step_function_model,
    care_site_level=care_site_level,
)

An example is available here.
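
The normalization used by this dashboard can be sketched for a single care site as follows. The `normalize` helper is hypothetical, months are integer indices, and returning 0 when \(c_0 = 0\) is an assumption made to avoid dividing by zero:

```python
def normalize(c, t_0, c_0):
    """Return [(t - t_0, c(t) / c_0)] for a list c(t), t = 0..t_max."""
    if c_0 == 0:
        return [(t - t_0, 0.0) for t in range(len(c))]  # assumption
    return [(t - t_0, value / c_0) for t, value in enumerate(c)]


print(normalize([0.0, 0.25, 0.5, 0.5], t_0=2, c_0=0.5))
# [(-2, 0.0), (-1, 0.5), (0, 1.0), (1, 1.0)]
```

After this change of variables, every care site that follows the Model closely is near 0 before normalized time 0 and near 1 afterwards, which is what makes care sites with different \(t_0\) and \(c_0\) comparable on one plot.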

The library provides static plots that you can export in png or svg. As they are less interactive, you have to specify the filters in the inputs of the functions.

The probe_plot() returns the top plot of the probe_dashboard(): the average completeness predictor \(c(t)\) over time \(t\), with the prediction \(\hat{c}(t)\) if the fitted Model is specified.

from edsteva.viz.plots import probe_plot

probe_plot(
    probe=probe,
    fitted_model=step_function_model,
    care_site_level=care_site_level,
    stay_type=stay_type,
    save_path=plot_path,
)

The normalized_probe_plot() returns the top plot of the normalized_probe_dashboard(). Consequently, you have to specify the filters in the inputs of the function.

from edsteva.viz.plots import normalized_probe_plot

normalized_probe_plot(
    probe=probe,
    fitted_model=step_function_model,
    t_min=-15,
    t_max=15,
    save_path=plot_path,
)

The estimates_densities_plot() returns the density plot and the median of each estimate. It can help you to set the thresholds.

from edsteva.viz.plots import estimates_densities_plot

estimates_densities_plot(
    fitted_model=step_function_model,
)


  1. Samuel G. Finlayson, Adarsh Subbaswamy, Karandeep Singh, John Bowers, Annabel Kupke, Jonathan Zittrain, Isaac S. Kohane, and Suchi Saria. The clinician and dataset shift in artificial intelligence. The New England Journal of Medicine, 385(3):283, 2021.