Skip to content

Probe

Choosing or customizing a Probe is the second step in the EDS-TeVa usage workflow.

Definition

A Probe is a python class designed to characterize data availability of a target variable over time \(t\). It aggregates the loaded data to obtain a completeness predictor \(c(t)\).

Image title

Probe class diagram

Input

As detailled in the dedicated section, the Probe class is expecting a Data object with Pandas or Koalas DataFrames. We provide various connectors to facilitate data fetching, namely a Hive connector, a Postgres connector and a LocalData.

Attributes

  • predictor is a Pandas.DataFrame computed by the compute() method. It contains the desired completeness predictor \(c(t)\) for each column in the _index attribute (care site, stay type and any other needed column).
  • _index is the list of columns that are used to aggregate the data in the compute() method.

Methods

  • compute() method calls the compute_process() method to compute the completeness predictors \(c(t)\) and store them in the predictor attribute.
  • compute_process() method aggregates the input data to compute the completeness predictors \(c(t)\).
  • filter_care_site() method filters predictor attribute on the selected care sites including upper and lower levels care sites.
  • save() method saves the Probe in the desired path. By default it is saved in the the cache directory (~/.cache/edsteva/probes).
  • load() method loads the Probe from the desired path. By default it is loaded from the the cache directory (~/.cache/edsteva/probes).

Predictor schema

Data stored in predictor attribute follows a specific schema:

Predictors

It must include a completeness predictor \(c(t)\):

  • c: value of the completeness predictor \(c(t)\).

Then, it can have any other extra predictor you find useful such as:

  • n_visit: the number of visits.

Extra predictor

The extra predictors must be additive to be aggregated properly in the dashboards. For instance, the number of visits is additive but the \(99^{th}\) percentile is not.

Indexes

It must include one and only one time related column:

  • date: date of the event associated with the target variable (by default, the dates are truncated to the month in which the event occurs).

Then, it can have any other string type column such as:

  • care_site_level: care site hierarchic level (uf, pole, hospital).
  • care_site_id: care site unique identifier.
  • stay_type: type of stay (hospitalisés, urgence, hospitalisation incomplète, consultation externe).
  • note_type: type of note (CRH, Ordonnance, CR Passage Urgences).

Example

When considering the availability of clinical notes, a NoteProbe.predictor may for instance look like this:

care_site_level care_site_id care_site_short_name stay_type note_type date n_visit c
Unité Fonctionnelle (UF) 8312056386 Care site 1 'Urg_Hospit' 'All' 2019-05-01 233.0 '0.841
Unité Fonctionnelle (UF) 8653815660 Care site 1 'All' 'CRH' 2011-04-01 393.0 0.640
Pôle/DMU 8312027648 Care site 2 'Urg_Hospit' 'CRH' 2021-03-01 204.0 0.497
Pôle/DMU 8312056379 Care site 2 'All' 'Ordonnance' 2018-08-01 22.0 0.274
Hôpital 8312022130 Care site 3 'Urg_Hospit' 'CR Passage Urgences' 2022-02-01 9746.0 0.769

Saving and loading a computed Probe

In order to ease the future loading of a Probe that has been computed with the compute() method, one can pickle it using the save() method. This enables a rapid loading of the Probe from local disk using the load() method.

from edsteva.probes import NoteProbe

note = NoteProbe()

note.compute(data)  # (1)
note.save()  # (2)

note_2 = NoteProbe()
note_2.load()  # (3)
  1. Computation of the Probe querying the database (long).
  2. Saving of the Probe on the local disk.
  3. Rapid loading of the Probe fom the local disk.

Defining a custom Probe

If none of the available Probes meets your requirements, you may want to create your own. To define a custom Probe class CustomProbe that inherits from the abstract class BaseProbe you'll have to implement the compute_process() method (this method is natively called by the compute() method inherited by the BaseProbe class). You'll also have to define the _index attribute which is the list of columns that are used to aggregate the data in the compute_process() method.

from edsteva.probes import BaseProbe


# Definition of a new Probe class
class CustomProbe(BaseProbe):
    def __init__(
        self,
    ):
        self._index = ["my_custom_column_1", "my_custom_column_2"]
        super().__init__(
            index=self._index,
        )

    def compute_process(
        self,
        data: Data,
        **kwargs,
    ):
        # query using Pandas API
        return custom_predictor

compute_process() can take as much as argument as you need but it must include a data argument and must return a Pandas.DataFrame which contains at least the columns of the standard schema of a predictor. For a detailed example of the implementation of a Probe, please have a look on the implemented Probes such as VisitProbe or NoteProbe.

Contributions

If you managed to create your own Probe do not hesitate to share it with the community by following the contribution guidelines. Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.

Available Probes

We list hereafter the Probes that have already been implemented in the library.

The VisitProbe computes \(c_{visit}(t)\) the availability of administrative stays:

\[ c(t) = \frac{n_{visit}(t)}{n_{max}} \]

Where \(n_{visit}(t)\) is the number of administrative stays, \(t\) is the month and \(n_{max} = \max_{t}(n_{visit}(t))\).

If the maximum number of records per month \(n_{max}\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.

from edsteva.probes import VisitProbe

visit = VisitProbe()
visit.compute(
    data,
    stay_types={
        "Urg": "urgence",
        "Hospit": "hospitalisés",
        "Urg_Hospit": "urgence|hospitalisés",
    },
)
visit.predictor.head()
care_site_level care_site_id care_site_short_name stay_type date n_visit c
Unité Fonctionnelle (UF) 8312056386 Care site 1 'Urg' 2019-05-01 233.0 0.841
Unité Fonctionnelle (UF) 8312056386 Care site 1 'Urg' 2021-04-01 393.0 0.640
Pôle/DMU 8312027648 Care site 2 'Hospit' 2011-03-01 204.0 0.497
Pôle/DMU 8312027648 Care site 2 'Urg' 2018-08-01 22.0 0.274
Hôpital 8312022130 Care site 3 'Urg_Hospit' 2022-02-01 9746.0 0.769

The NoteProbe computes \(c_{note}(t)\) the availability of clinical documents:

The per_visit_default algorithm computes \(c_(t)\) the availability of clinical documents linked to patients' administrative stays:

\[ c(t) = \frac{n_{with\,doc}(t)}{n_{visit}(t)} \]

Where \(n_{visit}(t)\) is the number of administrative stays, \(n_{with\,doc}\) the number of visits having at least one document and \(t\) is the month.

If the number of visits \(n_{visit}(t)\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.

from edsteva.probes import NoteProbe

note = Note(completeness_predictor="per_visit_default")
note.compute(
    data,
    stay_types={
        "Urg": "urgence",
        "Hospit": "hospitalisés",
        "Urg_Hospit": "urgence|hospitalisés",
    },
    note_types={
        "All": ".*",
        "CRH": "crh",
        "Ordonnance": "ordo",
        "CR Passage Urgences": "urge",
    },
)
note.predictor.head()
care_site_level care_site_id care_site_short_name stay_type note_type date n_visit n_visit_with_note c
Unité Fonctionnelle (UF) 8312056386 Care site 1 'Urg' 'All' 2019-05-01 233.0 196.0 0.841
Unité Fonctionnelle (UF) 8653815660 Care site 1 'Hospit' 'CRH' 2011-04-01 393.0 252.0 0.640
Pôle/DMU 8312027648 Care site 2 'Hospit' 'CRH' 2021-03-01 204.0 101.0 0.497
Pôle/DMU 8312056379 Care site 2 'Urg' 'Ordonnance' 2018-08-01 22.0 6.0 0.274
Hôpital 8312022130 Care site 3 'Urg_Hospit' 'CR Passage Urgences' 2022-02-01 9746.0 7495.0 0.769

The per_note_default algorithm computes \(c_(t)\) the availability of clinical documents as follow:

\[ c(t) = \frac{n_{note}(t)}{n_{max}} \]

Where \(n_{note}(t)\) is the number of clinical documents, \(t\) is the month and \(n_{max} = \max_{t}(n_{note}(t))\).

If the maximum number of recorded notes per month \(n_{max}\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.

from edsteva.probes import NoteProbe

note = Note(completeness_predictor="per_note_default")
note.compute(
    data,
    stay_types={
        "Urg": "urgence",
        "Hospit": "hospitalisés",
        "Urg_Hospit": "urgence|hospitalisés",
    },
    note_types={
        "All": ".*",
        "CRH": "crh",
        "Ordonnance": "ordo",
        "CR Passage Urgences": "urge",
    },
)
note.predictor.head()
care_site_level care_site_id care_site_short_name stay_type note_type date n_note c
Unité Fonctionnelle (UF) 8312056386 Care site 1 'Urg' 'All' 2019-05-01 233.0 0.841
Unité Fonctionnelle (UF) 8653815660 Care site 1 'Hospit' 'CRH' 2011-04-01 393.0 0.640
Pôle/DMU 8312027648 Care site 2 'Hospit' 'CRH' 2021-03-01 204.0 0.497
Pôle/DMU 8312056379 Care site 2 'Urg' 'Ordonnance' 2018-08-01 22.0 0.274
Hôpital 8312022130 Care site 3 'Urg_Hospit' 'CR Passage Urgences' 2022-02-01 9746.0 0.769

The ConditionProbe computes \(c_{condition}(t)\) the availability of claim data:

The per_visit_default algorithm computes \(c_(t)\) the availability of claim data linked to patients' administrative stays:

\[ c(t) = \frac{n_{with\,condition}(t)}{n_{visit}(t)} \]

Where \(n_{visit}(t)\) is the number of administrative stays, \(n_{with\,condition}\) the number of stays having at least one claim code (e.g. ICD-10) recorded and \(t\) is the month.

If the number of visits \(n_{visit}(t)\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.

Care site level

AREM claim data are only available at hospital level.

from edsteva.probes import ConditionProbe

condition = ConditionProbe(completeness_predictor="per_visit_default")
condition.compute(
    data,
    stay_types={
        "Hospit": "hospitalisés",
    },
    diag_types={
        "All": ".*",
        "DP/DR": "DP|DR",
    },
    condition_types={
        "All": ".*",
        "Pulmonary_embolism": "I26",
    },
    source_systems=["AREM", "ORBIS"],
)
condition.predictor.head()
care_site_level care_site_id care_site_short_name stay_type diag_type condition_type source_systems date n_visit n_visit_with_condition c
Hôpital 8312057527 Care site 1 'Hospit' 'All' 'Pulmonary_embolism' AREM 2019-05-01 233.0 196.0 0.841
Hôpital 8312057527 Care site 1 'Hospit' 'DP/DR' 'Pulmonary_embolism' AREM 2021-04-01 393.0 252.0 0.640
Hôpital 8312027648 Care site 2 'Hospit' 'All' 'Pulmonary_embolism' AREM 2011-03-01 204.0 101.0 0.497
Unité Fonctionnelle (UF) 8312027648 Care site 2 'Hospit' 'All' 'All' ORBIS 2018-08-01 22.0 6.0 0.274
Pôle/DMU 8312022130 Care site 3 'Hospit' 'DP/DR' 'Pulmonary_embolism' ORBIS 2022-02-01 9746.0 7495.0 0.769

The per_condition_default algorithm computes \(c_(t)\) the availability of claim data as follow:

\[ c(t) = \frac{n_{condition}(t)}{n_{max}} \]

Where \(n_{condition}(t)\) is the number of claim codes (e.g. ICD-10) recorded, \(t\) is the month and \(n_{max} = \max_{t}(n_{condition}(t))\).

If the maximum number of recorded diagnosis per month \(n_{max}\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.

from edsteva.probes import ConditionProbe

condition = ConditionProbe(completeness_predictor="per_condition_default")
condition.compute(
    data,
    stay_types={
        "All": ".*",
        "Hospit": "hospitalisés",
    },
    diag_types={
        "All": ".*",
        "DP/DR": "DP|DR",
    },
    condition_types={
        "All": ".*",
        "Pulmonary_embolism": "I26",
    },
    source_systems=["AREM", "ORBIS"],
)
condition.predictor.head()
care_site_level care_site_id care_site_short_name stay_type diag_type condition_type source_systems date n_condition c
Hôpital 8312057527 Care site 1 'Hospit' 'All' 'Pulmonary_embolism' AREM 2019-05-01 233.0 0.841
Hôpital 8312057527 Care site 1 'Hospit' 'DP/DR' 'Pulmonary_embolism' AREM 2021-04-01 393.0 0.640
Hôpital 8312027648 Care site 2 'Hospit' 'All' 'Pulmonary_embolism' AREM 2011-03-01 204.0 0.497
Unité Fonctionnelle (UF) 8312027648 Care site 2 'Hospit' 'All' 'All' ORBIS 2018-08-01 22.0 0.274
Pôle/DMU 8312022130 Care site 3 'Hospit' 'DP/DR' 'Pulmonary_embolism' ORBIS 2022-02-01 9746.0 0.769

The BiologyProbe computes \(c_(t)\) the availability of laboratory data:

The per_visit_default algorithm computes \(c_(t)\) the availability of laboratory data linked to patients' administrative stays:

\[ c(t) = \frac{n_{with\,biology}(t)}{n_{visit}(t)} \]

Where \(n_{visit}(t)\) is the number of administrative stays, \(n_{with\,biology}\) the number of stays having at least one biological measurement recorded and \(t\) is the month.

If the number of visits \(n_{visit}(t)\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.

Care site level

Laboratory data are only available at hospital level.

from edsteva.probes import BiologyProbe

biology = BiologyProbe(completeness_predictor="per_visit_default")
biology.compute(
    data,
    stay_types={
        "Hospit": "hospitalisés",
    },
    concepts_sets={
        "Créatinine": "E3180|G1974|J1002|A7813|A0094|G1975|J1172|G7834|F9409|F9410|C0697|H4038|F2621",
        "Leucocytes": "A0174|K3232|H6740|E4358|C9784|C8824|E6953",
    },
)
biology.predictor.head()
care_site_level care_site_id care_site_short_name stay_type concepts_sets date n_visit n_visit_with_measurement c
Hôpital 8312057527 Care site 1 'Hospit' 'Créatinine' 2019-05-01 233.0 196.0 0.841
Hôpital 8312057527 Care site 1 'Hospit' 'Leucocytes' 2021-04-01 393.0 252.0 0.640
Hôpital 8312027648 Care site 2 'Hospit' 'Créatinine' 2011-03-01 204.0 101.0 0.497
Hôpital 8312027648 Care site 2 'Hospit' 'Leucocytes' 2018-08-01 22.0 6.0 0.274
Hôpital 8312022130 Care site 3 'Hospit' 'Leucocytes' 2022-02-01 9746.0 7495.0 0.769

The per_measurement_default algorithm computes \(c_(t)\) the availability of biological measurements:

\[ c(t) = \frac{n_{biology}(t)}{n_{max}} \]

Where \(n_{biology}(t)\) is the number of biological measurements, \(t\) is the month and \(n_{max} = \max_{t}(n_{biology}(t))\).

If the maximum number of recorded biological measurements per month \(n_{max}\) is equal to 0, we consider that the completeness predictor \(c(t)\) is also equal to 0.

Care site level

Laboratory data are only available at hospital level.

from edsteva.probes import BiologyProbe

biology = BiologyProbe(completeness_predictor="per_measurement_default")
biology.compute(
    data,
    stay_types={
        "Hospit": "hospitalisés",
    },
    concepts_sets={
        "Créatinine": "E3180|G1974|J1002|A7813|A0094|G1975|J1172|G7834|F9409|F9410|C0697|H4038|F2621",
        "Leucocytes": "A0174|K3232|H6740|E4358|C9784|C8824|E6953",
    },
)
biology.predictor.head()
care_site_level care_site_id care_site_short_name stay_type concepts_sets date n_measurement c
Hôpital 8312057527 Care site 1 'Hospit' 'Créatinine' 2019-05-01 233.0 0.841
Hôpital 8312057527 Care site 1 'Hospit' 'Leucocytes' 2021-04-01 393.0 0.640
Hôpital 8312027648 Care site 2 'Hospit' 'Créatinine' 2011-03-01 204.0 0.497
Unité Fonctionnelle (UF) 8312027648 Care site 2 'Hospit' 'Leucocytes' 2018-08-01 22.0 0.274
Pôle/DMU 8312022130 Care site 3 'Hospit' 'Leucocytes' 2022-02-01 9746.0 0.769