Getting started
eds-scikit is a tool to assist data scientists working on the AP-HP’s Clinical Data Warehouse. It specifically targets OMOP-standardized data and aims to:
- Ease access and analysis of data
- Allow a better transfer of knowledge between projects
- Improve research reproducibility
As an example, the following figure was obtained using various functionalities from eds-scikit.
How was it done?
Click on the figure above to jump to the tutorial using various functionalities from eds-scikit, or continue reading the introduction!
Using eds-scikit with I2B2
Although designed for OMOP databases, eds-scikit also provides a connector for I2B2 databases. We don't guarantee that it is exhaustive, but it should let you use the library's functionalities seamlessly.
Quick start
Installation
Requirements
eds-scikit stands on the shoulders of Spark 2.4, which runs on Java 8 and Python ~3.7.1. If you work on AP-HP's CDW, these requirements are already fulfilled, so you can skip the following steps. Otherwise, you need to:
- Install a version of Python ≥ 3.7.1 and < 3.8.
- Install OpenJDK 8, an open-source reference implementation of Java 8, with the following command lines.
On Debian/Ubuntu (apt):
$ sudo apt-get update
$ sudo apt-get install openjdk-8-jdk
For more details, check this installation guide
On macOS (Homebrew):
$ brew tap AdoptOpenJDK/openjdk
$ brew install --cask adoptopenjdk8
For more details, check this installation guide
Follow this installation guide
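The requirements above can be checked from Python before installing anything. The snippet below is a minimal sketch and not part of eds-scikit: the helper names `python_version_supported` and `java_on_path` are hypothetical, and the Java check only confirms that a `java` executable is on the PATH, not that it is version 8.

```python
import shutil
import sys


def python_version_supported(version_info=sys.version_info):
    """Return True when 3.7.1 <= version < 3.8 (the range stated above)."""
    return (3, 7, 1) <= tuple(version_info[:3]) < (3, 8)


def java_on_path():
    """Return True when a `java` executable can be found on the PATH.

    Assumption: this does not verify that the executable is Java 8.
    """
    return shutil.which("java") is not None


print("Python version supported:", python_version_supported())
print("java found on PATH:", java_on_path())
```

If the first line prints `False`, the interpreter falls outside the range listed above; running `java -version` in a terminal remains the reliable way to confirm the Java version itself.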
You can install eds-scikit via pip:
$ pip install eds-scikit
Possible issue with pip
If you get an error during installation, try downgrading pip via pip install -U "pip<23" before installing eds-scikit.
Improving performance on distributed data
It is highly recommended (but not mandatory) to use the helper function eds_scikit.improve_performances to optimally configure PySpark and Koalas. You can simply call:
import eds_scikit
spark, sc, sql = eds_scikit.improve_performances()
The call returns three objects:
- A SparkSession
- A SparkContext
- An sql function to execute SQL queries
A first example: Merging visits together
Let's tackle a common problem when dealing with clinical data: merging close or consecutive visits into stays. As detailed in the dedicated section, eds-scikit expects to work with Pandas or Koalas DataFrames. We provide various connectors to facilitate data fetching, namely a Hive connector and a Postgres connector.
from eds_scikit.io import HiveData
data = HiveData(DB_NAME)
visit_occurrence = data.visit_occurrence # (1)
- (1) With this connector, visit_occurrence will be a Pandas DataFrame
I2B2
If DB_NAME points to an I2B2 database, use data = HiveData(DB_NAME, database_type="I2B2")
from eds_scikit.io import PostgresData
DB_NAME = "my_db"
SCHEMA = "my_schema"
USER = "my_username"
data = PostgresData(DB_NAME, schema=SCHEMA, user=USER) # (1)
visit_occurrence = data.visit_occurrence # (2)
- (1) This connector expects a .pgpass file storing the connection parameters
- (2) With this connector, visit_occurrence will be a Pandas DataFrame
Data from any other source can be used as well, provided that the table meets two conditions:
- It follows the OMOP format
- It is a Pandas or Koalas DataFrame
import pandas as pd
visit_occurrence = pd.read_csv("./data/visit_occurrence.csv")
visit_occurrence
For the sake of the example, only columns of interest are shown here.
 | visit_occurrence_id | person_id | visit_start_datetime | visit_end_datetime | visit_source_value | row_status_source_value | care_site_id |
---|---|---|---|---|---|---|---|
0 | A | 999 | 2021-01-01 00:00:00 | 2021-01-05 00:00:00 | hospitalisés | courant | 1 |
1 | B | 999 | 2021-01-04 00:00:00 | 2021-01-08 00:00:00 | hospitalisés | courant | 1 |
2 | C | 999 | 2021-01-12 00:00:00 | 2021-01-18 00:00:00 | hospitalisés | courant | 1 |
3 | D | 999 | 2021-01-13 00:00:00 | 2021-01-14 00:00:00 | urgence | courant | 1 |
4 | E | 999 | 2021-01-19 00:00:00 | 2021-01-21 00:00:00 | hospitalisés | courant | 2 |
5 | F | 999 | 2021-01-25 00:00:00 | 2021-01-27 00:00:00 | hospitalisés | supprimé | 1 |
... | ... | ... | ... | ... | ... | ... | ... |
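To follow along without access to a data warehouse, the six sample rows shown above can be written out as a CSV file. This is a minimal sketch using only the standard library; the helper name `write_sample` is hypothetical, and the rows are copied from the table (the elided rows are not reproduced).

```python
import csv
import io

COLUMNS = [
    "visit_occurrence_id",
    "person_id",
    "visit_start_datetime",
    "visit_end_datetime",
    "visit_source_value",
    "row_status_source_value",
    "care_site_id",
]

# Rows copied from the sample table above.
ROWS = [
    ("A", 999, "2021-01-01 00:00:00", "2021-01-05 00:00:00", "hospitalisés", "courant", 1),
    ("B", 999, "2021-01-04 00:00:00", "2021-01-08 00:00:00", "hospitalisés", "courant", 1),
    ("C", 999, "2021-01-12 00:00:00", "2021-01-18 00:00:00", "hospitalisés", "courant", 1),
    ("D", 999, "2021-01-13 00:00:00", "2021-01-14 00:00:00", "urgence", "courant", 1),
    ("E", 999, "2021-01-19 00:00:00", "2021-01-21 00:00:00", "hospitalisés", "courant", 2),
    ("F", 999, "2021-01-25 00:00:00", "2021-01-27 00:00:00", "hospitalisés", "supprimé", 1),
]


def write_sample(handle):
    """Write the header and the sample rows to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(COLUMNS)
    writer.writerows(ROWS)


# Write to an in-memory buffer here; pass a real file handle to create the file.
buffer = io.StringIO()
write_sample(buffer)
print(buffer.getvalue())
```

To create the file itself, open it with `open("./data/visit_occurrence.csv", "w", newline="", encoding="utf-8")` and pass the handle to `write_sample`. When reading it back with `pd.read_csv`, the two datetime columns arrive as plain strings unless they are parsed, for example with the `parse_dates` argument.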
# Importing the desired functions:
from eds_scikit.period.stays import merge_visits, get_stays_duration
# Calling the first function: computing stays
visit_occurrence = merge_visits(visit_occurrence)
As you can see, the function added a STAY_ID column, which groups visits together:
visit_occurrence[["visit_occurrence_id","STAY_ID"]]
 | visit_occurrence_id | STAY_ID |
---|---|---|
0 | A | A |
1 | B | A |
2 | C | C |
3 | D | C |
4 | E | E |
5 | F | F |
... | ... | ... |
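To build intuition for the grouping shown above, here is a deliberately simplified sketch in plain Python. It is not the implementation of `merge_visits`: it assumes that visits are merged only within the same patient and care site, and that a visit joins the current stay when it starts no later than one day after that stay ends. The function name `assign_stays` and the `max_gap` parameter are hypothetical, and source values and deleted rows are ignored.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# (visit_occurrence_id, person_id, start, end, care_site_id), from the sample table.
VISITS = [
    ("A", 999, datetime(2021, 1, 1), datetime(2021, 1, 5), 1),
    ("B", 999, datetime(2021, 1, 4), datetime(2021, 1, 8), 1),
    ("C", 999, datetime(2021, 1, 12), datetime(2021, 1, 18), 1),
    ("D", 999, datetime(2021, 1, 13), datetime(2021, 1, 14), 1),
    ("E", 999, datetime(2021, 1, 19), datetime(2021, 1, 21), 2),
    ("F", 999, datetime(2021, 1, 25), datetime(2021, 1, 27), 1),
]


def assign_stays(visits, max_gap=timedelta(days=1)):
    """Map each visit id to the id of the first visit of its stay."""
    # Assumption: only visits of the same patient in the same care site can merge.
    groups = defaultdict(list)
    for visit in visits:
        _, person_id, _, _, care_site_id = visit
        groups[(person_id, care_site_id)].append(visit)

    stay_of = {}
    for group in groups.values():
        group.sort(key=lambda visit: visit[2])  # chronological order by start
        stay_id, stay_end = None, None
        for visit_id, _, start, end, _ in group:
            if stay_id is None or start > stay_end + max_gap:
                # Too far from the current stay (or first visit): open a new stay.
                stay_id, stay_end = visit_id, end
            else:
                # Close enough: extend the current stay if this visit ends later.
                stay_end = max(stay_end, end)
            stay_of[visit_id] = stay_id
    return stay_of


print(assign_stays(VISITS))
```

On the sample rows this reproduces the STAY_ID column above (A and B in stay A, C and D in stay C, E and F on their own). Raising `max_gap` to four days would attach visit C to stay A in this sketch, which illustrates how sensitive the grouping is to that tolerance.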
# Calling the second function: computing stays duration
stays = get_stays_duration(visit_occurrence, missing_end_date_handling="coerce")
Here, the duration of each stay was computed, handling potential overlaps and inclusions between visits. In this output, STAY_DURATION is expressed in hours (168 for the 7-day stay A):
stays
STAY_ID | t_start | t_end | STAY_DURATION |
---|---|---|---|
A | 2021-01-01 00:00:00 | 2021-01-08 00:00:00 | 168 |
C | 2021-01-12 00:00:00 | 2021-01-18 00:00:00 | 144 |
E | 2021-01-19 00:00:00 | 2021-01-21 00:00:00 | 48 |
F | 2021-01-25 00:00:00 | 2021-01-27 00:00:00 | 48 |
... | ... | ... | ... |
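The figures in this table are consistent with a simple reading: a stay runs from its earliest visit start to its latest visit end, and the duration is that span in hours. The sketch below checks this reading on the sample rows. It is an illustration under that assumption, not the code of `get_stays_duration`, and the function name `stay_durations_in_hours` is hypothetical.

```python
from datetime import datetime

# (STAY_ID, visit start, visit end), taken from the two tables above.
VISITS_WITH_STAYS = [
    ("A", datetime(2021, 1, 1), datetime(2021, 1, 5)),
    ("A", datetime(2021, 1, 4), datetime(2021, 1, 8)),
    ("C", datetime(2021, 1, 12), datetime(2021, 1, 18)),
    ("C", datetime(2021, 1, 13), datetime(2021, 1, 14)),
    ("E", datetime(2021, 1, 19), datetime(2021, 1, 21)),
    ("F", datetime(2021, 1, 25), datetime(2021, 1, 27)),
]


def stay_durations_in_hours(visits):
    """Return {stay_id: hours between earliest start and latest end}."""
    bounds = {}
    for stay_id, start, end in visits:
        t_start, t_end = bounds.get(stay_id, (start, end))
        bounds[stay_id] = (min(t_start, start), max(t_end, end))
    return {
        stay_id: (t_end - t_start).total_seconds() / 3600
        for stay_id, (t_start, t_end) in bounds.items()
    }


print(stay_durations_in_hours(VISITS_WITH_STAYS))
```

Taking the overall span avoids double counting: visits A and B overlap by one day, so naively adding their individual durations would give 192 hours for stay A instead of 168.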
About the code above
As you noticed, the pipeline above is fairly straightforward, needing only the visit_occurrence DataFrame as input.
However, it is also highly customizable, and you should always look into the available options of the functions you're using. For instance, the following parameters could have been used:
from datetime import timedelta  # needed for the duration parameters below

visit_occurrence = merge_visits(
    visit_occurrence,
    remove_deleted_visits=True,
    long_stay_threshold=timedelta(days=365),
    long_stay_filtering="all",
    max_timedelta=timedelta(hours=24),
    merge_different_hospitals=False,
    merge_different_source_values=["hospitalisés", "urgence"],
)
stays = get_stays_duration(
    visit_occurrence,
    algo="sum_of_visits_duration",
    missing_end_date_handling="coerce",
)
A word about AP-HP
Specifics of AP-HP CDW
eds-scikit was developed by AP-HP's Data Science team with the help of Inria's Soda team. As such, it is especially well suited to AP-HP's Data Warehouse. In this documentation, we use the following card to mention information that might be useful when using eds-scikit with AP-HP's data:
Some information
Here, we might for instance suggest some parameters for a function that should be used given AP-HP's data.
EDS-NLP
Also, a rule-based NLP library (EDS-NLP) designed to work on clinical texts was developed in parallel with eds-scikit. We decided not to include EDS-NLP as a dependency. Still, some functions might require an input à la note_nlp: for instance, the current function designed to extract consultation dates from a visit_occurrence can work either on structured data only or with dates extracted from text and compiled in a DataFrame.
You are free to use the method of your choice to get this DataFrame, as long as it contains the necessary columns as mentioned in the documentation. Note that we use the following card to mention the availability of a dedicated EDS-NLP pipeline:
A dedicated pipe
For the example above, a consultation date pipeline exists. Moreover, methods are available to run an EDS-NLP pipeline on a Pandas, Spark or even Koalas DataFrame!
Contributing to eds-scikit
We welcome contributions! Fork the project and create a pull request. Take a look at the dedicated page for details.
Citation
If you use eds-scikit, please cite us as below.
@misc{eds-scikit,
    author = {Petit-Jean, Thomas and Remaki, Adam and Maladière, Vincent and Varoquaux, Gaël and Bey, Romain},
    doi = {10.5281/zenodo.7401549},
    title = {eds-scikit: data analysis on OMOP databases},
    url = {https://github.com/aphp/eds-scikit}
}