Project description
Goal
eds-scikit is a tool to assist data scientists working on the AP-HP's Clinical Data Warehouse. It is specifically targeted at OMOP-standardized data to:
- Ease access and analysis of data
- Allow a better transfer of knowledge between projects
- Improve research reproducibility
Main working principles
Dealing with various data sizes
Generally, data analysis can be done in two ways:
- Locally, by loading everything in RAM and working with e.g. Pandas
- In a distributed fashion, when dealing with a lot of data, by using e.g. Spark
While working with Pandas is often more convenient, it becomes problematic when working with large cohorts. Thus, making eds-scikit a Pandas-only library wasn't conceivable. To allow analyses to be conducted at scale, eds-scikit integrates with Koalas.
Koalas
Koalas is a library implementing the Pandas API on top of Spark. In essence, it allows functions and methods developed for Pandas DataFrames to work on Spark DataFrames with close to no adjustments.
Let us see a dummy example where one wants to count the number of visit occurrences per month.
Suppose we have a Spark `visit_occurrence` DataFrame:
```python
type(visit_occurrence_spark)
# Out: pyspark.sql.dataframe.DataFrame
```
```python
import pyspark.sql.functions as F


def get_stats_spark(visit_occurrence):
    """
    Computes the number of visits per month

    Parameters
    ----------
    visit_occurrence : DataFrame

    Returns
    -------
    stats : pd.DataFrame
    """
    # Adding a month and year column
    visit_occurrence = visit_occurrence.withColumn(
        "year", F.year("visit_start_datetime")
    ).withColumn("month", F.month("visit_start_datetime"))

    # Grouping and filtering
    stats = (
        visit_occurrence.groupby(["year", "month"])
        .count()
        .filter(F.col("year") >= 2017)
        .toPandas()
    )

    return stats


stats_from_spark = get_stats_spark(visit_occurrence_spark)
```
If the selected database contains few enough visits, we may have a `visit_occurrence` DataFrame small enough to fit in memory as a Pandas DataFrame:
```python
type(visit_occurrence_pandas)
# Out: pandas.core.frame.DataFrame
```
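As a side note, one way (among others) to obtain such a local DataFrame is to collect a small Spark DataFrame with `.toPandas()`:

```python
# Only do this if the data is known to be small enough to fit in memory
visit_occurrence_pandas = visit_occurrence_spark.toPandas()
```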
Then run the same analysis:
```python
def get_stats_pandas(visit_occurrence):
    """
    Computes the number of visits per month

    Parameters
    ----------
    visit_occurrence : DataFrame

    Returns
    -------
    stats : pd.DataFrame
    """
    # Adding a month and year column
    visit_occurrence["year"] = visit_occurrence["visit_start_datetime"].dt.year
    visit_occurrence["month"] = visit_occurrence["visit_start_datetime"].dt.month

    # Grouping and filtering
    stats = (
        visit_occurrence.groupby(["year", "month"])
        .visit_occurrence_id.count()
        .reset_index()
    )
    stats = stats[stats["year"] >= 2017]
    stats.columns = ["year", "month", "count"]

    return stats


stats_from_pandas = get_stats_pandas(visit_occurrence_pandas)
```
The two examples above clearly show the syntax differences between using Pandas and using Spark.
In order for a library to work with both Pandas and Spark, one would need to develop each function twice to accommodate the two frameworks. Another problem arises when dealing with a huge cohort that forces you to run your final analysis in a distributed manner via Spark: in that scenario, you couldn't test your code on a small Pandas DataFrame subset.
The goal of Koalas is precisely to avoid this issue. It allows code written for Pandas DataFrames to also run, with (almost) no adjustments, on Spark DataFrames:
```python
from databricks import koalas as ks

# Converting the Spark DataFrame into a Koalas DataFrame
visit_occurrence_koalas = visit_occurrence_spark.to_koalas()
```
Info
The code above allows the DataFrame to stay distributed, as opposed to applying the `.toPandas()` method.
We can now use the function we designed for Pandas with a Koalas DataFrame:
```python
stats_from_koalas = get_stats_pandas(visit_occurrence_koalas)
```
Since we aggregated the data, its size is now manageable, so we can convert it back to Pandas, e.g. for plotting:
```python
stats_from_koalas = stats_from_koalas.to_pandas()
```
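For instance, a minimal plotting sketch (assuming matplotlib is installed; this is not part of eds-scikit) could look like:

```python
import matplotlib.pyplot as plt

# Build a single time axis from the year and month columns
stats = stats_from_koalas.sort_values(["year", "month"])
stats["period"] = stats["year"].astype(str) + "-" + stats["month"].astype(str).str.zfill(2)

# Plot the monthly number of visits
plt.figure(figsize=(10, 4))
plt.plot(stats["period"], stats["count"], marker="o")
plt.xticks(rotation=45)
plt.xlabel("Month")
plt.ylabel("Number of visits")
plt.title("Visit occurrences per month")
plt.tight_layout()
plt.show()
```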
Concept
Most functions developed in the library implement a concept. For the sake of clarity, let us illustrate this notion with an example:
The function `tag_icu_care_site()` can be used to tag a care site as being an ICU or not. We say that it implements the concept `"IS_ICU"` because it adds a column named `"IS_ICU"` to the input DataFrame, as can be seen from the docstring:
"""
Returns
-------
care_site: DataFrame
Dataframe with 1 added column corresponding to the following concept:
- 'IS_ICU'
"""
A function can also implement several concepts at once. For instance, a diabetes-tagging function returns an event DataFrame with the type of diabetes in a `concept` column, and the corresponding ICD-10 code in a `value` column:
"""
Returns
-------
DataFrame
Event DataFrame in **long** format (with a `concept` and a `value` column).
The `concept` column contains one of the following:
- DIABETES_TYPE_I
- DIABETES_TYPE_II
- DIABETES_MALNUTRITION
- DIABETES_IN_PREGNANCY
- OTHER_DIABETES_MELLITUS
- DIABETES_INSIPIDUS
The `value` column contains the corresponding ICD-10 code that was extracted
"""
Question
Check this link for a (very) quick explanation if you aren't familiar with Long vs Wide data format.
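To make the distinction concrete, here is a small Pandas sketch (the column names mirror the docstring above; the data and ICD-10 values are purely illustrative):

```python
import pandas as pd

# A tiny long-format event table: one row per (person, concept) pair
events_long = pd.DataFrame(
    {
        "person_id": [1, 1, 2],
        "concept": ["DIABETES_TYPE_I", "DIABETES_IN_PREGNANCY", "DIABETES_TYPE_II"],
        "value": ["E10", "O24", "E11"],
    }
)

# The same information in wide format: one column per concept
events_wide = events_long.pivot(index="person_id", columns="concept", values="value")
```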
Algo
Most functions also have an argument called `algo`, which allows you to choose how a specific concept will be implemented in a function. Let's check the docstring of the same function `tag_icu_care_site()`:
"""
Parameters
----------
care_site: DataFrame
algo: str
Possible values are:
- `"from_authorisation_type"`
- `"from_regex_on_care_site_description"`
"""
"from_authorisation_type"
is the default algo
, used if the algo
argument isn't filled by the user.
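For illustration only, a call selecting an algorithm could look like the following sketch (the import path is an assumption; the parameter values come from the docstring above):

```python
from eds_scikit.icu import tag_icu_care_site  # import path is an assumption

# Without the algo argument, the default "from_authorisation_type" is used
care_site_tagged = tag_icu_care_site(care_site)

# Explicitly selecting the regex-based algorithm instead
care_site_tagged = tag_icu_care_site(
    care_site,
    algo="from_regex_on_care_site_description",
)
```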
In the documentation, the different `"algo"` values will be displayed as tabs, along with a short description and optional algo-dependent parameters:
Available algorithms (values for `"algo"`)

First tab: This `"algo"` is used by default. It does yadi yada.
Specific parameters:
- This first parameter
- This second parameter
- And also this third one

Second tab: This second `"algo"` works differently. It has no additional parameters.
Please check the available algos when using a function from eds-scikit, to understand what each of them does and which one fits you best.