Skip to content

Biology

The biology module of eds-scikit supports data scientists working on biological data. Its main objectives are to:

  • Extract meaningful biological parameters from biological raw data for data analysis
  • Manage outliers
  • Provide data visualization tools

Quick start

This is just a quick overview of what you can do with the biology module.

1. Load your data

First, you need to load your data. As detailed in the dedicated section, eds-scikit is expecting to work with Pandas or Koalas DataFrames. We provide various connectors to facilitate data fetching, namely a Hive connector, a Postgres connector and a Pandas connector.

Big cohort

If your cohort size is big, we highly recommend the Hive connector.

from eds_scikit.io import HiveData

db_name = "cse_xxxxxxx_xxxxxxx" # (1)
tables =  [
    "care_site",
    "concept",
    "concept_relationship",
    "measurement",
    "visit_occurrence",
]

data = HiveData(db_name, tables_to_load=tables) # (2)
  1. The data must be in OMOP format
  2. Tables are loaded as Koalas DataFrames
from eds_scikit.io import PostgresData

db_name = "cse_xxxxxxx_xxxxxxx"
schema = "my_schema"
user = "my_username"

data = PostgresData(db_name, schema=schema, user=user) # (1)
  1. This connector expects a .pgpass file storing the connection parameters
from eds_scikit.io import PandasData

folder = "my_folder_path"

data = PandasData(folder)

2. Clean the measurements

from eds_scikit.biology import bioclean

bioclean(data, start_date="2020-01-01", end_date="2021-12-31")

data.bioclean.head()
concepts_set LOINC_concept_code LOINC_concept_name AnaBio_concept_code AnaBio_concept_name transformed_unit transformed_value max_threshold min_threshold outlier value_source_value unit_source_value
EntityA_Blood_Quantitative 000-0 EntityA #Bld A0000 EntityA_Blood x10*9/l 115 190 0 False 115 x10*9/l x10*9/l
EntityA_Blood_Quantitative 000-1 EntityA_Blood_Vol A0001 EntityA_Blood_g/l x10*9/l 220 190 0 True 560 g/l g/l
EntityB_Blood_Quantitative 001-0 EntityB_Blood B0000 EntityB_Blood_artery mmol 0.45 8.548 0.542 True 0.45 mmol mmol
EntityB_Blood_Quantitative 001-0 EntityB_Blood B0001 EntityB_Blood_vein mmol 4.52 8.548 0.542 False 4.52 mmol mmol
... ... ... ... ... ... ... ... ... ... ... ...

For more details, have a look on the dedicated section.

3. Visualize statistical summary

from eds_scikit.biology import plot_biology_summary

plot_biology_summary(data)

It creates a folder with different plots for each concepts-set. For more details, have a look on the dedicated section.