eds_scikit.event.from_code

event_from_code

event_from_code(df: DataFrame, columns: Dict[str, str], visit_occurrence: Optional[DataFrame] = None, concept: str = 'ICD10', codes: Optional[Dict[str, Union[str, List[str]]]] = None, date_from_visit: bool = True, additional_filtering: Dict[str, Any] = dict(), date_min: Optional[datetime] = None, date_max: Optional[datetime] = None) -> DataFrame

Generic function to filter a DataFrame based on one of its columns and a set of codes to select from.

For instance, this function is called when phenotyping via ICD-10 or CCAM.

PARAMETER DESCRIPTION
df

The DataFrame to filter.

TYPE: DataFrame

columns

Dictionary with the following keys:

  • code_source_value : The column name containing the code to filter
  • code_start_datetime : The column name containing the starting date
  • code_end_datetime : The column name containing the ending date

TYPE: Dict[str, str]

visit_occurrence

The visit_occurrence DataFrame, only necessary if date_from_visit is set to True.

TYPE: Optional[DataFrame] DEFAULT: None

concept

The name of the extracted concept

TYPE: str DEFAULT: 'ICD10'

codes

Dictionary whose values are codes (a single string or a list of strings) and whose keys are one or more of the following:

  • exact: To match the codes in codes["exact"] exactly
  • prefix: To match the codes in codes["prefix"] as prefixes
  • regex: To match the codes in codes["regex"] as regexes

You can combine any of those keys.

TYPE: Dict[str, Union[str, List[str]]] DEFAULT: None
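
As an illustration of how the three keys combine, the sketch below rebuilds the pattern with the same templates as the source listing on this page. The helper name `build_code_regex` and the sample codes are hypothetical, and plain `re.search` stands in for the `Series.str.contains` call used by the library.

```python
import re
from typing import Dict, List, Union

# Templates copied from the source listing on this page.
D_FORMAT = {"exact": r"{code}\b", "regex": r"{code}", "prefix": r"\b{code}"}


def build_code_regex(codes: Dict[str, Union[str, List[str]]]) -> str:
    """Hypothetical helper: merge every entry of `codes` into one alternation."""
    groups = []
    for code_type, code_list in codes.items():
        if isinstance(code_list, str):  # a single code may be given as a string
            code_list = [code_list]
        formatted = [D_FORMAT[code_type].format(code=code) for code in code_list]
        groups.append("(?:" + "|".join(formatted) + ")")
    return "|".join(groups)


# Made-up selection: one exact code and two prefixes.
codes = {"exact": "E10", "prefix": ["E11", "E13"]}
pattern = build_code_regex(codes)

values = ["E10", "E109", "E1190", "I10"]
# A search anywhere in the string, like `str.contains`.
matched = [value for value in values if re.search(pattern, value)]

# The "exact" template only anchors the end of the code, so a longer
# string that ends with the code is matched as well.
matches_embedded = bool(re.search(pattern, "XE10"))
```

Here `matched` keeps `E10` (exact) and `E1190` (prefix `E11`) but not `E109`, since the exact template requires a word boundary right after the code.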

date_from_visit

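If set to True, uses the visit's visit_start_datetime and visit_end_datetime as the code's start and end datetimes (t_start and t_end). Otherwise, the code_start_datetime and code_end_datetime columns named in columns are used.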

TYPE: bool DEFAULT: True
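
The visit-based dating amounts to an inner join on visit_occurrence_id followed by a rename, as in the source listing. A minimal sketch with pandas, assuming toy tables whose values and `condition_source_value` column name are made up for illustration:

```python
import pandas as pd

# Toy code table; the third row points to a visit that does not exist.
codes_df = pd.DataFrame(
    {
        "person_id": [1, 2, 3],
        "visit_occurrence_id": [10, 20, 30],
        "condition_source_value": ["E10", "E1190", "E13"],
    }
)
visit_occurrence = pd.DataFrame(
    {
        "visit_occurrence_id": [10, 20],
        "visit_start_datetime": pd.to_datetime(["2020-01-01", "2021-06-15"]),
        "visit_end_datetime": pd.to_datetime(["2020-01-05", "2021-06-16"]),
    }
)

visit_columns = ["visit_occurrence_id", "visit_start_datetime", "visit_end_datetime"]

# Inner join: codes without a matching visit are dropped.
event = codes_df.merge(
    visit_occurrence[visit_columns],
    on="visit_occurrence_id",
    how="inner",
).rename(columns={"visit_start_datetime": "t_start", "visit_end_datetime": "t_end"})
```

Because the join is an inner join, a code whose visit_occurrence_id is absent from visit_occurrence does not appear in the result.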

additional_filtering

An optional dictionary to filter the resulting DataFrame. Keys should be column names on which to filter, and values should be either:

  • A single value
  • A list or set of values.

TYPE: Dict[str, Any] DEFAULT: dict()
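
The internal helper that applies this dictionary is not shown on this page, so the following is only a sketch of the described semantics: equality for a single value, membership for a list or set. The function `filter_columns`, the column names, and the data are all hypothetical.

```python
from typing import Any, Dict

import pandas as pd


def filter_columns(df: pd.DataFrame, filtering_dict: Dict[str, Any]) -> pd.DataFrame:
    """Hypothetical stand-in for the library's filtering step."""
    for column, allowed in filtering_dict.items():
        if isinstance(allowed, (list, set, tuple)):
            # A collection keeps rows whose value is one of its members.
            df = df[df[column].isin(list(allowed))]
        else:
            # A single value keeps rows equal to it.
            df = df[df[column] == allowed]
    return df


event = pd.DataFrame(
    {
        "value": ["E10", "E1190", "E13"],
        "condition_status_source_value": ["DP", "DAS", "DR"],
        "cdm_source": ["ORBIS", "ORBIS", "AREM"],
    }
)

filtered = filter_columns(
    event,
    {"condition_status_source_value": {"DP", "DR"}, "cdm_source": "ORBIS"},
)
```

Filters on several columns are combined with a logical AND in this sketch, so only the first row survives.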

date_min

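The minimum code datetime to keep (inclusive). The bound is compared with t_start, which is visit_start_datetime when date_from_visit is True and the code_start_datetime column otherwise.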

TYPE: Optional[datetime] DEFAULT: None

date_max

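The maximum code datetime to keep (inclusive). The bound is compared with t_start, which is visit_start_datetime when date_from_visit is True and the code_start_datetime column otherwise.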

TYPE: Optional[datetime] DEFAULT: None
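
The two bounds can be pictured as an optional, inclusive window on t_start, following the masking logic of the source listing. The sketch below assumes pandas; the helper name `filter_by_date` and the sample rows are hypothetical.

```python
from datetime import datetime
from typing import Optional

import pandas as pd


def filter_by_date(
    event: pd.DataFrame,
    date_min: Optional[datetime] = None,
    date_max: Optional[datetime] = None,
) -> pd.DataFrame:
    """Hypothetical helper: keep rows whose t_start lies in [date_min, date_max]."""
    mask = True  # stays a plain bool when no bound is given
    if date_min is not None:
        mask = mask & (event.t_start >= date_min)
    if date_max is not None:
        mask = mask & (event.t_start <= date_max)
    if not isinstance(mask, bool):  # at least one bound produced a Series mask
        event = event[mask]
    return event


event = pd.DataFrame(
    {
        "value": ["a", "b", "c", "d"],
        "t_start": pd.to_datetime(
            ["2019-12-31", "2020-01-01", "2020-12-31", "2021-01-01"]
        ),
    }
)

kept = filter_by_date(event, datetime(2020, 1, 1), datetime(2020, 12, 31))
```

Both bounds are inclusive here, and passing neither returns the DataFrame unchanged.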

RETURNS DESCRIPTION
DataFrame

A DataFrame containing, in particular, the following columns:

  • t_start
  • t_end
  • concept : The provided concept string
  • value : The matched code
Source code in eds_scikit/event/from_code.py
def event_from_code(
    df: DataFrame,
    columns: Dict[str, str],
    visit_occurrence: Optional[DataFrame] = None,
    concept: str = "ICD10",
    codes: Optional[Dict[str, Union[str, List[str]]]] = None,
    date_from_visit: bool = True,
    additional_filtering: Dict[str, Any] = dict(),
    date_min: Optional[datetime] = None,
    date_max: Optional[datetime] = None,
) -> DataFrame:
    """
    Generic function to filter a DataFrame based on one of its columns and a set of codes to select from.

    For instance, this function is called when phenotyping via ICD-10 or CCAM.

    Parameters
    ----------
    df : DataFrame
        The DataFrame to filter.
    columns : Dict[str, str]
        Dictionary with the following keys:

        - `code_source_value` : The column name containing the code to filter
        - `code_start_datetime` : The column name containing the starting date
        - `code_end_datetime` : The column name containing the ending date
    visit_occurrence : Optional[DataFrame]
        The `visit_occurrence` DataFrame, only necessary if `date_from_visit` is set to `True`.
    concept : str
        The name of the extracted concept
    codes : Dict[str, Union[str, List[str]]]
        Dictionary whose values are codes (a single string or a list of strings) and whose keys are
        one or more of the following:

        - `exact`: To match the codes in `codes["exact"]` **exactly**
        - `prefix`: To match the codes in `codes["prefix"]` **as prefixes**
        - `regex`: To match the codes in `codes["regex"]` **as regexes**

        You can combine any of those keys.
    date_from_visit : bool
        If set to `True`, uses `visit_start_datetime` as the code datetime
    additional_filtering : Dict[str, Any]
        An optional dictionary to filter the resulting DataFrame.
        Keys should be column names on which to filter, and values should be either

        - A single value
        - A list or set of values.
    date_min : Optional[datetime]
        The minimum code datetime to keep. **Depends on the `date_from_visit` flag**
    date_max : Optional[datetime]
        The maximum code datetime to keep. **Depends on the `date_from_visit` flag**

    Returns
    -------
    DataFrame
        A DataFrame containing, in particular, the following columns:

        - `t_start`
        - `t_end`
        - `concept` : The provided `concept` string
        - `value` : The matched code

    """

    required_columns = list(columns.values()) + ["visit_occurrence_id", "person_id"]
    check_columns(df, required_columns=required_columns)

    d_format = {"exact": r"{code}\b", "regex": r"{code}", "prefix": r"\b{code}"}
    regexes = []

    for code_type, code_list in codes.items():

        if isinstance(code_list, str):
            code_list = [code_list]
        codes_formated = [d_format[code_type].format(code=code) for code in code_list]
        regexes.append(r"(?:" + "|".join(codes_formated) + ")")

    final_regex = "|".join(regexes)

    mask = df[columns["code_source_value"]].str.contains(final_regex).fillna(False)

    event = df[mask]

    if date_from_visit:
        if visit_occurrence is None:
            raise ValueError(
                "With 'date_from_visit=True', you should provide a 'visit_occurrence' DataFrame."
            )
        event = event.merge(
            visit_occurrence[
                ["visit_occurrence_id", "visit_start_datetime", "visit_end_datetime"]
            ],
            on="visit_occurrence_id",
            how="inner",
        ).rename(
            columns={
                "visit_start_datetime": "t_start",
                "visit_end_datetime": "t_end",
            }
        )

    else:
        event.loc[:, "t_start"] = event.loc[:, columns["code_start_datetime"]]
        event.loc[:, "t_end"] = event.loc[:, columns["code_end_datetime"]]
        event = event.drop(
            columns=[columns["code_start_datetime"], columns["code_end_datetime"]]
        )

    event = _column_filtering(event, filtering_dict=additional_filtering)

    mask = True  # Resetting the mask

    if date_min is not None:
        mask = mask & (event.t_start >= date_min)

    if date_max is not None:
        mask = mask & (event.t_start <= date_max)

    if type(mask) != bool:  # We have a Series mask
        event = event[mask]

    event.loc[:, "concept"] = concept
    return event.rename(columns={columns["code_source_value"]: "value"})[
        [
            "person_id",
            "t_start",
            "t_end",
            "concept",
            "value",
            "visit_occurrence_id",
        ]
        + list(additional_filtering.keys())
    ].reset_index(drop=True)