
edsnlp.processing.wrapper

pipe(note, nlp, how='parallel', additional_spans='discarded', extensions=[], **kwargs)

Function to apply a spaCy pipe to a pandas or pyspark DataFrame

PARAMETER DESCRIPTION
note

A pandas or pyspark DataFrame with a note_id and a note_text column

TYPE: Union[pd.DataFrame, ps.DataFrame]

nlp

A spaCy pipe

TYPE: Language

how

Three methods are available (see the sketch after this list):

  • how='simple': Single process on a pandas DataFrame
  • how='parallel': Parallelised processes on a pandas DataFrame
  • how='spark': Distributed processes on a pyspark DataFrame

TYPE: str DEFAULT: 'parallel'
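
A minimal sketch of the three modes; note_df, note_sdf and nlp are illustrative placeholders:

from edsnlp.processing.wrapper import pipe

# note_df is a pandas DataFrame and note_sdf a pyspark DataFrame, both
# with note_id and note_text columns; nlp is a spaCy pipeline object.
ents = pipe(note_df, nlp, how="simple")    # single process, pandas input
ents = pipe(note_df, nlp, how="parallel")  # parallelised, pandas input
ents = pipe(note_sdf, nlp, how="spark")    # distributed, pyspark input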

additional_spans

The name (or list of names) of the span group(s) on which to apply the pipe: span groups are available as doc.spans[spangroup_name] and can be generated by some pipes. For instance, the date pipe populates doc.spans['dates']

TYPE: Union[List[str], str] DEFAULT: 'discarded'

extensions

Span extensions to add to the extracted results. For instance, if extensions=["score_name"], the extracted result will include, for each entity, ent._.score_name. With how='spark', extensions must instead be a dictionary mapping each extension name to its pyspark type (see the sketch below).

TYPE: List[Tuple[str, T.DataType]] DEFAULT: []
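
A hedged sketch of the two accepted extension formats; score_name is a hypothetical extension, assumed to be registered on spaCy spans beforehand:

import pyspark.sql.types as T

# With how="simple" or how="parallel": a plain list of extension names.
ents = pipe(note_df, nlp, how="parallel", extensions=["score_name"])

# With how="spark": a dict mapping each extension name to its pyspark
# type, as enforced by the check in the source code below.
ents = pipe(note_sdf, nlp, how="spark", extensions={"score_name": T.FloatType()})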

kwargs

Additional parameters, forwarded to the underlying implementation selected by the how argument.

TYPE: Dict[str, Any]

RETURNS DESCRIPTION
Union[pd.DataFrame, ps.DataFrame]

A DataFrame with one row per extraction
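
A hedged end-to-end example on a pandas DataFrame. eds.sentences and eds.dates ship with EDS-NLP (the date pipe populates doc.spans['dates'], as noted above); the note content is illustrative:

import pandas as pd
import spacy

from edsnlp.processing.wrapper import pipe

nlp = spacy.blank("fr")
nlp.add_pipe("eds.sentences")
nlp.add_pipe("eds.dates")  # populates doc.spans["dates"]

note = pd.DataFrame(
    {
        "note_id": [1],
        "note_text": ["Le patient est venu consulter le 3 mars 2021."],
    }
)

# One row per extracted date span
ents = pipe(note, nlp, how="simple", additional_spans="dates")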

Source code in edsnlp/processing/wrapper.py
def pipe(
    note: Union[pd.DataFrame, ps.DataFrame],
    nlp: Language,
    how: str = "parallel",
    additional_spans: Union[List[str], str] = "discarded",
    extensions: ExtensionSchema = [],
    **kwargs: Dict[str, Any],
) -> Union[pd.DataFrame, ps.DataFrame]:
    """
    Function to apply a spaCy pipe to a pandas or pyspark DataFrame


    Parameters
    ----------
    note : Union[pd.DataFrame, ps.DataFrame]
        A pandas or pyspark DataFrame with a `note_id` and a `note_text` column
    nlp : Language
        A spaCy pipe
    how : str, by default "parallel"
        3 methods are available here:

        - `how='simple'`: Single process on a pandas DataFrame
        - `how='parallel'`: Parallelised processes on a pandas DataFrame
        - `how='spark'`: Distributed processes on a pyspark DataFrame
    additional_spans : Union[List[str], str], by default "discarded"
        A name (or list of names) of the span group(s) on which to apply
        the pipe: span groups are available as `doc.spans[spangroup_name]`
        and can be generated by some pipes. For instance, the `date` pipe
        populates `doc.spans['dates']`
    extensions : List[Tuple[str, T.DataType]], by default []
        Span extensions to add to the extracted results.
        For instance, if `extensions=["score_name"]`, the extracted result
        will include, for each entity, `ent._.score_name`.
    kwargs : Dict[str, Any]
        Additional parameters depending on the `how` argument.

    Returns
    -------
    Union[pd.DataFrame, ps.DataFrame]
        A DataFrame with one row per extraction
    """

    if isinstance(note, ps.DataFrame) and how != "spark":
        raise ValueError(
            "You are providing a pyspark DataFrame, please use `how='spark'`"
        )
    if how == "simple":

        return simple_pipe(
            note=note,
            nlp=nlp,
            additional_spans=additional_spans,
            extensions=extensions,
            **kwargs,
        )

    if how == "parallel":

        return parallel_pipe(
            note=note,
            nlp=nlp,
            additional_spans=additional_spans,
            extensions=extensions,
            **kwargs,
        )

    if how == "spark":
        if isinstance(note, pd.DataFrame):
            raise ValueError(
                """
                You are providing a pandas DataFrame with `how='spark'`,
                which is incompatible.
                """
            )

        if extensions and not isinstance(extensions, dict):
            raise ValueError(
                """
                When using Spark, you should provide extension names
                along with the extension type (as a dictionary):
                `d[extension_name] = extension_type`
                """  # noqa W291
            )

        return spark_pipe(
            note=note,
            nlp=nlp,
            additional_spans=additional_spans,
            extensions=extensions,
            **kwargs,
        )