
edsnlp.processing.wrapper

pipe(note, nlp, n_jobs=-2, additional_spans='discarded', extensions=[], **kwargs)

Apply a spaCy pipeline to a pandas, PySpark or Koalas DataFrame.

PARAMETER DESCRIPTION
note

A pandas/pyspark/koalas DataFrame with note_id and note_text columns

TYPE: DataFrame

nlp

A spaCy pipeline (Language object)

TYPE: Language

n_jobs

Only used when providing a Pandas DataFrame

  • n_jobs=1 corresponds to simple_pipe
  • n_jobs>1 corresponds to parallel_pipe with n_jobs parallel workers
  • n_jobs=-1 corresponds to parallel_pipe with the maximum number of workers
  • n_jobs=-2 corresponds to parallel_pipe with the maximum number of workers minus one

TYPE: int DEFAULT: -2
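
As an illustration of the mapping above, here is a minimal sketch of how an n_jobs value could translate into a backend choice and a worker count. The helpers resolve_workers and describe_dispatch are hypothetical and not part of edsnlp. The handling of values below -2 assumes the joblib-style convention suggested by the -1 and -2 cases.

```python
import os
from typing import Optional, Tuple


def resolve_workers(n_jobs: int, cpu_count: Optional[int] = None) -> int:
    """Hypothetical helper (not part of edsnlp): worker count for a given n_jobs."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs must be a non-zero integer")
    if n_jobs > 0:
        return n_jobs
    # Assumption: negative values count back from the CPU count,
    # so -1 means all CPUs and -2 means all CPUs but one.
    return max(1, cpu_count + 1 + n_jobs)


def describe_dispatch(n_jobs: int, cpu_count: int) -> Tuple[str, int]:
    """Name the function that pipe() would call for a pandas DataFrame."""
    backend = "simple_pipe" if n_jobs == 1 else "parallel_pipe"
    return backend, resolve_workers(n_jobs, cpu_count)


print(describe_dispatch(1, cpu_count=8))
print(describe_dispatch(-2, cpu_count=8))
```

With the default n_jobs=-2, one CPU is left free for the rest of the system, which is often a reasonable choice on a shared machine.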

additional_spans

A name (or list of names) of the SpanGroup(s) to which the pipe is also applied. Span groups are available as doc.spans[spangroup_name] and can be generated by some pipes. For instance, the date pipe populates doc.spans['dates'].

TYPE: Union[List[str], str] DEFAULT: 'discarded'
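
Because additional_spans accepts either a single name or a list of names, downstream code typically needs one uniform shape. The sketch below shows one way to normalise the value; as_span_group_names is a hypothetical helper, not an edsnlp function.

```python
from typing import List, Union


def as_span_group_names(additional_spans: Union[List[str], str]) -> List[str]:
    """Hypothetical helper (not part of edsnlp): always return a list of names."""
    if isinstance(additional_spans, str):
        # A single name such as the default "discarded" becomes a one-item list.
        return [additional_spans]
    return list(additional_spans)


print(as_span_group_names("discarded"))
print(as_span_group_names(["dates", "discarded"]))
```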

extensions

Span extensions to add to the extracted results. For instance, if extensions=["score_name"], the extracted result will include, for each entity, ent._.score_name. With Spark or Koalas, provide a dictionary mapping each extension name to its type.

TYPE: List[Tuple[str, T.DataType]] DEFAULT: []
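
The expected shape of extensions depends on the DataFrame backend: the source code of pipe raises a ValueError when a non-empty, non-dictionary value is passed with Spark or Koalas. The sketch below reproduces that check in isolation. check_extensions and its distributed flag are hypothetical names, and the string "string" is only a placeholder for a real Spark data type.

```python
from typing import Any


def check_extensions(extensions: Any, distributed: bool) -> None:
    """Hypothetical stand-alone version of the validation done in pipe()."""
    if distributed and extensions and not isinstance(extensions, dict):
        raise ValueError(
            "With Spark or Koalas, provide extensions as a dictionary: "
            "{extension_name: extension_type}"
        )


# pandas: a list of extension names is enough.
check_extensions(["score_name"], distributed=False)

# Spark/Koalas: map each name to its type. The value "string" is a
# placeholder for illustration; the documented type is T.DataType.
check_extensions({"score_name": "string"}, distributed=True)

# Spark/Koalas with a plain list is rejected.
try:
    check_extensions(["score_name"], distributed=True)
except ValueError as error:
    print(f"Rejected: {error}")
```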

kwargs

Additional keyword arguments, forwarded to simple_pipe, parallel_pipe or the distributed pipe.

TYPE: Dict[str, Any]

RETURNS DESCRIPTION
DataFrame

A DataFrame with one row per extraction

Source code in edsnlp/processing/wrapper.py
def pipe(
    note: DataFrames,
    nlp: Language,
    n_jobs: int = -2,
    additional_spans: Union[List[str], str] = "discarded",
    extensions: ExtensionSchema = [],
    **kwargs: Dict[str, Any],
) -> DataFrames:
    """
    Function to apply a spaCy pipe to a pandas or pyspark DataFrame


    Parameters
    ----------
    note : DataFrame
        A pandas/pyspark/koalas DataFrame with a `note_id` and `note_text` column
    nlp : Language
        A spaCy pipe
    n_jobs : int, by default -2
        Only used when providing a Pandas DataFrame

        - `n_jobs=1` corresponds to `simple_pipe`
        - `n_jobs>1` corresponds to `parallel_pipe` with `n_jobs` parallel workers
        - `n_jobs=-1` corresponds to `parallel_pipe` with the maximum number of workers
        - `n_jobs=-2` corresponds to `parallel_pipe` with the maximum number of workers
          minus one
    additional_spans : Union[List[str], str], by default "discarded"
        A name (or list of names) of the SpanGroup(s) to which the pipe is also
        applied. Span groups are available as `doc.spans[spangroup_name]` and can
        be generated by some pipes. For instance, the `date` pipe populates
        `doc.spans['dates']`.
    extensions : List[Tuple[str, T.DataType]], by default []
        Span extensions to add to the extracted results.
        For instance, if `extensions=["score_name"]`, the extracted result
        will include, for each entity, `ent._.score_name`.
    kwargs : Dict[str, Any]
        Additional keyword arguments, forwarded to `simple_pipe`, `parallel_pipe`
        or the distributed `pipe`.

    Returns
    -------
    DataFrame
        A DataFrame with one line per extraction
    """

    module = get_module(note)

    if module == DataFrameModules.PANDAS:
        if n_jobs == 1:

            return simple_pipe(
                note=note,
                nlp=nlp,
                additional_spans=additional_spans,
                extensions=extensions,
                **kwargs,
            )

        else:

            return parallel_pipe(
                note=note,
                nlp=nlp,
                additional_spans=additional_spans,
                extensions=extensions,
                n_jobs=n_jobs,
                **kwargs,
            )

    # Spark and Koalas need explicit types, so a non-empty list is rejected
    if extensions and not isinstance(extensions, dict):
        raise ValueError(
            """
            When using Spark or Koalas, you should provide extension names
            along with the extension type (as a dictionary):
            `d[extension_name] = extension_type`
            """  # noqa W291
        )

    from .distributed import pipe as distributed_pipe

    return distributed_pipe(
        note=note,
        nlp=nlp,
        additional_spans=additional_spans,
        extensions=extensions,
        **kwargs,
    )