
edsnlp.processing.wrapper

pipe(note, nlp, n_jobs=-2, context=[], additional_spans='discarded', extensions=[], **kwargs)

Function to apply a spaCy pipe to a pandas or pyspark DataFrame

PARAMETER DESCRIPTION
note

A pandas/pyspark/koalas DataFrame with `note_id` and `note_text` columns

TYPE: DataFrame

nlp

A spaCy pipe

TYPE: Language

context

A list of columns to add to the generated spaCy document as extensions. For instance, if `context=["note_datetime"]`, the corresponding value found in the `note_datetime` column will be stored in `doc._.note_datetime`, which can be useful e.g. for the `dates` pipeline.

TYPE: List[str] DEFAULT: []

n_jobs

Only used when providing a Pandas DataFrame

  • n_jobs=1 corresponds to simple_pipe
  • n_jobs>1 corresponds to parallel_pipe with n_jobs parallel workers
  • n_jobs=-1 corresponds to parallel_pipe with the maximum number of workers
  • n_jobs=-2 corresponds to parallel_pipe with the maximum number of workers minus one

TYPE: int, by default -2 DEFAULT: -2
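The negative-`n_jobs` values follow the usual joblib-style convention. As a minimal sketch (the helper name is hypothetical, not part of edsnlp), the effective worker count could be resolved like this:

```python
import os


def resolve_n_workers(n_jobs: int) -> int:
    """Resolve an n_jobs value into an effective worker count,
    following the joblib-style convention sketched above."""
    cpus = os.cpu_count() or 1
    if n_jobs > 0:
        return n_jobs
    # n_jobs=-1 -> all CPUs, n_jobs=-2 -> all CPUs minus one, etc.
    return max(1, cpus + 1 + n_jobs)
```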

additional_spans

A name (or list of names) of the span groups on which to apply the pipe: span groups are available as `doc.spans[spangroup_name]` and can be generated by some pipes. For instance, the `dates` pipe populates `doc.spans['dates']`

TYPE: Union[List[str], str], by default "discarded" DEFAULT: 'discarded'

extensions

Span extensions to add to the extracted results. For instance, if `extensions=["score_name"]`, the extracted result will include, for each entity, `ent._.score_name`.

TYPE: List[Tuple[str, T.DataType]], by default [] DEFAULT: []
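With pandas, extensions can be a plain list of extension names; with Spark or Koalas (see the source below), they must instead be a dictionary mapping each extension name to its data type, so that the output schema can be built. A minimal sketch of that validation (the function name is hypothetical):

```python
from typing import Any, Dict, List, Union


def validate_extensions(
    extensions: Union[List[str], Dict[str, Any]],
    distributed: bool,
) -> Union[List[str], Dict[str, Any]]:
    """Mirror the check in `pipe`: distributed backends need typed extensions."""
    if distributed and extensions and not isinstance(extensions, dict):
        raise ValueError(
            "When using Spark or Koalas, provide extensions as a dictionary: "
            "{extension_name: extension_type}"
        )
    return extensions
```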

kwargs

Additional parameters passed to the underlying pipe function (simple_pipe, parallel_pipe, or the distributed pipe), depending on the type of DataFrame provided.

TYPE: Dict[str, Any]

RETURNS DESCRIPTION
DataFrame

A DataFrame with one line per extraction
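Putting the parameters together, the dispatch documented above can be summarized in a small sketch (backend names are taken from the source below; the helper itself is hypothetical):

```python
def choose_backend(module: str, n_jobs: int) -> str:
    """Mirror pipe's dispatch: pandas DataFrames are processed locally
    (serially or in parallel), everything else goes to the distributed pipe."""
    if module == "pandas":
        return "simple_pipe" if n_jobs == 1 else "parallel_pipe"
    # n_jobs is ignored for Spark/Koalas inputs
    return "distributed_pipe"
```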

Source code in edsnlp/processing/wrapper.py
def pipe(
    note: DataFrames,
    nlp: Language,
    n_jobs: int = -2,
    context: List[str] = [],
    additional_spans: Union[List[str], str] = "discarded",
    extensions: ExtensionSchema = [],
    **kwargs: Dict[str, Any],
) -> DataFrames:
    """
    Function to apply a spaCy pipe to a pandas or pyspark DataFrame


    Parameters
    ----------
    note : DataFrame
        A pandas/pyspark/koalas DataFrame with a `note_id` and `note_text` column
    nlp : Language
        A spaCy pipe
    context : List[str]
        A list of columns to add to the generated spaCy document as extensions.
        For instance, if `context=["note_datetime"]`, the corresponding value found
        in the `note_datetime` column will be stored in `doc._.note_datetime`,
        which can be useful e.g. for the `dates` pipeline.
    n_jobs : int, by default -2
        Only used when providing a Pandas DataFrame

        - `n_jobs=1` corresponds to `simple_pipe`
        - `n_jobs>1` corresponds to `parallel_pipe` with `n_jobs` parallel workers
        - `n_jobs=-1` corresponds to `parallel_pipe` with the maximum number of workers
        - `n_jobs=-2` corresponds to `parallel_pipe` with the maximum number of
          workers minus one
    additional_spans : Union[List[str], str], by default "discarded"
        A name (or list of names) of the span groups on which to apply the pipe:
        span groups are available as `doc.spans[spangroup_name]` and can be generated
        by some pipes. For instance, the `dates` pipe populates `doc.spans['dates']`
    extensions : List[Tuple[str, T.DataType]], by default []
        Span extensions to add to the extracted results.
        For instance, if `extensions=["score_name"]`, the extracted result
        will include, for each entity, `ent._.score_name`.
    kwargs : Dict[str, Any]
        Additional parameters passed to the underlying pipe function
        (`simple_pipe`, `parallel_pipe`, or the distributed pipe).

    Returns
    -------
    DataFrame
        A DataFrame with one line per extraction
    """

    module = get_module(note)

    if module == DataFrameModules.PANDAS:
        if n_jobs == 1:

            return simple_pipe(
                note=note,
                nlp=nlp,
                context=context,
                additional_spans=additional_spans,
                extensions=extensions,
                **kwargs,
            )

        else:

            return parallel_pipe(
                note=note,
                nlp=nlp,
                context=context,
                additional_spans=additional_spans,
                extensions=extensions,
                n_jobs=n_jobs,
                **kwargs,
            )

    if extensions and not isinstance(extensions, dict):
        raise ValueError(
            """
            When using Spark or Koalas, you should provide extension names
            along with the extension type (as a dictionary):
            `d[extension_name] = extension_type`
            """  # noqa W291
        )

    from .distributed import pipe as distributed_pipe

    return distributed_pipe(
        note=note,
        nlp=nlp,
        context=context,
        additional_spans=additional_spans,
        extensions=extensions,
        **kwargs,
    )