edsnlp.processing.distributed

pyspark_type_finder

Returns, when possible, the PySpark type of a Python object.
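
To make "when possible" concrete, here is a minimal sketch of the idea using only the standard library. It is not the library's implementation: the function name `spark_type_name` and the lookup table are assumptions made for illustration, and the sketch returns Spark SQL type names as strings rather than PySpark `DataType` objects.

```python
import datetime
from typing import Any, Optional

# Hypothetical lookup table: Python type -> Spark SQL type name.
# Order matters: bool is a subclass of int, and datetime is a subclass of date,
# so the more specific type must be checked first.
_SPARK_TYPE_NAMES = [
    (bool, "boolean"),
    (int, "bigint"),
    (float, "double"),
    (str, "string"),
    (datetime.datetime, "timestamp"),
    (datetime.date, "date"),
]


def spark_type_name(obj: Any) -> Optional[str]:
    """Return a Spark SQL type name for obj, or None when no mapping is known."""
    for py_type, name in _SPARK_TYPE_NAMES:
        if isinstance(obj, py_type):
            return name
    return None
```

Returning `None` for unsupported objects mirrors the "when possible" wording: the caller decides what to do when no type can be found.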

pipe

Function to apply a spaCy pipe to a PySpark or Koalas DataFrame, note.

Parameters

PARAMETER DESCRIPTION
note

A PySpark or Koalas DataFrame with note_id and note_text columns

TYPE: DataFrame

nlp

A spaCy pipe

TYPE: Language

context

A list of columns to add to the generated spaCy document as extensions. For instance, if context=["note_datetime"], the corresponding value found in the note_datetime column will be stored in doc._.note_datetime, which can be useful, e.g., for the dates pipeline.

TYPE: List[str] DEFAULT: []

additional_spans

The name (or list of names) of the SpanGroup objects on which to also apply the pipe. SpanGroups are available as doc.spans[spangroup_name] and can be generated by some pipes. For instance, the eds.dates pipeline component populates doc.spans['dates'].

TYPE: Union[List[str], str] DEFAULT: 'discarded'

extensions

Span extensions to add to the extracted results. For instance, if extensions=["score_name"], the extracted result will include ent._.score_name for each entity.

TYPE: List[Tuple[str, T.DataType]] DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

A PySpark DataFrame with one row per extraction
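
A call using the parameters described above might look like the following sketch. It assumes that edsnlp and PySpark are installed, that nlp contains the eds.dates component, and that note has note_id, note_text and note_datetime columns. The wrapper name `extract_with_pipe` and the `PIPE_KWARGS` dictionary are hypothetical helpers, not part of the library.

```python
from typing import Any

# Keyword arguments taken from the parameter descriptions above.
PIPE_KWARGS = {
    # Copy the note_datetime column into doc._.note_datetime.
    "context": ["note_datetime"],
    # Also extract spans from doc.spans["dates"], populated by eds.dates.
    "additional_spans": ["dates"],
}


def extract_with_pipe(note: Any, nlp: Any) -> Any:
    """Apply `nlp` to the `note` DataFrame; returns one row per extraction."""
    # Imported lazily so this sketch can be loaded without Spark installed.
    from edsnlp.processing.distributed import pipe

    return pipe(note, nlp, **PIPE_KWARGS)
```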

custom_pipe

Function to apply a spaCy pipe to a PySpark or Koalas DataFrame, note, using a generic callback function that converts a spaCy Doc object into a list of dictionaries.

Parameters

PARAMETER DESCRIPTION
note

A PySpark or Koalas DataFrame with a note_text column

TYPE: DataFrame

nlp

A spaCy pipe

TYPE: Language

results_extractor

Arbitrary function that extracts serialisable results from the computed spaCy Doc object. The output of the function must be a list of dictionaries containing the extracted spans or entities.

There is no requirement for all entities to provide every dictionary key.

TYPE: Callable[[Doc], List[Dict[str, Any]]]

dtypes

Dictionary containing all expected keys from the results_extractor function, along with their types.

TYPE: Dict[str, DataType]

context

A list of columns to add to the generated spaCy document as extensions. For instance, if context=["note_datetime"], the corresponding value found in the note_datetime column will be stored in doc._.note_datetime, which can be useful, e.g., for the dates pipeline.

TYPE: List[str] DEFAULT: []

RETURNS DESCRIPTION
DataFrame

A PySpark DataFrame with one row per extraction
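
As an illustration of the results_extractor contract, the sketch below returns one dictionary per entity and only includes an optional key when the entity provides it. The dictionary keys, the `EXPECTED_KEYS` set and the stand-in document built with `SimpleNamespace` are all hypothetical; with a real spaCy Doc, the same attributes (`ents`, `label_`, `start_char`, `end_char`, `text`) would be read. In real use, dtypes would map each of these keys to a PySpark DataType.

```python
from types import SimpleNamespace
from typing import Any, Dict, List

# Hypothetical set of keys; dtypes would map each of them to a Spark type.
EXPECTED_KEYS = {"label", "start", "end", "text", "score_name"}


def results_extractor(doc: Any) -> List[Dict[str, Any]]:
    """Convert the entities of a document into serialisable dictionaries."""
    rows = []
    for ent in doc.ents:
        row = {
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
            "text": ent.text,
        }
        # Optional key: only added when the entity carries this extension.
        score_name = getattr(ent._, "score_name", None)
        if score_name is not None:
            row["score_name"] = score_name
        rows.append(row)
    return rows


# Stand-in for a spaCy Doc, used only so the sketch runs without spaCy.
fake_doc = SimpleNamespace(
    ents=[
        SimpleNamespace(label_="drug", start_char=0, end_char=7,
                        text="aspirin", _=SimpleNamespace()),
        SimpleNamespace(label_="score", start_char=12, end_char=17,
                        text="score", _=SimpleNamespace(score_name="example")),
    ]
)
rows = results_extractor(fake_doc)
```

Every key a row may contain appears in `EXPECTED_KEYS`, while individual rows are free to omit some of them.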