`edsnlp.processing.distributed`

`pyspark_type_finder`

Returns (when possible) the PySpark type of any Python object.

`pipe`

Function to apply a spaCy pipe to a PySpark or Koalas DataFrame `note`.

Parameters

PARAMETER	DESCRIPTION
`note`	A PySpark or Koalas DataFrame with `note_id` and `note_text` columns TYPE: `DataFrame`
`nlp`	A spaCy pipe TYPE: `Language`
`context`	A list of columns to add to the generated spaCy document as extensions. For instance, if `context=["note_datetime"]`, the corresponding value found in the `note_datetime` column will be stored in `doc._.note_datetime`, which can be useful e.g. for the `dates` pipeline. TYPE: `List[str]` DEFAULT: `[]`
`additional_spans`	The name (or list of names) of the `SpanGroup`s on which the pipe should also be applied. A `SpanGroup` is available as `doc.spans[spangroup_name]` and can be generated by some pipes. For instance, the `eds.dates` pipeline component populates `doc.spans['dates']`. TYPE: `Union[List[str], str]` DEFAULT: `'discarded'`
`extensions`	Span extensions to add to the extracted results. For instance, if `extensions=["score_name"]`, the extracted result will include, for each entity, `ent._.score_name`. TYPE: `List[Tuple[str, T.DataType]]` DEFAULT: `{}`

RETURNS	DESCRIPTION
`DataFrame`	A PySpark DataFrame with one row per extraction
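
As an illustration of the parameters above, here is a minimal usage sketch. It assumes that `note_df` is a PySpark or Koalas DataFrame with `note_id`, `note_text` and `note_datetime` columns, and that `nlp` is a spaCy pipeline that already contains the `eds.dates` component. The names `PIPE_OPTIONS` and `apply_pipe` are hypothetical helpers introduced for this example only.

```python
# Hypothetical option set, using only the keyword arguments documented above.
PIPE_OPTIONS = {
    # Copy the `note_datetime` column into `doc._.note_datetime`.
    "context": ["note_datetime"],
    # Also extract the spans stored in `doc.spans["dates"]`.
    "additional_spans": ["dates"],
}


def apply_pipe(note_df, nlp):
    """Apply `nlp` to every note of `note_df` (hypothetical helper)."""
    # Imported lazily so that this sketch can be loaded on a machine
    # where Spark and EDS-NLP are not installed.
    from edsnlp.processing.distributed import pipe

    return pipe(note=note_df, nlp=nlp, **PIPE_OPTIONS)
```

Under these assumptions, calling `apply_pipe(note_df, nlp)` would return the DataFrame of extractions described in the table above.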

`custom_pipe`

Function to apply a spaCy pipe to a PySpark or Koalas DataFrame `note`, using a generic callback function that converts a spaCy `Doc` object into a list of dictionaries.

Parameters

PARAMETER	DESCRIPTION
`note`	A PySpark or Koalas DataFrame with a `note_text` column TYPE: `DataFrame`
`nlp`	A spaCy pipe TYPE: `Language`
`results_extractor`	Arbitrary function that extracts serialisable results from the computed spaCy `Doc` object. The output of the function must be a list of dictionaries containing the extracted spans or entities. Entities are not required to provide every dictionary key. TYPE: `Callable[[Doc], List[Dict[str, Any]]]`
`dtypes`	Dictionary containing all expected keys from the `results_extractor` function, along with their types. TYPE: `Dict[str, DataType]`
`context`	A list of columns to add to the generated spaCy document as extensions. For instance, if `context=["note_datetime"]`, the corresponding value found in the `note_datetime` column will be stored in `doc._.note_datetime`, which can be useful e.g. for the `dates` pipeline. TYPE: `List[str]` DEFAULT: `[]`

RETURNS	DESCRIPTION
`DataFrame`	A PySpark DataFrame with one row per extraction
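
To make the `results_extractor` / `dtypes` contract more concrete, here is a minimal sketch. The extractor `extract_entities`, the helper `apply_custom_pipe` and the chosen output keys (`lexical_variant`, `label`, `start`, `end`) are hypothetical and introduced for this example only. The extractor relies solely on standard spaCy attributes (`doc.ents`, `ent.text`, `ent.label_`, `ent.start_char`, `ent.end_char`), and is tried out here on a stand-in object rather than a real `Doc`, so that the snippet runs without spaCy or Spark.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def extract_entities(doc) -> List[Dict[str, Any]]:
    """Turn the entities of a spaCy Doc into serialisable dictionaries."""
    return [
        {
            "lexical_variant": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
        }
        for ent in doc.ents
    ]


def apply_custom_pipe(note_df, nlp):
    """Apply `nlp` to `note_df` with the extractor above (hypothetical helper)."""
    # Imported lazily so that this sketch can be loaded without Spark.
    from pyspark.sql import types as T
    from edsnlp.processing.distributed import custom_pipe

    # One entry per key that the extractor may return.
    dtypes = {
        "lexical_variant": T.StringType(),
        "label": T.StringType(),
        "start": T.IntegerType(),
        "end": T.IntegerType(),
    }
    return custom_pipe(
        note=note_df,
        nlp=nlp,
        results_extractor=extract_entities,
        dtypes=dtypes,
    )


# Stand-in for a spaCy Doc, exposing only the attributes the extractor reads.
fake_doc = SimpleNamespace(
    ents=[SimpleNamespace(text="fever", label_="symptom", start_char=12, end_char=17)]
)
rows = extract_entities(fake_doc)
```

Keeping the extractor a plain function of the `Doc` makes it easy to check locally, as done with `fake_doc`, before running it on a cluster.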