Explode

Explode a Doc into multiple distinct Doc objects, one per span retrieved through the span_getter, so that each span ends up alone in its own Doc. Note that entities that are not selected by the span_getter are lost in the new docs.

Not for pipelines

This component is not meant to be used in a pipeline, but rather as a preprocessing step when dealing with a stream of documents as in the example below.

Difference with eds.split

eds.split breaks a document into smaller chunks based on length or regex rules, whereas eds.explode creates a separate document for each selected span. eds.split is therefore typically used to segment text to fit context-size or processing constraints, while eds.explode is designed for tasks that require span-level mixing, such as training span classifiers: each span is isolated in its own document, and the original context is preserved.
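The contrast can be sketched with a toy model in plain Python. This is not edsnlp's implementation: `ToyDoc`, `toy_split` and `toy_explode` are hypothetical stand-ins that only mimic the behaviour described above (chunking the text versus copying the full text once per selected span).

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple

Span = Tuple[int, int]  # (start_char, end_char)


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc: a text plus labelled span groups."""

    text: str
    spans: Dict[str, List[Span]] = field(default_factory=dict)


def toy_split(doc: ToyDoc, max_chars: int) -> Iterator[ToyDoc]:
    """Split-like behaviour: cut the text into chunks of at most max_chars."""
    for start in range(0, len(doc.text), max_chars):
        yield ToyDoc(doc.text[start : start + max_chars])


def toy_explode(doc: ToyDoc, label: str) -> Iterator[ToyDoc]:
    """Explode-like behaviour: one full-text copy per selected span."""
    for span in doc.spans.get(label, []):
        yield ToyDoc(doc.text, {label: [span]})


text = "Le patient a mal au bras, à la jambe et au torse"
# Locate each word so the character offsets are computed, not hard-coded
body_parts = [
    (text.index(word), text.index(word) + len(word))
    for word in ("bras", "jambe", "torse")
]
doc = ToyDoc(text, {"body_part": body_parts})

chunks = list(toy_split(doc, max_chars=20))  # shorter texts, spans dropped
exploded = list(toy_explode(doc, "body_part"))  # full text, one span each

for new_doc in exploded:
    (start, end), = new_doc.spans["body_part"]
    print(new_doc.text[start:end], "| context kept:", new_doc.text == text)
```

In this toy model, splitting changes the text of each output, while exploding leaves the text untouched and only narrows the spans.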

Examples

import edsnlp.pipes as eds
from edsnlp.data.converters import MarkupToDocConverter

converter = MarkupToDocConverter(
    preset="xml",
    # Put xml annotated spans in distinct doc.spans[label] groups
    span_setter={"*": True},
)
doc = converter(
    "Le <person>patient</person> a mal au <body_part>bras</body_part>, à la "
    "<body_part>jambe</body_part> et au <body_part>torse</body_part>"
)

exploder = eds.explode(span_getter=["body_part"])
print(doc.text, "->", doc.spans)
# Out: Le patient a mal au bras, à la jambe et au torse -> {'person': [patient], 'body_part': [bras, jambe, torse]}

for new_doc in exploder(doc):
    print(new_doc.text, "->", new_doc.spans)
# Out: Le patient a mal au bras, à la jambe et au torse -> {'person': [], 'body_part': [bras]}
# Out: Le patient a mal au bras, à la jambe et au torse -> {'person': [], 'body_part': [jambe]}
# Out: Le patient a mal au bras, à la jambe et au torse -> {'person': [], 'body_part': [torse]}
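Because the exploder yields several documents for each input, applying it to a stream of documents amounts to a flat-map. A minimal sketch using only the standard library follows; `flat_map` and `fake_exploder` are hypothetical names, and `fake_exploder` operates on plain strings purely for illustration.

```python
from itertools import chain
from typing import Callable, Iterable, Iterator, TypeVar

D = TypeVar("D")


def flat_map(explode_fn: Callable[[D], Iterable[D]], docs: Iterable[D]) -> Iterator[D]:
    """Lazily apply a one-to-many function to each item and flatten the results."""
    return chain.from_iterable(map(explode_fn, docs))


def fake_exploder(doc: str) -> Iterator[str]:
    """Hypothetical stand-in for the exploder: one output per comma-separated item."""
    for part in doc.split(","):
        yield part.strip()


stream = ["bras, jambe", "torse"]
result = list(flat_map(fake_exploder, stream))
print(result)
```

Since `exploder(doc)` in the example above is iterated in the same way, it could presumably be passed in place of `fake_exploder`, with an iterable of Doc objects as the stream.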

Parameters

PARAMETER DESCRIPTION
span_getter

The span getter used to retrieve spans from the Doc. The default, {"ents": True}, retrieves all entities in doc.ents.

TYPE: SpanGetterArg DEFAULT: {'ents': True}

filter_expr

An optional filter expression to filter the produced documents. The callable expects a single argument, the new Doc, and should return a boolean.

TYPE: Optional[str] DEFAULT: None
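The filtering contract described for filter_expr (a predicate that receives each new document and returns a boolean) can be illustrated with a plain Python callable. This is only a sketch under assumptions: the declared type is a string and how edsnlp evaluates it is not shown here, so `keep_matching` and the dictionary-based documents below are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

# Hypothetical stand-in for a produced document: label -> list of span texts
ToyDoc = Dict[str, List[str]]


def keep_matching(
    new_docs: Iterable[ToyDoc],
    predicate: Optional[Callable[[ToyDoc], bool]] = None,
) -> Iterator[ToyDoc]:
    """Yield only the documents accepted by the predicate (all of them if None)."""
    for new_doc in new_docs:
        if predicate is None or predicate(new_doc):
            yield new_doc


produced = [
    {"body_part": ["bras"]},
    {"body_part": ["jambe"]},
    {"body_part": ["torse"]},
]

# Keep every produced document except the one isolating "jambe"
kept = list(keep_matching(produced, lambda d: d["body_part"][0] != "jambe"))
print(kept)
```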