Explode
Explode a Doc into multiple distinct Doc objects, one per span retrieved through the span_getter: each span ends up alone in its own Doc. Note that entities not selected by the span_getter are lost in the new docs.
Not for pipelines
This component is not meant to be used in a pipeline, but rather as a preprocessing step when dealing with a stream of documents as in the example below.
Difference with eds.split
While eds.split breaks a document into smaller chunks based on length or regex rules, eds.explode creates a separate document for each selected span. eds.split is therefore typically used to segment text to fit context-size or processing constraints, whereas eds.explode is designed for tasks that require span-level mixing, such as training span classifiers: each span is isolated in its own document, and that document keeps the original context.
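The contrast can be illustrated with a minimal sketch in plain Python. It does not use EDS-NLP: `ToyDoc`, `toy_split` and `toy_explode` are hypothetical stand-ins written for this illustration, and spans are modelled as simple character offsets.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc: text plus labelled (start, end) spans."""

    text: str
    spans: dict = field(default_factory=dict)


def toy_split(doc, max_length):
    """Split-like behaviour: cut the text into chunks of at most max_length characters."""
    return [
        ToyDoc(doc.text[i : i + max_length])
        for i in range(0, len(doc.text), max_length)
    ]


def toy_explode(doc, label):
    """Explode-like behaviour: one full-text copy per span of `label`, keeping only that span."""
    for start, end in doc.spans.get(label, []):
        new_spans = {key: [] for key in doc.spans}
        new_spans[label] = [(start, end)]
        yield ToyDoc(doc.text, new_spans)


text = "Le patient a mal au bras, à la jambe et au torse"


def find(word):
    """Return the (start, end) character offsets of the first occurrence of word."""
    start = text.index(word)
    return (start, start + len(word))


doc = ToyDoc(
    text,
    {
        "person": [find("patient")],
        "body_part": [find("bras"), find("jambe"), find("torse")],
    },
)

chunks = toy_split(doc, 20)  # shorter texts, original context cut up
exploded = list(toy_explode(doc, "body_part"))  # same text, one span each
```

In this sketch the split output consists of shorter texts, while every exploded copy carries the full sentence and exactly one `body_part` span.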
Examples
import edsnlp.pipes as eds
from edsnlp.data.converters import MarkupToDocConverter
converter = MarkupToDocConverter(
    preset="xml",
    # Put xml annotated spans in distinct doc.spans[label] groups
    span_setter={"*": True},
)
doc = converter(
    "Le <person>patient</person> a mal au <body_part>bras</body_part>, à la "
    "<body_part>jambe</body_part> et au <body_part>torse</body_part>"
)
exploder = eds.explode(span_getter=["body_part"])
print(doc.text, "->", doc.spans)
# Out: Le patient a mal au bras, à la jambe et au torse -> {'person': [patient], 'body_part': [bras, jambe, torse]}
for new_doc in exploder(doc):
    print(new_doc.text, "->", new_doc.spans)
# Out: Le patient a mal au bras, à la jambe et au torse -> {'person': [], 'body_part': [bras]}
# Out: Le patient a mal au bras, à la jambe et au torse -> {'person': [], 'body_part': [jambe]}
# Out: Le patient a mal au bras, à la jambe et au torse -> {'person': [], 'body_part': [torse]}
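The example above explodes a single document. With a stream of documents, the same one-to-many step can be flattened lazily. The following is a minimal sketch in plain Python, not the EDS-NLP streaming API: `toy_exploder` is a hypothetical stand-in for the callable returned by eds.explode, and documents are modelled as dicts.

```python
from itertools import chain


def toy_exploder(doc):
    """Hypothetical stand-in for an exploder: one output per "body_part" span."""
    for span in doc["spans"].get("body_part", []):
        yield {"text": doc["text"], "spans": {"body_part": [span]}}


def doc_stream():
    """A lazily produced stream of toy documents."""
    yield {"text": "mal au bras et au torse", "spans": {"body_part": ["bras", "torse"]}}
    yield {"text": "aucune douleur", "spans": {"body_part": []}}
    yield {"text": "mal à la jambe", "spans": {"body_part": ["jambe"]}}


# Flat-map: each input doc contributes zero, one or several output docs
exploded = list(chain.from_iterable(toy_exploder(doc) for doc in doc_stream()))

for new_doc in exploded:
    print(new_doc["text"], "->", new_doc["spans"])
```

In this sketch, a document with no selected span contributes nothing to the output, so the resulting stream holds one item per span rather than one per input document.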
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| span_getter | The span getter to use to retrieve spans from the Doc. Default is TYPE: |
| filter_expr | An optional filter expression to filter the produced documents. The callable expects a single argument, the new Doc, and should return a boolean. TYPE: |
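The filtering described for filter_expr amounts to applying a boolean predicate to each produced document. How the expression is written and evaluated by the library is not shown here, so this sketch is only an assumption-laden illustration in plain Python: `keep_doc` is a hypothetical predicate and documents are modelled as dicts.

```python
def keep_doc(new_doc):
    """Hypothetical predicate: keep docs whose single span is longer than 4 characters."""
    (span,) = new_doc["spans"]["body_part"]
    return len(span) > 4


# Toy stand-ins for the documents produced by an explode step
produced = [
    {"text": "Le patient a mal au bras, à la jambe et au torse", "spans": {"body_part": [word]}}
    for word in ("bras", "jambe", "torse")
]

# Only documents for which the predicate returns True are kept
kept = [new_doc for new_doc in produced if keep_doc(new_doc)]

for new_doc in kept:
    print(new_doc["spans"])
```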