Split

The `eds.split` component splits a document into multiple documents based on a regex pattern, a maximum length, or both.

Not for pipelines

This component is not meant to be used in a pipeline, but rather as a preprocessing step when dealing with a stream of documents as in the example below.

Examples

import edsnlp, edsnlp.pipes as eds

# Create the stream
stream = edsnlp.data.from_iterable(
    ["Sentence 1\n\nThis is another longer sentence more than 5 words"]
)

# Convert texts into docs
stream = stream.map_pipeline(edsnlp.blank("eds"))

# Apply the split component
stream = stream.map(eds.split(max_length=5, regex="\n{2,}"))

print(" | ".join(doc.text.strip() for doc in stream))
# Out:
# Sentence 1 | This is another longer sentence | more than 5 words
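To make the two splitting stages in the example easier to follow, here is a minimal pure-Python sketch of the same idea: cut on the regex first, then cut each piece into chunks of at most `max_length` tokens. This is not the `eds.split` implementation. The helper name `split_text` is hypothetical, and it treats whitespace-separated words as tokens, whereas the real component works on tokenized docs.

```python
import re
from typing import Iterator, Optional


def split_text(
    text: str, max_length: int = 0, regex: Optional[str] = "\n{2,}"
) -> Iterator[str]:
    """Hypothetical helper: regex split, then length split on whitespace words."""
    # Stage 1: cut on the separator pattern, if one is given
    parts = re.split(regex, text) if regex is not None else [text]
    for part in parts:
        # Assumption: whitespace-separated words stand in for tokens
        words = part.split()
        if not words:
            continue  # skip empty pieces
        # Stage 2: chunks of at most max_length words (0 disables this stage)
        step = max_length if max_length > 0 else len(words)
        for start in range(0, len(words), step):
            yield " ".join(words[start:start + step])


chunks = list(
    split_text(
        "Sentence 1\n\nThis is another longer sentence more than 5 words",
        max_length=5,
    )
)
print(" | ".join(chunks))
# Sentence 1 | This is another longer sentence | more than 5 words
```

On this input the sketch gives the same pieces as the example above, but only because whitespace words and tokens happen to coincide here.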

Parameters

PARAMETER	DESCRIPTION
`max_length`	The maximum length of the produced documents, in tokens. If 0, documents are not split based on length. TYPE: `int` DEFAULT: `0`
`regex`	The regex pattern to split the document on. TYPE: `Optional[str]` DEFAULT: `'\n{2,}'`
`filter_expr`	An optional expression used to filter the produced documents. TYPE: `Optional[str]` DEFAULT: `None`
`randomize`	The randomization factor applied when splitting by length, to avoid producing documents that are all exactly `max_length` tokens long. With 0, every document has the maximum possible length; with 1, document lengths vary uniformly between 0 and `max_length`. TYPE: `float` DEFAULT: `0.0`
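The effect of `randomize` on chunk lengths can be illustrated with a small sketch. The formula below (shrinking the cap by a random fraction of up to `randomize`) is an assumption chosen to match the description above, not necessarily what `eds.split` computes. The helper `draw_chunk_lengths` and its `seed` argument are hypothetical.

```python
import random
from typing import List


def draw_chunk_lengths(
    n_tokens: int, max_length: int, randomize: float = 0.0, seed: int = 0
) -> List[int]:
    """Hypothetical helper: chunk lengths for a document of n_tokens tokens."""
    if max_length <= 0:
        # No length-based splitting: a single chunk (or none for an empty doc)
        return [n_tokens] if n_tokens > 0 else []
    rng = random.Random(seed)
    lengths = []
    remaining = n_tokens
    while remaining > 0:
        # Assumed formula: randomize=0 keeps the full cap,
        # randomize=1 draws the cap uniformly below max_length
        cap = max_length * (1.0 - randomize * rng.random())
        # Force at least one token per chunk so the loop always terminates
        size = min(remaining, max(1, int(cap)))
        lengths.append(size)
        remaining -= size
    return lengths


fixed = draw_chunk_lengths(23, max_length=5, randomize=0.0)
varied = draw_chunk_lengths(23, max_length=5, randomize=1.0)
print(fixed)   # [5, 5, 5, 5, 3]
print(varied)  # lengths between 1 and 5, summing to 23
```

With `randomize=0.0` every chunk but the last is exactly `max_length` long. Larger values trade that uniformity for varied lengths.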