Split
The eds.split component splits a document into multiple documents based on a regex pattern or a maximum length.
Not for pipelines
This component is not meant to be used inside a pipeline. Use it as a preprocessing step when dealing with a stream of documents, as in the example below.
Examples
```python
import edsnlp, edsnlp.pipes as eds

# Create the stream
stream = edsnlp.data.from_iterable(
    ["Sentence 1\n\nThis is another longer sentence more than 5 words"]
)

# Convert texts into docs
stream = stream.map_pipeline(edsnlp.blank("eds"))

# Apply the split component (raw string so the regex engine sees \n{2,})
stream = stream.map(eds.split(max_length=5, regex=r"\n{2,}"))

print(" | ".join(doc.text.strip() for doc in stream))
# Out:
# Sentence 1 | This is another longer sentence | more than 5 words
```
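
To make the two splitting criteria concrete, here is a minimal pure-Python sketch of the same idea: split on the regex first, then cut each piece into chunks of at most `max_length` units. It is an illustration only, not the implementation of eds.split. The helper name `split_text` is hypothetical, and it assumes whitespace-separated words as a stand-in for the tokens produced by the real tokenizer.

```python
import re


def split_text(text, max_length=0, regex=None):
    """Hypothetical helper: split on a regex, then cap each piece at max_length words."""
    # Stage 1: split on the regex pattern, if one is given
    pieces = re.split(regex, text) if regex else [text]

    chunks = []
    for piece in pieces:
        # Assumption: whitespace-separated words approximate tokens
        words = piece.split()
        if not words:
            continue  # skip empty pieces left over by the regex split
        if max_length <= 0:
            # A max_length of 0 disables length-based splitting
            chunks.append(" ".join(words))
            continue
        # Stage 2: cut the piece into consecutive windows of max_length words
        for start in range(0, len(words), max_length):
            chunks.append(" ".join(words[start:start + max_length]))
    return chunks


text = "Sentence 1\n\nThis is another longer sentence more than 5 words"
chunks = split_text(text, max_length=5, regex=r"\n{2,}")
print(" | ".join(chunks))
# Sentence 1 | This is another longer sentence | more than 5 words
```

On this input the sketch yields the same three pieces as the example above. On other texts it may differ from eds.split, because a real tokenizer also separates punctuation and other symbols from words.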
Parameters
PARAMETER | DESCRIPTION |
---|---|
max_length | The maximum length (in tokens) of the produced documents. If 0, the document is not split based on length. TYPE: |
regex | The regex pattern to split the document on. TYPE: |
filter_expr | An optional filter expression used to filter the produced documents. TYPE: |
randomize | The randomization factor to split the documents, to avoid producing documents that are all TYPE: |