Split

The eds.split component splits a document into multiple documents based on a regex pattern, a maximum length, or both.

Not for pipelines

This component is not meant to be used in a pipeline, but rather as a preprocessing step when dealing with a stream of documents as in the example below.

Examples

import edsnlp, edsnlp.pipes as eds

# Create the stream
stream = edsnlp.data.from_iterable(
    ["Sentence 1\n\nThis is another longer sentence more than 5 words"]
)

# Convert texts into docs
stream = stream.map_pipeline(edsnlp.blank("eds"))

# Apply the split component
stream = stream.map(eds.split(max_length=5, regex=r"\n{2,}"))

print(" | ".join(doc.text.strip() for doc in stream))
# Out:
# Sentence 1 | This is another longer sentence | more than 5 words
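
The output above is consistent with a two-stage split: first on the regex, then on length. The following is a minimal pure-Python sketch of that behavior, not the library's implementation. The function name `split_text` is hypothetical, it works on plain strings rather than docs, and it treats whitespace-separated words as a stand-in for tokens.

```python
import re
from typing import List


def split_text(text: str, max_length: int = 0, regex: str = r"\n{2,}") -> List[str]:
    """Illustrative stand-in for eds.split that operates on plain strings."""
    # Step 1: cut the text wherever the regex matches (blank lines by default).
    parts = re.split(regex, text) if regex else [text]
    chunks = []
    for part in parts:
        words = part.split()  # assumption: whitespace words approximate tokens
        if not words:
            continue
        if max_length <= 0:
            # A max_length of 0 disables the length-based split.
            chunks.append(" ".join(words))
            continue
        # Step 2: cut each part into runs of at most max_length words.
        for start in range(0, len(words), max_length):
            chunks.append(" ".join(words[start:start + max_length]))
    return chunks


text = "Sentence 1\n\nThis is another longer sentence more than 5 words"
print(" | ".join(split_text(text, max_length=5)))
# Sentence 1 | This is another longer sentence | more than 5 words
```

With `max_length=0`, the same call returns only the two regex-delimited parts, since the length-based stage is skipped.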

Parameters

PARAMETER DESCRIPTION
max_length

The maximum length, in tokens, of the produced documents. If 0, the document is not split based on length.

TYPE: int DEFAULT: 0

regex

The regex pattern to split the document on.

TYPE: Optional[str] DEFAULT: '\n{2,}'

filter_expr

An optional expression used to filter the produced documents.

TYPE: Optional[str] DEFAULT: None

randomize

The randomization factor used when splitting by length, to avoid producing documents that are all exactly max_length tokens long. A value of 0 gives every document the maximum possible length, while 1 produces documents whose lengths vary uniformly between 0 and max_length.

TYPE: float DEFAULT: 0.0
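
To make the effect of randomize concrete, here is one possible reading of the description above, written as a standalone sketch. It is an assumption about how such a factor could work, not the code used by eds.split. The name `random_chunks`, the `rng` argument, and the clamp to at least one token per chunk are all hypothetical, and the function expects a positive `max_length`.

```python
import random
from typing import List, Optional


def random_chunks(
    tokens: List[str],
    max_length: int,
    randomize: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[List[str]]:
    """Hypothetical length-based splitter with a randomization factor."""
    rng = rng or random.Random()
    chunks = []
    start = 0
    while start < len(tokens):
        # Shrink the target length by a random fraction scaled by `randomize`.
        shrink = randomize * rng.random()
        # Assumption: clamp to at least one token so the loop always advances.
        length = max(1, round(max_length * (1 - shrink)))
        chunks.append(tokens[start:start + length])
        start += length
    return chunks


tokens = [f"t{i}" for i in range(20)]

# randomize=0.0: every chunk has the maximum possible length.
print([len(c) for c in random_chunks(tokens, max_length=5)])
# [5, 5, 5, 5]

# randomize=1.0: chunk lengths vary between 1 and max_length.
print([len(c) for c in random_chunks(tokens, max_length=5, randomize=1.0, rng=random.Random(0))])
```

Intermediate values such as 0.3 narrow the range, so in this sketch chunks stay between roughly 70% and 100% of max_length.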