edsnlp.train

BatchSizeArg

Batch size argument validator / caster for confit/pydantic

Examples

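from edsnlp.train import BatchSizeArg


# Note: the cast shown below only happens when the call is validated
# by confit/pydantic; a plain Python call returns its argument unchanged.
def fn(batch_size: BatchSizeArg):
    return batch_size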


print(fn("10 samples"))
# Out: (10, "samples")

print(fn("10 words"))
# Out: (10, "words")

print(fn(10))
# Out: (10, "samples")
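
The casting rule shown in the outputs above can be sketched in plain Python. This is only an illustration of the observed behaviour, not the library's implementation: `parse_batch_size` is a hypothetical helper, and the set of accepted units is assumed to be limited to "samples" and "words", as in the examples.

```python
import re
from typing import Tuple, Union


def parse_batch_size(value: Union[int, str]) -> Tuple[int, str]:
    """Cast 10, "10 samples" or "10 words" to a (size, unit) pair.

    Hypothetical helper: it mirrors the outputs documented above and
    assumes "samples" and "words" are the only valid units.
    """
    if isinstance(value, int):
        # A bare integer is interpreted as a number of samples
        return value, "samples"
    match = re.fullmatch(r"\s*(\d+)\s+(samples|words)\s*", value)
    if match is None:
        raise ValueError(f"Invalid batch size: {value!r}")
    return int(match.group(1)), match.group(2)


print(parse_batch_size("10 words"))
```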

LengthSortedBatchSampler

Batch sampler that sorts the dataset by length and then batches sequences of similar length together. This is useful for transformer models, because batches of similar-length sequences need less padding.

Parameters

PARAMETER DESCRIPTION
dataset

The dataset to sample from (can be a generator or a fixed size collection)

batch_size

The batch size, counted in the unit given by batch_unit

TYPE: int

batch_unit

The unit of the batch size, either "words" or "samples"

TYPE: str

noise

The amount of noise to add to the sequence length before sorting (uniformly sampled in [-noise, noise])

DEFAULT: 1

drop_last

Whether to drop the last batch if it is smaller than the batch size

DEFAULT: True

buffer_size

The size of the buffer to use to shuffle the batches. If None, the buffer will be approximately the size of the dataset.

TYPE: Optional[int] DEFAULT: None
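
To make the interaction between these parameters concrete, here is a minimal sketch of length-sorted batching. It is not the sampler's actual code: `length_sorted_batches` and its `seed` argument are hypothetical, the input is a list of precomputed sequence lengths rather than a dataset, and the whole input is treated as a single buffer (roughly the buffer_size=None case).

```python
import random
from typing import List, Sequence


def length_sorted_batches(
    lengths: Sequence[int],
    batch_size: int,
    batch_unit: str = "samples",
    noise: float = 1,
    drop_last: bool = True,
    seed: int = 0,  # hypothetical, only here to make the sketch reproducible
) -> List[List[int]]:
    """Return batches of indices into `lengths`, grouped by similar length."""
    rng = random.Random(seed)
    # Sort by length plus uniform noise in [-noise, noise], so that the
    # grouping is not exactly the same at every pass over the data.
    order = sorted(
        range(len(lengths)),
        key=lambda i: lengths[i] + rng.uniform(-noise, noise),
    )
    batches: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for i in order:
        # A sample costs its length in "words" mode and 1 in "samples" mode
        cost = lengths[i] if batch_unit == "words" else 1
        if current and current_size + cost > batch_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += cost
    # Keep the trailing batch unless it is incomplete and drop_last is set
    if current and not (drop_last and current_size < batch_size):
        batches.append(current)
    # Shuffle the batches so that training does not go from short to long
    rng.shuffle(batches)
    return batches


print(length_sorted_batches([5, 1, 4, 2, 3, 6], batch_size=2, noise=0))
```

With noise=0 the grouping is deterministic (only the batch order is shuffled); a larger noise trades some padding efficiency for more varied batches.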

SubBatchCollater

Collater that splits batches into sub-batches of a maximum size

Parameters

PARAMETER DESCRIPTION
nlp

The pipeline object

embedding

The transformer embedding pipe

grad_accumulation_max_tokens

The maximum number of tokens (word pieces) to accumulate in a single batch
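
The idea of splitting a batch under a token budget can be sketched as follows. This is an illustration under simplifying assumptions, not the collater's implementation: `split_into_sub_batches` is a hypothetical helper that works on precomputed per-sample token counts, whereas the real collater receives the pipeline and the embedding pipe.

```python
from typing import List, Sequence


def split_into_sub_batches(
    token_counts: Sequence[int],
    max_tokens: int,
) -> List[List[int]]:
    """Greedily group sample indices so each group stays within max_tokens.

    A sample that alone exceeds max_tokens still gets its own sub-batch.
    """
    sub_batches: List[List[int]] = []
    current: List[int] = []
    total = 0
    for index, count in enumerate(token_counts):
        # Start a new sub-batch when adding this sample would exceed the budget
        if current and total + count > max_tokens:
            sub_batches.append(current)
            current, total = [], 0
        current.append(index)
        total += count
    if current:
        sub_batches.append(current)
    return sub_batches


print(split_into_sub_batches([30, 50, 40, 20, 90], max_tokens=100))
# Out: [[0, 1], [2, 3], [4]]
```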

Reader

Bases: BaseModel

Reader that reads docs from a file or a generator, and adapts them to the pipeline.

Parameters

PARAMETER DESCRIPTION
reader

The reader object

limit

The maximum number of docs to read

max_length

The maximum length of the resulting docs

randomize

Whether to randomize the split

multi_sentence

Whether to split sentences across multiple docs

filter_expr

An expression to filter the docs to generate

split_doc

Split a doc into multiple docs of max_length tokens.

Parameters

PARAMETER DESCRIPTION
doc

The doc to split

TYPE: Doc

RETURNS DESCRIPTION
Iterable[Doc]
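
As a rough illustration of the chunking step, the sketch below splits a plain list of tokens into pieces of at most max_length tokens. `split_tokens` is a hypothetical stand-in that operates on strings rather than Doc objects, and it ignores sentence boundaries and the randomize and multi_sentence options of the Reader.

```python
from typing import Iterator, List, Sequence


def split_tokens(tokens: Sequence[str], max_length: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most max_length tokens."""
    if max_length <= 0:
        # Assumption of this sketch: a non-positive max_length disables splitting
        yield list(tokens)
        return
    for start in range(0, len(tokens), max_length):
        yield list(tokens[start : start + max_length])


print(list(split_tokens("a b c d e".split(), max_length=2)))
# Out: [['a', 'b'], ['c', 'd'], ['e']]
```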

subset_doc

Subset a doc given a start and end index.

Parameters

PARAMETER DESCRIPTION
doc

The doc to subset

TYPE: Doc

start

The start index

TYPE: int

end

The end index

TYPE: int

RETURNS DESCRIPTION
Doc
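
Subsetting by token indices can be pictured with plain Python data. The sketch below is an assumption-laden illustration, not the method's code: `subset_tokens` is hypothetical, the end index is assumed to be exclusive, and annotations are modelled as (start, end, label) tuples that are kept only when they fall entirely inside the window.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def subset_tokens(
    tokens: Sequence[str],
    spans: Sequence[Span],
    start: int,
    end: int,
) -> Tuple[List[str], List[Span]]:
    """Return tokens[start:end] and the spans inside it, re-indexed from 0."""
    kept = [
        (s - start, e - start, label)
        for s, e, label in spans
        # Spans that cross the window boundaries are dropped in this sketch
        if s >= start and e <= end
    ]
    return list(tokens[start:end]), kept


tokens = ["The", "patient", "has", "severe", "asthma", "today"]
spans = [(1, 2, "PERSON"), (3, 5, "DISEASE")]
print(subset_tokens(tokens, spans, 2, 5))
# Out: (['has', 'severe', 'asthma'], [(1, 3, 'DISEASE')])
```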