`edsnlp.train`

`BatchSizeArg`

Batch size argument validator and caster for confit/pydantic.

Examples

def fn(batch_size: BatchSizeArg):
    return batch_size


print(fn("10 samples"))
# Out: (10, "samples")

print(fn("10 words"))
# Out: (10, "words")

print(fn(10))
# Out: (10, "samples")
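The casting rule shown in the examples above can be sketched in plain Python. This is only an illustration of the observed behaviour, not the library's implementation: `parse_batch_size` is a hypothetical helper, and the restriction to the units "samples" and "words" is an assumption based on the examples.

```python
from typing import Tuple, Union


def parse_batch_size(value: Union[int, str]) -> Tuple[int, str]:
    """Hypothetical helper: cast 10 or "10 words" to a (size, unit) pair."""
    if isinstance(value, int):
        # A bare integer falls back to the "samples" unit, as in the examples.
        return value, "samples"
    parts = value.split()
    # Assumption: only "samples" and "words" are accepted as units.
    if len(parts) != 2 or parts[1] not in ("samples", "words"):
        raise ValueError(f"Invalid batch size: {value!r}")
    return int(parts[0]), parts[1]


print(parse_batch_size("10 words"))
```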

`LengthSortedBatchSampler`

Batch sampler that sorts the dataset by length and then batches sequences of similar length together. This is useful for transformer models, because batches of similar-length sequences need less padding.

Parameters

PARAMETER	DESCRIPTION
`dataset`	The dataset to sample from (can be a generator or a fixed size collection)
`batch_size`	The batch size TYPE: `int`
`batch_unit`	The unit of the batch size, either "words" or "samples" TYPE: `str`
`noise`	The amount of noise to add to the sequence length before sorting (uniformly sampled in [-noise, noise]) DEFAULT: `1`
`drop_last`	Whether to drop the last batch if it is smaller than the batch size DEFAULT: `True`
`buffer_size`	The size of the buffer to use to shuffle the batches. If None, the buffer will be approximately the size of the dataset. TYPE: `Optional[int]` DEFAULT: `None`
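The idea behind the parameters above can be illustrated with a small standalone sketch. It is not the class's actual code: `length_sorted_batches` and its `seed` argument are hypothetical, sequences are plain token lists, and the `buffer_size` mechanism is ignored (all batches are shuffled at once).

```python
import random
from typing import List, Sequence


def length_sorted_batches(
    sequences: Sequence[Sequence[str]],
    batch_size: int,
    batch_unit: str = "samples",
    noise: float = 1.0,
    drop_last: bool = True,
    seed: int = 0,  # hypothetical argument, only here for reproducibility
) -> List[List[int]]:
    """Return batches of indices grouping sequences of similar length."""
    rng = random.Random(seed)
    # Sort indices by length plus uniform noise in [-noise, noise].
    order = sorted(
        range(len(sequences)),
        key=lambda i: len(sequences[i]) + rng.uniform(-noise, noise),
    )
    batches, current, current_size = [], [], 0
    for i in order:
        # A sequence costs its length in "words" mode, 1 in "samples" mode.
        cost = len(sequences[i]) if batch_unit == "words" else 1
        if current and current_size + cost > batch_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += cost
    # Keep the trailing batch unless it is incomplete and drop_last is set.
    if current and not (drop_last and current_size < batch_size):
        batches.append(current)
    # Shuffle whole batches so training does not see lengths in order.
    rng.shuffle(batches)
    return batches


docs = [["a"] * n for n in (5, 1, 3, 2, 4, 6)]
batches = length_sorted_batches(docs, batch_size=2, noise=0)
print(batches)  # three batches of two indices each, in shuffled order
```

Adding noise before sorting keeps batches roughly homogeneous in length while varying their composition from one epoch to the next.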

`SubBatchCollater`

Collater that splits batches into sub-batches of a maximum size.

Parameters

PARAMETER	DESCRIPTION
`nlp`	The pipeline object
`embedding`	The transformer embedding pipe
`grad_accumulation_max_tokens`	The maximum number of tokens (word pieces) to accumulate in a single batch
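As a rough illustration of splitting by a token budget, the sketch below groups sample indices so that each sub-batch stays within a maximum number of tokens. `split_into_sub_batches` is a hypothetical function working on precomputed token counts; the real collater operates on the pipeline and its transformer embedding, which are not modelled here.

```python
from typing import Iterator, List


def split_into_sub_batches(
    token_counts: List[int], max_tokens: int
) -> Iterator[List[int]]:
    """Yield lists of sample indices whose token counts sum to <= max_tokens."""
    current: List[int] = []
    total = 0
    for index, count in enumerate(token_counts):
        # Close the current sub-batch when the next sample would overflow it.
        if current and total + count > max_tokens:
            yield current
            current, total = [], 0
        # Assumption: a single oversized sample still gets its own sub-batch.
        current.append(index)
        total += count
    if current:
        yield current


print(list(split_into_sub_batches([3, 4, 2, 6, 1], max_tokens=7)))
# Out: [[0, 1], [2], [3, 4]]
```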

`Reader`

Bases: BaseModel

Reader that reads docs from a file or a generator, and adapts them to the pipeline.

Parameters

PARAMETER	DESCRIPTION
`reader`	The reader object
`limit`	The maximum number of docs to read
`max_length`	The maximum length of the resulting docs
`randomize`	Whether to randomize the split
`multi_sentence`	Whether to split sentences across multiple docs
`filter_expr`	An expression to filter the docs to generate

`split_doc`

Split a doc into multiple docs of at most `max_length` tokens each.

Parameters

PARAMETER	DESCRIPTION
`doc`	The doc to split TYPE: `Doc`

RETURNS	DESCRIPTION
`Iterable[Doc]`
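A minimal sketch of length-based splitting, assuming a doc can be represented as a plain list of tokens. `split_tokens` is a hypothetical stand-in: it cuts at fixed offsets and ignores sentence boundaries, which the actual method may take into account.

```python
from typing import Iterator, List


def split_tokens(tokens: List[str], max_length: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most max_length tokens."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    for start in range(0, len(tokens), max_length):
        yield tokens[start : start + max_length]


print(list(split_tokens("a b c d e".split(), max_length=2)))
# Out: [['a', 'b'], ['c', 'd'], ['e']]
```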

`subset_doc`

Subset a doc given a start and end index.

Parameters

PARAMETER	DESCRIPTION
`doc`	The doc to subset TYPE: `Doc`
`start`	The start index TYPE: `int`
`end`	The end index TYPE: `int`

RETURNS	DESCRIPTION
`Doc`
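Subsetting can be pictured on a simplified representation: a token list plus annotated spans. The sketch below is an assumption-laden illustration, not the method itself. `subset_tokens` is hypothetical, `end` is treated as exclusive, and spans that are not fully inside the window are dropped.

```python
from typing import List, Tuple

# (start token, end token, label), with an exclusive end index
Span = Tuple[int, int, str]


def subset_tokens(
    tokens: List[str], spans: List[Span], start: int, end: int
) -> Tuple[List[str], List[Span]]:
    """Keep tokens[start:end] and re-offset the spans fully inside that window."""
    kept_tokens = tokens[start:end]
    kept_spans = [
        (s - start, e - start, label)
        for s, e, label in spans
        if s >= start and e <= end
    ]
    return kept_tokens, kept_spans


tokens = ["The", "patient", "has", "mild", "asthma", "."]
spans = [(1, 2, "PERSON"), (3, 5, "DISEASE")]
print(subset_tokens(tokens, spans, start=2, end=6))
# Out: (['has', 'mild', 'asthma', '.'], [(1, 3, 'DISEASE')])
```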