edsnlp.train
BatchSizeArg
Batch size argument validator / caster for confit/pydantic
Examples
def fn(batch_size: BatchSizeArg):
    return batch_size
print(fn("10 samples"))
# Out: (10, "samples")
print(fn("10 words"))
# Out: (10, "words")
print(fn(10))
# Out: (10, "samples")
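To make the casting rule in the examples above concrete, here is a minimal standalone sketch of the same behavior. The function name `parse_batch_size` and the regular expression are assumptions made for illustration; this is not the library's implementation, and the real validator may accept other forms.

```python
import re
from typing import Tuple, Union


def parse_batch_size(value: Union[int, str]) -> Tuple[int, str]:
    """Hypothetical helper: cast an int or a "<n> <unit>" string to (n, unit)."""
    if isinstance(value, int):
        # A bare integer is interpreted as a number of samples
        return value, "samples"
    match = re.fullmatch(r"\s*(\d+)\s*(samples|words)\s*", value)
    if match is None:
        raise ValueError(f"Invalid batch size: {value!r}")
    return int(match.group(1)), match.group(2)


print(parse_batch_size("10 samples"))  # (10, 'samples')
print(parse_batch_size("10 words"))    # (10, 'words')
print(parse_batch_size(10))            # (10, 'samples')
```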
LengthSortedBatchSampler
Batch sampler that sorts the dataset by length and then batches sequences of similar length together. This is useful for transformer models, since batches of similar-length sequences need less padding.
Parameters
PARAMETER | DESCRIPTION |
---|---|
dataset | The dataset to sample from (can be a generator or a fixed-size collection) |
batch_size | The batch size TYPE: |
batch_unit | The unit of the batch size, either "words" or "samples" TYPE: |
noise | The amount of noise to add to the sequence length before sorting (uniformly sampled in [-noise, noise]) DEFAULT: |
drop_last | Whether to drop the last batch if it is smaller than the batch size DEFAULT: |
buffer_size | The size of the buffer to use to shuffle the batches. If None, the buffer will be approximately the size of the dataset. TYPE: |
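The parameters above can be illustrated with a rough sketch of length-sorted batching. It is a simplification under stated assumptions: the function `length_sorted_batches` is hypothetical, it works on a list of precomputed lengths rather than a dataset, it ignores `buffer_size` by treating the whole collection as one buffer, and the `seed` argument is added only to make the example reproducible.

```python
import random
from typing import List, Sequence


def length_sorted_batches(
    lengths: Sequence[int],
    batch_size: int,
    batch_unit: str = "samples",
    noise: float = 0.0,
    drop_last: bool = False,
    seed: int = 0,
) -> List[List[int]]:
    """Hypothetical sketch: return batches of indices with similar lengths."""
    rng = random.Random(seed)
    # Sort indices by length, perturbed by uniform noise in [-noise, noise]
    keyed = [(n + rng.uniform(-noise, noise), i) for i, n in enumerate(lengths)]
    order = [i for _, i in sorted(keyed)]

    batches, current, current_size = [], [], 0
    for i in order:
        # A sample costs its length in "words" mode, and 1 in "samples" mode
        size = lengths[i] if batch_unit == "words" else 1
        # Close the current batch when adding this item would exceed the budget
        if current and current_size + size > batch_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    # Keep the trailing batch unless it is incomplete and drop_last is set
    if current and not (drop_last and current_size < batch_size):
        batches.append(current)
    # Shuffle the order of the batches, not their content
    rng.shuffle(batches)
    return batches


lengths = [5, 1, 4, 2, 3, 6]
for batch in length_sorted_batches(lengths, batch_size=2):
    print(batch, [lengths[i] for i in batch])
```

With `noise=0` every batch holds neighbors in the sorted order; raising `noise` trades some padding efficiency for more varied batch composition between epochs.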
SubBatchCollater
Collater that splits batches into sub-batches of a maximum size
Parameters
PARAMETER | DESCRIPTION |
---|---|
nlp | The pipeline object |
embedding | The transformer embedding pipe |
grad_accumulation_max_tokens | The maximum number of tokens (word pieces) to accumulate in a single batch |
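As an illustration of the idea behind `grad_accumulation_max_tokens`, the sketch below greedily groups samples so that each sub-batch stays within a token budget. The helper `split_into_sub_batches` is hypothetical and operates on plain token counts; the actual collater works with the pipeline and the transformer embedding pipe, which this example does not model.

```python
from typing import List


def split_into_sub_batches(token_counts: List[int], max_tokens: int) -> List[List[int]]:
    """Hypothetical helper: group sample indices so each group fits in max_tokens."""
    sub_batches, current, current_tokens = [], [], 0
    for index, n_tokens in enumerate(token_counts):
        # Start a new sub-batch when this sample would exceed the budget
        if current and current_tokens + n_tokens > max_tokens:
            sub_batches.append(current)
            current, current_tokens = [], 0
        # A single sample larger than max_tokens still gets its own sub-batch
        current.append(index)
        current_tokens += n_tokens
    if current:
        sub_batches.append(current)
    return sub_batches


print(split_into_sub_batches([120, 300, 80, 500, 60], max_tokens=512))
# [[0, 1, 2], [3], [4]]
```

Each sub-batch can then be run through a forward and backward pass separately, with gradients accumulated before the optimizer step.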
Reader
Bases: BaseModel
Reader that reads docs from a file or a generator, and adapts them to the pipeline.
Parameters
PARAMETER | DESCRIPTION |
---|---|
reader | The reader object |
limit | The maximum number of docs to read |
max_length | The maximum length of the resulting docs |
randomize | Whether to randomize the split |
multi_sentence | Whether to split sentences across multiple docs |
filter_expr | An expression to filter the docs to generate |
split_doc
Split a doc into multiple docs of max_length tokens.
Parameters
PARAMETER | DESCRIPTION |
---|---|
doc | The doc to split TYPE: |
RETURNS | DESCRIPTION |
---|---|
Iterable[Doc] | The docs obtained by splitting the input doc |
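The core of splitting by `max_length` can be sketched on a plain list of tokens. This is an assumption-laden simplification: the helper `split_tokens` is hypothetical, it cuts at fixed offsets, and it ignores sentence boundaries and annotations, which the real method may take into account.

```python
from typing import Iterator, List


def split_tokens(tokens: List[str], max_length: int) -> Iterator[List[str]]:
    """Hypothetical helper: yield consecutive chunks of at most max_length tokens."""
    if max_length <= 0:
        raise ValueError("max_length must be a positive integer")
    for start in range(0, len(tokens), max_length):
        yield tokens[start:start + max_length]


tokens = "The patient was admitted for chest pain".split()
print(list(split_tokens(tokens, max_length=3)))
# [['The', 'patient', 'was'], ['admitted', 'for', 'chest'], ['pain']]
```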
subset_doc
Subset a doc given a start and end index.
Parameters
PARAMETER | DESCRIPTION |
---|---|
doc | The doc to subset TYPE: |
start | The start index TYPE: |
end | The end index TYPE: |
RETURNS | DESCRIPTION |
---|---|
Doc | The subset of the doc between the start and end indices |
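Subsetting a doc usually means more than slicing its tokens, since annotations have to be re-offset too. The toy sketch below shows one way this could work. `ToyDoc` and `subset_toy_doc` are hypothetical stand-ins, not the library's classes; the sketch assumes `end` is exclusive and that entities only partly inside the window are dropped, neither of which is confirmed by the reference above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyDoc:
    tokens: List[str]
    # Entities as (start, end, label) token offsets, end exclusive
    ents: List[Tuple[int, int, str]] = field(default_factory=list)


def subset_toy_doc(doc: ToyDoc, start: int, end: int) -> ToyDoc:
    """Hypothetical helper: keep tokens in [start, end) and shift entity offsets."""
    kept = [
        (s - start, e - start, label)
        for s, e, label in doc.ents
        # Assumption: only entities fully inside the window are kept
        if s >= start and e <= end
    ]
    return ToyDoc(tokens=doc.tokens[start:end], ents=kept)


doc = ToyDoc(
    tokens=["Patient", "has", "type", "2", "diabetes", "and", "asthma"],
    ents=[(2, 5, "disease"), (6, 7, "disease")],
)
sub = subset_toy_doc(doc, 2, 6)
print(sub.tokens)  # ['type', '2', 'diabetes', 'and']
print(sub.ents)    # [(0, 3, 'disease')]
```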