edsnlp.train
BatchSizeArg
Batch size argument validator / caster for confit/pydantic
Examples
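from confit import validate_arguments
from edsnlp.train import BatchSizeArg

# Assumption: the cast only runs when arguments are validated, e.g. through confit
@validate_arguments
def fn(batch_size: BatchSizeArg):
    return batch_size
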
print(fn("10 samples"))
# Out: (10, "samples")
print(fn("10 words"))
# Out: (10, "words")
print(fn(10))
# Out: (10, "samples")
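The casting behaviour shown above can be approximated in plain Python. This is a minimal sketch and not the edsnlp implementation: `parse_batch_size` is a hypothetical helper, and the rule that a bare integer defaults to the "samples" unit is inferred from the example output above.

```python
from typing import Tuple, Union


def parse_batch_size(value: Union[int, str]) -> Tuple[int, str]:
    """Hypothetical helper mimicking the casts shown in the example above."""
    if isinstance(value, int):
        # A bare integer is interpreted as a number of samples
        return value, "samples"
    parts = value.strip().split()
    if len(parts) != 2:
        raise ValueError(f"Expected '<number> <unit>', got {value!r}")
    number, unit = parts
    if unit not in ("samples", "words"):
        raise ValueError(f"Unknown batch unit: {unit!r}")
    return int(number), unit


print(parse_batch_size("10 words"))
print(parse_batch_size(10))
```

Returning a `(size, unit)` tuple keeps the two pieces of information together, so downstream code can branch on the unit without re-parsing the string.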
LengthSortedBatchSampler
Batch sampler that sorts the dataset by length and then batches sequences of similar length together. This is useful for transformer models, since batches of similar-length sequences need less padding.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| dataset | The dataset to sample from (can be a generator or a fixed-size collection) |
| batch_size | The batch size |
| batch_unit | The unit of the batch size, either "words" or "samples" |
| noise | The amount of noise to add to the sequence length before sorting (uniformly sampled in [-noise, noise]) |
| drop_last | Whether to drop the last batch if it is smaller than the batch size |
| buffer_size | The size of the buffer used to shuffle the batches. If None, the buffer will be approximately the size of the dataset. |
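The idea behind the parameters above can be sketched in plain Python. This is an illustrative approximation, not the edsnlp implementation: `length_sorted_batches` and its `seed` argument are hypothetical, it works on a list of precomputed lengths rather than a dataset, and it shuffles all batches at once instead of using a bounded buffer.

```python
import random
from typing import List, Sequence


def length_sorted_batches(
    lengths: Sequence[int],
    batch_size: int,
    batch_unit: str = "samples",
    noise: float = 0.0,
    drop_last: bool = False,
    seed: int = 0,
) -> List[List[int]]:
    """Hypothetical sketch: group indices of similar length into batches."""
    rng = random.Random(seed)
    # Perturb each length with uniform noise in [-noise, noise] before sorting
    order = sorted(
        range(len(lengths)),
        key=lambda i: lengths[i] + rng.uniform(-noise, noise),
    )
    batches: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx in order:
        # In "words" mode a sample costs its length, otherwise it costs 1
        size = lengths[idx] if batch_unit == "words" else 1
        # Close the current batch if adding this sample would exceed the budget
        if current and current_size + size > batch_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    # Keep the trailing batch unless it is incomplete and drop_last is set
    if current and not (drop_last and current_size < batch_size):
        batches.append(current)
    # Shuffle whole batches so lengths are not seen in increasing order
    rng.shuffle(batches)
    return batches


lengths = [5, 1, 3, 2, 4]
print(length_sorted_batches(lengths, batch_size=2))
```

Shuffling whole batches rather than individual samples keeps sequences of similar length together while still varying the order in which batches are seen.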
SubBatchCollater
Collater that splits batches into sub-batches of a maximum size
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| nlp | The pipeline object |
| embedding | The transformer embedding pipe |
| grad_accumulation_max_tokens | The maximum number of tokens (word pieces) to accumulate in a single batch |
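Splitting a batch under a token budget can be illustrated with a greedy packing loop. This is a sketch under stated assumptions, not the edsnlp implementation: `split_into_sub_batches` is a hypothetical function, and it takes precomputed word-piece counts per sample instead of a pipeline and an embedding pipe.

```python
from typing import List


def split_into_sub_batches(token_counts: List[int], max_tokens: int) -> List[List[int]]:
    """Hypothetical sketch: pack sample indices into sub-batches whose
    total token count stays within max_tokens whenever possible."""
    sub_batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for idx, n_tokens in enumerate(token_counts):
        # Start a new sub-batch if this sample would exceed the budget
        if current and current_tokens + n_tokens > max_tokens:
            sub_batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n_tokens
    if current:
        sub_batches.append(current)
    return sub_batches


print(split_into_sub_batches([40, 50, 30, 80, 10], max_tokens=100))
# [[0, 1], [2], [3, 4]]
```

In this sketch a single sample larger than `max_tokens` still gets its own sub-batch, so no sample is ever dropped.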
Reader
Bases: BaseModel
Reader that reads docs from a file or a generator, and adapts them to the pipeline.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| reader | The reader object |
| limit | The maximum number of docs to read |
| max_length | The maximum length of the resulting docs |
| randomize | Whether to randomize the split |
| multi_sentence | Whether to split sentences across multiple docs |
| filter_expr | An expression to filter the docs to generate |
split_doc
Split a doc into multiple docs of max_length tokens.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| doc | The doc to split |
| RETURNS | DESCRIPTION |
|---|---|
| Iterable[Doc] | |
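The chunking performed by `split_doc` can be pictured on a plain list of tokens. This is only a sketch: `split_tokens` is a hypothetical function working on strings rather than `Doc` objects, and it ignores the `randomize` and `multi_sentence` options of the reader.

```python
from typing import Iterator, List


def split_tokens(tokens: List[str], max_length: int) -> Iterator[List[str]]:
    """Hypothetical sketch: yield consecutive chunks of at most max_length tokens."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    for start in range(0, len(tokens), max_length):
        # The last chunk may be shorter than max_length
        yield tokens[start:start + max_length]


print(list(split_tokens("a b c d e".split(), max_length=2)))
# [['a', 'b'], ['c', 'd'], ['e']]
```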
subset_doc
Subset a doc given a start and end index.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| doc | The doc to subset |
| start | The start index |
| end | The end index |
| RETURNS | DESCRIPTION |
|---|---|
| Doc | |