# edsnlp.utils.batching

## BatchSizeArg

Bases: `Validated`

Batch size argument validator / caster for confit/pydantic.

Examples

```python
def fn(batch_size: BatchSizeArg):
    return batch_size


print(fn("10 samples"))
# Out: (10, "samples")
print(fn("10 words"))
# Out: (10, "words")
print(fn(10))
# Out: (10, "samples")
```
## batchify

Yields batches that contain at most `batch_size` elements. If an item contains more than `batch_size` elements, it will be yielded as a single batch.

Parameters

| Parameter | Description |
|---|---|
| `iterable` | The iterable to batchify. |
| `batch_size` | The maximum number of elements in a batch. |
| `drop_last` | Whether to drop the last batch if it is smaller than `batch_size`. |
| `sentinel_mode` | How to handle sentinel values in the iterable. |
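To illustrate the behavior described above, here is a minimal sketch of count-based batching (ignoring sentinel handling, which the real function also supports); `simple_batchify` is a hypothetical stand-in, not edsnlp's implementation:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def simple_batchify(
    iterable: Iterable[T], batch_size: int, drop_last: bool = False
) -> Iterator[List[T]]:
    batch: List[T] = []
    for item in iterable:
        batch.append(item)
        # Flush as soon as the batch reaches the requested size
        if len(batch) == batch_size:
            yield batch
            batch = []
    # The trailing partial batch is kept unless drop_last is set
    if batch and not drop_last:
        yield batch


print(list(simple_batchify(range(5), 2)))
# -> [[0, 1], [2, 3], [4]]
```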
## batchify_by_length_sum

Yields batches that contain at most `batch_size` words. If an item contains more than `batch_size` words, it will be yielded as a single batch.

Parameters

| Parameter | Description |
|---|---|
| `iterable` | The iterable to batchify. |
| `batch_size` | The maximum number of words in a batch. |
| `drop_last` | Whether to drop the last batch if it is smaller than `batch_size`. |
| `sentinel_mode` | How to handle sentinel values in the iterable. |

Returns

| Returns | Description |
|---|---|
| `Iterable[List[T]]` | The batched iterable. |
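A simplified sketch of the length-sum idea: batches are bounded by the total word count, and an oversized item is still yielded on its own. This hypothetical version counts words by splitting strings, whereas the real function reads per-item counts:

```python
from typing import Iterable, Iterator, List


def batchify_by_word_count(
    texts: Iterable[str], max_words: int, drop_last: bool = False
) -> Iterator[List[str]]:
    batch: List[str] = []
    total = 0
    for text in texts:
        n = len(text.split())
        # Flush the current batch before this item would exceed the budget
        if batch and total + n > max_words:
            yield batch
            batch, total = [], 0
        batch.append(text)
        total += n
    if batch and not drop_last:
        yield batch


print(list(batchify_by_word_count(["a b", "c", "d e f g h"], 3)))
# -> [['a b', 'c'], ['d e f g h']]
```

Note that `"d e f g h"` (5 words) exceeds the 3-word budget, so it forms a batch by itself, matching the documented behavior.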
## batchify_by_padded

Yields batches that contain at most `batch_size` padded words, i.e. the total number of words if all items were padded to the length of the longest item.

Parameters

| Parameter | Description |
|---|---|
| `iterable` | The iterable to batchify. |
| `batch_size` | The maximum number of padded words in a batch. |
| `drop_last` | Whether to drop the last batch if it is smaller than `batch_size`. |
| `sentinel_mode` | How to handle sentinel values in the iterable. |

Returns

| Returns | Description |
|---|---|
| `Iterable[List[T]]` | The batched iterable. |
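The padded cost of a batch is `len(batch) * length_of_longest_item`, as if every item were padded to the longest one. A hypothetical sketch of that accounting (not edsnlp's implementation):

```python
from typing import Iterator, List, Sequence


def batchify_by_padded_size(
    seqs: Sequence[Sequence[int]], max_padded: int, drop_last: bool = False
) -> Iterator[List[Sequence[int]]]:
    batch: List[Sequence[int]] = []
    longest = 0
    for seq in seqs:
        new_longest = max(longest, len(seq))
        # Cost if we add this item: (batch size + 1) * new longest length
        if batch and (len(batch) + 1) * new_longest > max_padded:
            yield batch
            batch = []
            new_longest = len(seq)
        batch.append(seq)
        longest = new_longest
    if batch and not drop_last:
        yield batch


print(list(batchify_by_padded_size([[1, 2], [3], [4, 5, 6]], 6)))
# -> [[[1, 2], [3]], [[4, 5, 6]]]
```

Here `[1, 2]` and `[3]` pad to a 2x2 = 4-word batch, but adding `[4, 5, 6]` would pad everything to length 3 (3x3 = 9 > 6), so a new batch is started.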
## batchify_by_dataset

Yields batches that contain at most `batch_size` datasets. If an item contains more than `batch_size` datasets, it will be yielded as a single batch.

Parameters

| Parameter | Description |
|---|---|
| `iterable` | The iterable to batchify. |
| `batch_size` | Unused; always one full dataset per batch. |
| `drop_last` | Whether to drop the last batch if it is smaller than `batch_size`. |
| `sentinel_mode` | How to handle sentinel values in the iterable. |

Returns

| Returns | Description |
|---|---|
| `Iterable[List[T]]` | The batched iterable. |
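A rough sketch of the grouping idea, assuming each item carries a hypothetical `"dataset"` key identifying its origin (the real function tracks dataset boundaries via sentinel values in the stream, not a per-item key):

```python
from typing import Dict, Iterable, Iterator, List


def group_by_dataset(items: Iterable[Dict]) -> Iterator[List[Dict]]:
    batch: List[Dict] = []
    current = None
    for item in items:
        key = item["dataset"]  # hypothetical marker, for illustration only
        # Start a new batch whenever the dataset changes
        if batch and key != current:
            yield batch
            batch = []
        batch.append(item)
        current = key
    if batch:
        yield batch


items = [{"dataset": "a", "i": 0}, {"dataset": "a", "i": 1}, {"dataset": "b", "i": 2}]
print(list(group_by_dataset(items)))
# -> [[{'dataset': 'a', 'i': 0}, {'dataset': 'a', 'i': 1}], [{'dataset': 'b', 'i': 2}]]
```

The same boundary-based grouping applies to fragment-level batching (`batchify_by_fragment` below), with fragments in place of datasets.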
## batchify_by_fragment

Yields batches that contain at most `batch_size` fragments. If an item contains more than `batch_size` fragments, it will be yielded as a single batch.

Parameters

| Parameter | Description |
|---|---|
| `iterable` | The iterable to batchify. |
| `batch_size` | Unused; always one full fragment per batch. |
| `drop_last` | Whether to drop the last batch if it is smaller than `batch_size`. |
| `sentinel_mode` | How to handle sentinel values in the iterable. |

Returns

| Returns | Description |
|---|---|
| `Iterable[List[T]]` | The batched iterable. |
## stat_batchify

Creates a batching function that uses the value of a specific key in the items to determine the batch size. This function is primarily meant to be used on the flattened outputs of the `preprocess` method of a `Pipeline` object.

It expects each item to be a dictionary in which some keys contain the string "/stats/" together with the key pattern. For instance:
```python
from edsnlp.utils.batching import stat_batchify

items = [
    {"text": "first sample", "obj/stats/words": 2, "obj/stats/chars": 12},
    {"text": "dos", "obj/stats/words": 1, "obj/stats/chars": 3},
    {"text": "third one !", "obj/stats/words": 3, "obj/stats/chars": 11},
]
batcher = stat_batchify("words")
assert list(batcher(items, 4)) == [
    [
        {"text": "first sample", "obj/stats/words": 2, "obj/stats/chars": 12},
        {"text": "dos", "obj/stats/words": 1, "obj/stats/chars": 3},
    ],
    [
        {"text": "third one !", "obj/stats/words": 3, "obj/stats/chars": 11},
    ],
]
```
Parameters

| Parameter | Description |
|---|---|
| `key` | The key pattern used to determine the actual key to look up in the items. |

Returns

| Returns | Description |
|---|---|
| `Callable[[Iterable, int, bool, Literal["drop", "split"]], Iterable]` | The batching function. |