edsnlp.utils.batching
BatchSizeArg
Bases: Validated
Batch size argument validator / caster for confit/pydantic
Examples
def fn(batch_size: BatchSizeArg):
return batch_size
print(fn("10 samples"))
# Out: (10, "samples")
print(fn("10 words"))
# Out: (10, "words")
print(fn(10))
# Out: (10, "samples")
batchify [source]
Yields batches that contain at most batch_size elements. If an item contains more than batch_size elements, it will be yielded as a single batch.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| iterable | The iterable to batchify. TYPE: Iterable[T] |
| batch_size | The maximum number of elements in a batch. TYPE: int |
| drop_last | Whether to drop the last batch if it is smaller than batch_size. TYPE: bool |
| sentinel_mode | How to handle the sentinel values in the iterable. TYPE: Literal["drop", "split"] |
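Ignoring sentinel handling, the core behavior described above can be sketched as follows (batchify_sketch is an illustrative name, not the library function):

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batchify_sketch(
    iterable: Iterable[T], batch_size: int, drop_last: bool = False
) -> Iterator[List[T]]:
    # Accumulate items until the batch reaches batch_size, then yield it.
    batch: List[T] = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Yield the trailing, smaller batch unless drop_last is set.
    if batch and not drop_last:
        yield batch
```

For example, batching range(7) with batch_size=3 yields two full batches of three elements and a trailing batch of one, which drop_last=True would discard.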
batchify_by_length_sum [source]
Yields batches that contain at most batch_size words. If an item contains more than batch_size words, it will be yielded as a single batch.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| iterable | The iterable to batchify. TYPE: Iterable[T] |
| batch_size | The maximum number of words in a batch. TYPE: int |
| drop_last | Whether to drop the last batch if it is smaller than batch_size. TYPE: bool |
| sentinel_mode | How to handle the sentinel values in the iterable. TYPE: Literal["drop", "split"] |

| RETURNS | DESCRIPTION |
|---|---|
| Iterable[List[T]] | An iterable of batches, each a list of items |
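The length-sum strategy can be sketched as below, assuming each item's word count is simply its len() (the real function may measure items differently); an item larger than the budget ends up alone in its own batch, as described above:

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T", bound=Sequence)


def batchify_by_length_sum_sketch(
    iterable: "Iterator[T] | List[T]", batch_size: int, drop_last: bool = False
) -> Iterator[List[T]]:
    batch: List[T] = []
    total = 0  # running sum of item lengths in the current batch
    for item in iterable:
        n = len(item)
        # Close the current batch when adding this item would exceed the
        # budget; an oversized item is therefore yielded as a single batch.
        if batch and total + n > batch_size:
            yield batch
            batch, total = [], 0
        batch.append(item)
        total += n
    if batch and not drop_last:
        yield batch
```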
batchify_by_padded [source]
Yields batches that contain at most batch_size padded words, i.e. the total number of words if all items in the batch were padded to the length of its longest item.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| iterable | The iterable to batchify. TYPE: Iterable[T] |
| batch_size | The maximum number of padded words in a batch. TYPE: int |
| drop_last | Whether to drop the last batch if it is smaller than batch_size. TYPE: bool |
| sentinel_mode | How to handle the sentinel values in the iterable. TYPE: Literal["drop", "split"] |

| RETURNS | DESCRIPTION |
|---|---|
| Iterable[List[T]] | An iterable of batches, each a list of items |
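The padded-cost rule (longest item length times number of items) can be sketched as follows, again assuming len() as the word count; this is an illustration of the budget computation, not the library implementation:

```python
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T", bound=Sequence)


def batchify_by_padded_sketch(
    iterable: Iterable[T], batch_size: int, drop_last: bool = False
) -> Iterator[List[T]]:
    batch: List[T] = []
    max_len = 0  # length of the longest item in the current batch
    for item in iterable:
        new_max = max(max_len, len(item))
        # Padded cost = longest item length x number of items after adding.
        if batch and new_max * (len(batch) + 1) > batch_size:
            yield batch
            batch, max_len = [item], len(item)
        else:
            batch.append(item)
            max_len = new_max
    if batch and not drop_last:
        yield batch
```

With a budget of 6, items of lengths 2 and 3 fit together (padded cost 3 × 2 = 6), but adding a third item would cost 3 × 3 = 9 and starts a new batch.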
batchify_by_dataset [source]
Yields batches of items grouped by dataset. The batch_size argument is unused: each batch always contains one full dataset.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| iterable | The iterable to batchify. TYPE: Iterable[T] |
| batch_size | Unused, always 1 full dataset per batch. TYPE: int |
| drop_last | Whether to drop the last batch if it is smaller than batch_size. TYPE: bool |
| sentinel_mode | How to handle the sentinel values in the iterable. TYPE: Literal["drop", "split"] |

| RETURNS | DESCRIPTION |
|---|---|
| Iterable[List[T]] | An iterable of batches, each a list of items |
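The grouping behavior can be illustrated with itertools.groupby, assuming a hypothetical "dataset" key marks each item's source dataset (the actual marker used by the library may differ):

```python
from itertools import groupby
from typing import Dict, Iterable, Iterator, List


def batchify_by_dataset_sketch(
    iterable: Iterable[Dict], batch_size: int = 1, drop_last: bool = False
) -> Iterator[List[Dict]]:
    # batch_size is ignored: each batch is one full dataset.
    # "dataset" is a hypothetical marker key, used here for illustration only.
    for _, group in groupby(iterable, key=lambda item: item["dataset"]):
        yield list(group)
```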
batchify_by_fragment [source]
Yields batches of items grouped by fragment. The batch_size argument is unused: each batch always contains one full fragment.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| iterable | The iterable to batchify. TYPE: Iterable[T] |
| batch_size | Unused, always 1 full fragment per batch. TYPE: int |
| drop_last | Whether to drop the last batch if it is smaller than batch_size. TYPE: bool |
| sentinel_mode | How to handle the sentinel values in the iterable. TYPE: Literal["drop", "split"] |

| RETURNS | DESCRIPTION |
|---|---|
| Iterable[List[T]] | An iterable of batches, each a list of items |
stat_batchify [source]
Create a batching function that uses the value of a specific key in the items to determine the batch size. This function is primarily meant to be used on the flattened outputs of the preprocess method of a Pipeline object.
It expects each item to be a dictionary in which some keys contain both the string "/stats/" and the key pattern. For instance:
from edsnlp.utils.batching import stat_batchify
items = [
{"text": "first sample", "obj/stats/words": 2, "obj/stats/chars": 12},
{"text": "dos", "obj/stats/words": 1, "obj/stats/chars": 3},
{"text": "third one !", "obj/stats/words": 3, "obj/stats/chars": 11},
]
batcher = stat_batchify("words")
assert list(batcher(items, 4)) == [
[
{"text": "first sample", "obj/stats/words": 2, "obj/stats/chars": 12},
{"text": "dos", "obj/stats/words": 1, "obj/stats/chars": 3},
],
[
{"text": "third one !", "obj/stats/words": 3, "obj/stats/chars": 11},
],
]
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| key | The key pattern used to determine the actual key to look up in the items. TYPE: str |

| RETURNS | DESCRIPTION |
|---|---|
| Callable[[Iterable, int, bool, Literal["drop", "split"]], Iterable] | A batching function |
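A minimal sketch of such a factory, assuming each item stores its stat under a key containing "/stats/" and the pattern (as in the example above), and ignoring sentinel handling:

```python
from typing import Callable, Dict, Iterable, Iterator, List


def stat_batchify_sketch(key: str) -> Callable[..., Iterator[List[Dict]]]:
    def batcher(
        items: Iterable[Dict], batch_size: int, drop_last: bool = False
    ) -> Iterator[List[Dict]]:
        batch: List[Dict] = []
        total = 0  # running sum of the matched stat in the current batch
        for item in items:
            # Look up the first key containing "/stats/" and the pattern.
            value = next(
                v for k, v in item.items() if "/stats/" in k and key in k
            )
            if batch and total + value > batch_size:
                yield batch
                batch, total = [], 0
            batch.append(item)
            total += value
        if batch and not drop_last:
            yield batch

    return batcher
```

Applied to the items from the example above with a budget of 4 words, this sketch reproduces the two batches shown there.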