
edsnlp.utils.batching

BatchSizeArg

Bases: Validated

Batch size argument validator / caster for confit/pydantic

Examples

def fn(batch_size: BatchSizeArg):
    return batch_size


print(fn("10 samples"))
# Out: (10, "samples")

print(fn("10 words"))
# Out: (10, "words")

print(fn(10))
# Out: (10, "samples")
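A minimal sketch of the parsing such a validator performs; the regex and the default "samples" unit are assumptions for illustration, not the library's code:

```python
import re

def parse_batch_size(value):
    # Sketch: split a "<number> <unit>" string into (int, unit);
    # bare integers are assumed to default to the "samples" unit.
    if isinstance(value, int):
        return value, "samples"
    m = re.fullmatch(r"(\d+)\s*(\w+)", str(value))
    if m is None:
        raise ValueError(f"Invalid batch size: {value!r}")
    return int(m.group(1)), m.group(2)

print(parse_batch_size("10 words"))  # (10, 'words')
print(parse_batch_size(10))          # (10, 'samples')
```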

batchify [source]

Yields batches that contain at most batch_size elements. If a single item counts for more than batch_size elements, it is yielded on its own as one batch.

Parameters

PARAMETER DESCRIPTION
iterable

The iterable to batchify

TYPE: Iterable[T]

batch_size

The maximum number of elements in a batch

TYPE: int

drop_last

Whether to drop the last batch if it is smaller than batch_size

TYPE: bool DEFAULT: False

sentinel_mode

How to handle the sentinel values in the iterable:

  • "drop": drop sentinel values
  • "keep": keep sentinel values inside the produced batches
  • "split": split batches at sentinel values and yield sentinel values separately

TYPE: Literal['drop', 'keep', 'split'] DEFAULT: 'drop'
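The count-based behaviour can be sketched as follows; this is a simplified stand-in that ignores sentinel handling, not the edsnlp implementation:

```python
def simple_batchify(iterable, batch_size, drop_last=False):
    # Sketch: accumulate items and yield a batch every batch_size elements.
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Yield the trailing partial batch unless drop_last is set.
    if batch and not drop_last:
        yield batch

print(list(simple_batchify(range(7), 3)))
# [[0, 1, 2], [3, 4, 5], [6]]
```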

batchify_by_length_sum [source]

Yields batches whose word counts sum to at most batch_size. If a single item contains more than batch_size words, it is yielded on its own as one batch.

Parameters

PARAMETER DESCRIPTION
iterable

The iterable to batchify

TYPE: Iterable[T]

batch_size

The maximum number of words in a batch

TYPE: int

drop_last

Whether to drop the last batch if it is smaller than batch_size

TYPE: bool DEFAULT: False

sentinel_mode

How to handle the sentinel values in the iterable:

  • "drop": drop sentinel values
  • "keep": keep sentinel values inside the produced batches
  • "split": split batches at sentinel values and yield sentinel values separately

TYPE: Literal['drop', 'keep', 'split'] DEFAULT: 'drop'

RETURNS DESCRIPTION
Iterable[List[T]]
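The sum-based behaviour can be sketched as follows; the `length` callback is a hypothetical stand-in for however word counts are obtained, and sentinel handling is omitted:

```python
def batchify_by_total(items, max_total, length):
    # Sketch: close the current batch when adding the next item would
    # push the running total over max_total. Oversized items are
    # yielded alone as their own batch.
    batch, total = [], 0
    for item in items:
        n = length(item)
        if batch and total + n > max_total:
            yield batch
            batch, total = [], 0
        batch.append(item)
        total += n
    if batch:
        yield batch

docs = ["a b", "c", "d e f", "g"]
word_count = lambda doc: len(doc.split())
print(list(batchify_by_total(docs, 4, word_count)))
# [['a b', 'c'], ['d e f', 'g']]
```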

batchify_by_padded [source]

Yields batches that contain at most batch_size padded words, i.e. the total number of words if every item in the batch were padded to the length of the longest item.

Parameters

PARAMETER DESCRIPTION
iterable

The iterable to batchify

TYPE: Iterable[T]

batch_size

The maximum number of padded words in a batch

TYPE: int

drop_last

Whether to drop the last batch if it is smaller than batch_size

TYPE: bool DEFAULT: False

sentinel_mode

How to handle the sentinel values in the iterable:

  • "drop": drop sentinel values
  • "keep": keep sentinel values inside the produced batches
  • "split": split batches at sentinel values and yield sentinel values separately

TYPE: Literal['drop', 'keep', 'split'] DEFAULT: 'drop'

RETURNS DESCRIPTION
Iterable[List[T]]
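The padded cost rule can be sketched as: a batch of n items whose longest item has length m costs n * m padded words. A simplified illustration, again with a hypothetical `length` callback and no sentinel handling:

```python
def batchify_by_padded_sketch(items, max_padded, length):
    # Sketch: the cost of a batch is len(batch) * max item length,
    # i.e. the total word count after padding every item to the longest.
    batch, max_len = [], 0
    for item in items:
        n = length(item)
        new_max = max(max_len, n)
        if batch and (len(batch) + 1) * new_max > max_padded:
            yield batch
            batch, max_len = [], 0
            new_max = n
        batch.append(item)
        max_len = new_max
    if batch:
        yield batch

# Items stand in for sequences; their "length" is the value itself.
print(list(batchify_by_padded_sketch([2, 3, 5, 1], 6, lambda n: n)))
# [[2, 3], [5], [1]]
```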

batchify_by_dataset [source]

Yields batches grouped by dataset: all items belonging to the same dataset are yielded together as one batch, regardless of batch_size.

Parameters

PARAMETER DESCRIPTION
iterable

The iterable to batchify

TYPE: Iterable[T]

batch_size

Unused: each batch always contains one full dataset.

TYPE: Optional[int] DEFAULT: None

drop_last

Whether to drop the last batch if it is smaller than batch_size

TYPE: bool DEFAULT: False

sentinel_mode

How to handle the sentinel values in the iterable:

  • "drop": drop sentinel values
  • "keep": keep sentinel values inside the produced batches
  • "split": split batches at sentinel values and yield sentinel values separately

TYPE: Literal['drop', 'keep', 'split'] DEFAULT: 'drop'

RETURNS DESCRIPTION
Iterable[List[T]]
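Assuming items carry a hypothetical `dataset_id` field (edsnlp tracks the originating dataset internally; this field is purely illustrative), the grouping amounts to:

```python
from itertools import groupby

items = [
    {"text": "a", "dataset_id": 0},
    {"text": "b", "dataset_id": 0},
    {"text": "c", "dataset_id": 1},
]
# Group consecutive items that share the same originating dataset.
batches = [list(group) for _, group in groupby(items, key=lambda x: x["dataset_id"])]
print([[item["text"] for item in batch] for batch in batches])
# [['a', 'b'], ['c']]
```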

batchify_by_fragment [source]

Yields batches grouped by fragment: all items belonging to the same fragment are yielded together as one batch, regardless of batch_size.

Parameters

PARAMETER DESCRIPTION
iterable

The iterable to batchify

TYPE: Iterable[T]

batch_size

Unused: each batch always contains one full fragment.

TYPE: Optional[int] DEFAULT: None

drop_last

Whether to drop the last batch if it is smaller than batch_size

TYPE: bool DEFAULT: False

sentinel_mode

How to handle the sentinel values in the iterable:

  • "drop": drop sentinel values
  • "keep": keep sentinel values inside the produced batches
  • "split": split batches at sentinel values and yield sentinel values separately

TYPE: Literal['drop', 'keep', 'split'] DEFAULT: 'drop'

RETURNS DESCRIPTION
Iterable[List[T]]

stat_batchify [source]

Create a batching function that uses the value of a specific key in the items to determine the batch size. This function is primarily meant to be used on the flattened outputs of the preprocess method of a Pipeline object.

It expects each item to be a dictionary in which at least one key contains both the string "/stats/" and the key pattern. For instance:

from edsnlp.utils.batching import stat_batchify

items = [
    {"text": "first sample", "obj/stats/words": 2, "obj/stats/chars": 12},
    {"text": "dos", "obj/stats/words": 1, "obj/stats/chars": 3},
    {"text": "third one !", "obj/stats/words": 3, "obj/stats/chars": 11},
]
batcher = stat_batchify("words")
assert list(batcher(items, 4)) == [
    [
        {"text": "first sample", "obj/stats/words": 2, "obj/stats/chars": 12},
        {"text": "dos", "obj/stats/words": 1, "obj/stats/chars": 3},
    ],
    [
        {"text": "third one !", "obj/stats/words": 3, "obj/stats/chars": 11},
    ],
]
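The key lookup and running sum can be sketched as follows; this is a simplified stand-in that ignores drop_last and sentinel handling, not the library's implementation:

```python
def stat_batchify_sketch(pattern):
    # Sketch: find the first key containing "/stats/" and the pattern,
    # then close the batch whenever the running sum of that stat
    # would exceed batch_size.
    def batcher(items, batch_size):
        batch, total, key = [], 0, None
        for item in items:
            if key is None:
                key = next(k for k in item if "/stats/" in k and pattern in k)
            n = item[key]
            if batch and total + n > batch_size:
                yield batch
                batch, total = [], 0
            batch.append(item)
            total += n
        if batch:
            yield batch
    return batcher

items = [
    {"text": "first sample", "obj/stats/words": 2},
    {"text": "dos", "obj/stats/words": 1},
    {"text": "third one !", "obj/stats/words": 3},
]
print([[i["text"] for i in b] for b in stat_batchify_sketch("words")(items, 4)])
# [['first sample', 'dos'], ['third one !']]
```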

Parameters

PARAMETER DESCRIPTION
key

The key pattern to use to determine the actual key to look up in the items.

RETURNS DESCRIPTION
Callable[[Iterable, int, bool, Literal["drop", "split"]], Iterable]