
edsnlp.utils.batching

BatchSizeArg

Bases: Validated

Batch size argument validator / caster for confit/pydantic

Examples

def fn(batch_size: BatchSizeArg):
    return batch_size


print(fn("10 samples"))
# Out: (10, "samples")

print(fn("10 words"))
# Out: (10, "words")

print(fn(10))
# Out: (10, "samples")
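A minimal sketch of the parsing such a validator performs; the regex and the default "samples" unit are assumptions for illustration, not the library's code:

```python
import re

def parse_batch_size(value):
    # Sketch: split a "<number> <unit>" string into (int, unit);
    # bare integers are assumed to default to the "samples" unit.
    if isinstance(value, int):
        return value, "samples"
    m = re.fullmatch(r"(\d+)\s*(\w+)", str(value))
    if m is None:
        raise ValueError(f"Invalid batch size: {value!r}")
    return int(m.group(1)), m.group(2)

print(parse_batch_size("10 words"))  # (10, 'words')
print(parse_batch_size(10))          # (10, 'samples')
```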

batchify [source]

Yields batches that contain at most batch_size elements. If a single item counts for more than batch_size elements, it is yielded on its own as one batch.

Parameters

PARAMETER DESCRIPTION
iterable

The iterable to batchify

TYPE: Iterable[T]

batch_size

The maximum number of elements in a batch

TYPE: int

drop_last

Whether to drop the last batch if it is smaller than batch_size

TYPE: bool DEFAULT: False

sentinel_mode

How to handle the sentinel values in the iterable:

  • "drop": drop sentinel values
  • "keep": keep sentinel values inside the produced batches
  • "split": split batches at sentinel values and yield sentinel values separately

TYPE: Literal['drop', 'keep', 'split'] DEFAULT: 'drop'
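The count-based behaviour can be sketched as follows; this is a simplified stand-in that ignores sentinel handling, not the edsnlp implementation:

```python
def simple_batchify(iterable, batch_size, drop_last=False):
    # Sketch: accumulate items and yield a batch every batch_size elements.
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Yield the trailing partial batch unless drop_last is set.
    if batch and not drop_last:
        yield batch

print(list(simple_batchify(range(7), 3)))
# [[0, 1, 2], [3, 4, 5], [6]]
```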

batchify_by_length_sum [source]

Yields batches whose word counts sum to at most batch_size. If a single item contains more than batch_size words, it is yielded on its own as one batch.

Parameters

PARAMETER DESCRIPTION
iterable

The iterable to batchify

TYPE: Iterable[T]

batch_size

The maximum number of words in a batch

TYPE: int

drop_last

Whether to drop the last batch if it is smaller than batch_size

TYPE: bool DEFAULT: False

sentinel_mode

How to handle the sentinel values in the iterable:

  • "drop": drop sentinel values
  • "keep": keep sentinel values inside the produced batches
  • "split": split batches at sentinel values and yield sentinel values separately

TYPE: Literal['drop', 'keep', 'split'] DEFAULT: 'drop'

RETURNS DESCRIPTION
Iterable[List[T]]
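The sum-based behaviour can be sketched as follows; the `length` callback is a hypothetical stand-in for however word counts are obtained, and sentinel handling is omitted:

```python
def batchify_by_total(items, max_total, length):
    # Sketch: close the current batch when adding the next item would
    # push the running total over max_total. Oversized items are
    # yielded alone as their own batch.
    batch, total = [], 0
    for item in items:
        n = length(item)
        if batch and total + n > max_total:
            yield batch
            batch, total = [], 0
        batch.append(item)
        total += n
    if batch:
        yield batch

docs = ["a b", "c", "d e f", "g"]
word_count = lambda doc: len(doc.split())
print(list(batchify_by_total(docs, 4, word_count)))
# [['a b', 'c'], ['d e f', 'g']]
```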

batchify_by_padded [source]

Yields batches that contain at most batch_size padded words, i.e. the total number of words if every item in the batch were padded to the length of the longest item.

Parameters

PARAMETER DESCRIPTION
iterable

The iterable to batchify

TYPE: Iterable[T]

batch_size

The maximum number of padded words in a batch

TYPE: int

drop_last

Whether to drop the last batch if it is smaller than batch_size

TYPE: bool DEFAULT: False

sentinel_mode

How to handle the sentinel values in the iterable:

  • "drop": drop sentinel values
  • "keep": keep sentinel values inside the produced batches
  • "split": split batches at sentinel values and yield sentinel values separately

TYPE: Literal['drop', 'keep', 'split'] DEFAULT: 'drop'

RETURNS DESCRIPTION
Iterable[List[T]]
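The padded cost rule can be sketched as: a batch of n items whose longest item has length m costs n * m padded words. A simplified illustration, again with a hypothetical `length` callback and no sentinel handling:

```python
def batchify_by_padded_sketch(items, max_padded, length):
    # Sketch: the cost of a batch is len(batch) * max item length,
    # i.e. the total word count after padding every item to the longest.
    batch, max_len = [], 0
    for item in items:
        n = length(item)
        new_max = max(max_len, n)
        if batch and (len(batch) + 1) * new_max > max_padded:
            yield batch
            batch, max_len = [], 0
            new_max = n
        batch.append(item)
        max_len = new_max
    if batch:
        yield batch

# Items stand in for sequences; their "length" is the value itself.
print(list(batchify_by_padded_sketch([2, 3, 5, 1], 6, lambda n: n)))
# [[2, 3], [5], [1]]
```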

batchify_by_dataset [source]

Yields batches grouped by dataset: all items belonging to the same dataset are yielded together as one batch, regardless of batch_size.

Parameters

PARAMETER DESCRIPTION
iterable

The iterable to batchify

TYPE: Iterable[T]

batch_size

Unused: each batch always contains one full dataset.

TYPE: Optional[int] DEFAULT: None

drop_last

Whether to drop the last batch if it is smaller than batch_size

TYPE: bool DEFAULT: False

sentinel_mode

How to handle the sentinel values in the iterable:

  • "drop": drop sentinel values
  • "keep": keep sentinel values inside the produced batches
  • "split": split batches at sentinel values and yield sentinel values separately

TYPE: Literal['drop', 'keep', 'split'] DEFAULT: 'drop'

RETURNS DESCRIPTION
Iterable[List[T]]
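Assuming items carry a hypothetical `dataset_id` field (edsnlp tracks the originating dataset internally; this field is purely illustrative), the grouping amounts to:

```python
from itertools import groupby

items = [
    {"text": "a", "dataset_id": 0},
    {"text": "b", "dataset_id": 0},
    {"text": "c", "dataset_id": 1},
]
# Group consecutive items that share the same originating dataset.
batches = [list(group) for _, group in groupby(items, key=lambda x: x["dataset_id"])]
print([[item["text"] for item in batch] for batch in batches])
# [['a', 'b'], ['c']]
```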

batchify_by_fragment [source]

Yields batches grouped by fragment: all items belonging to the same fragment are yielded together as one batch, regardless of batch_size.

Parameters

PARAMETER DESCRIPTION
iterable

The iterable to batchify

TYPE: Iterable[T]

batch_size

Unused: each batch always contains one full fragment.

TYPE: Optional[int] DEFAULT: None

drop_last

Whether to drop the last batch if it is smaller than batch_size

TYPE: bool DEFAULT: False

sentinel_mode

How to handle the sentinel values in the iterable:

  • "drop": drop sentinel values
  • "keep": keep sentinel values inside the produced batches
  • "split": split batches at sentinel values and yield sentinel values separately

TYPE: Literal['drop', 'keep', 'split'] DEFAULT: 'drop'

RETURNS DESCRIPTION
Iterable[List[T]]

stat_batchify [source]

Create a batching function that uses the value of a specific key in the items to determine the batch size. This function is primarily meant to be used on the flattened outputs of the preprocess method of a Pipeline object.

It expects each item to be a dictionary in which at least one key contains both the string "/stats/" and the key pattern. For instance:

from edsnlp.utils.batching import stat_batchify

items = [
    {"text": "first sample", "obj/stats/words": 2, "obj/stats/chars": 12},
    {"text": "dos", "obj/stats/words": 1, "obj/stats/chars": 3},
    {"text": "third one !", "obj/stats/words": 3, "obj/stats/chars": 11},
]
batcher = stat_batchify("words")
assert list(batcher(items, 4)) == [
    [
        {"text": "first sample", "obj/stats/words": 2, "obj/stats/chars": 12},
        {"text": "dos", "obj/stats/words": 1, "obj/stats/chars": 3},
    ],
    [
        {"text": "third one !", "obj/stats/words": 3, "obj/stats/chars": 11},
    ],
]
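The key lookup and running sum can be sketched as follows; this is a simplified stand-in that ignores drop_last and sentinel handling, not the library's implementation:

```python
def stat_batchify_sketch(pattern):
    # Sketch: find the first key containing "/stats/" and the pattern,
    # then close the batch whenever the running sum of that stat
    # would exceed batch_size.
    def batcher(items, batch_size):
        batch, total, key = [], 0, None
        for item in items:
            if key is None:
                key = next(k for k in item if "/stats/" in k and pattern in k)
            n = item[key]
            if batch and total + n > batch_size:
                yield batch
                batch, total = [], 0
            batch.append(item)
            total += n
        if batch:
            yield batch
    return batcher

items = [
    {"text": "first sample", "obj/stats/words": 2},
    {"text": "dos", "obj/stats/words": 1},
    {"text": "third one !", "obj/stats/words": 3},
]
print([[i["text"] for i in b] for b in stat_batchify_sketch("words")(items, 4)])
# [['first sample', 'dos'], ['third one !']]
```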

Parameters

PARAMETER DESCRIPTION
key

The key pattern to use to determine the actual key to look up in the items.

RETURNS DESCRIPTION
Callable[[Iterable, int, bool, Literal["drop", "split"]], Iterable]