edsnlp.utils.filter

default_sort_key [source]

Returns the sort key for filtering spans.

Parameters

PARAMETER DESCRIPTION
span

Span to sort.

TYPE: Span

RETURNS DESCRIPTION
key

Sort key.

TYPE: Tuple[int, int]
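
The sort_key parameter of filter_spans is described as selecting the largest, leftmost spans first, with a higher key taking precedence. A key consistent with that description could look like the sketch below. FakeSpan is a hypothetical stand-in for spaCy's Span, and the exact tuple returned by the library is an assumption.

```python
from typing import NamedTuple, Tuple


class FakeSpan(NamedTuple):
    """Hypothetical stand-in for spacy.tokens.Span (token offsets only)."""

    start: int
    end: int


def sketch_sort_key(span: FakeSpan) -> Tuple[int, int]:
    # Assumed key: span length first, then negated start so that,
    # among equally long spans, the leftmost one gets the higher key.
    return span.end - span.start, -span.start


spans = [FakeSpan(4, 6), FakeSpan(0, 2), FakeSpan(1, 5)]

# Highest key first: the longest span, then the leftmost of the two short ones.
ranked = sorted(spans, key=sketch_sort_key, reverse=True)
print(ranked)
```

Sorting in descending order puts the four-token span first, then breaks the tie between the two-token spans in favour of the one that starts earlier.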

start_sort_key [source]

Returns the sort key for filtering spans by start order.

Parameters

PARAMETER DESCRIPTION
span

Span to sort.

TYPE: Span

RETURNS DESCRIPTION
key

Sort key.

TYPE: Tuple[int, int]

filter_spans [source]

Redefinition of spaCy's filtering function that can return the discarded spans as well as the filtered ones.

It also accepts a label_to_remove argument, intended for filtering out pseudo-cues. When it is set, only spans that overlap a span with the excluded label are removed, so the results can still contain overlapping spans.

It can handle an iterable of tuples instead of an iterable of Spans. The primary use case is the RegexMatcher, which can return each span together with its groupdict.

The spaCy documentation states:

Filter a sequence of spans and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with Retokenizer.merge. When spans overlap, the (first) longest span is preferred over shorter spans.

Filtering out spans

When the label_to_remove argument is supplied, it might be tempting to also filter out overlapping spans that do not carry the label to remove.

We keep all other, possibly overlapping, spans because in qualifier pipelines the same cue can both precede and follow a marked entity. Every occurrence therefore has to be kept.

Parameters

PARAMETER DESCRIPTION
spans

Spans to filter.

TYPE: Iterable[Union[Span, Tuple[Span, Any]]]

return_discarded

Whether to return discarded spans.

TYPE: bool DEFAULT: False

label_to_remove

Label to remove. If set, results can contain overlapping spans.

TYPE: str DEFAULT: None

sort_key

Key used to sort spans before resolving overlaps. A span with a higher key takes precedence over a span with a lower key. By default, the largest, leftmost spans are selected first.

TYPE: Callable[[Span], Any] DEFAULT: default_sort_key

RETURNS DESCRIPTION
results

Filtered spans

TYPE: List[Union[Span, Tuple[Span, Any]]]

discarded

Discarded spans. Only returned if return_discarded is True.

TYPE: (List[Union[Span, Tuple[Span, Any]]], optional)
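
The behaviour described above can be illustrated with a pure-Python sketch. It is not the library's implementation: FakeSpan, length_then_leftmost, by_start and sketch_filter_spans are hypothetical names, the default ordering is assumed from the sort_key description, tuple inputs are not handled, and the sketch always returns both lists (as if return_discarded were True).

```python
from typing import NamedTuple, Tuple


class FakeSpan(NamedTuple):
    """Hypothetical stand-in for spacy.tokens.Span."""

    start: int
    end: int
    label_: str


def length_then_leftmost(span: FakeSpan) -> Tuple[int, int]:
    # Assumed default ordering: longest first, leftmost on ties.
    return span.end - span.start, -span.start


def by_start(span: FakeSpan) -> Tuple[int, int]:
    return span.start, span.end


def sketch_filter_spans(spans, label_to_remove=None, sort_key=length_then_leftmost):
    kept, discarded, seen = [], [], set()
    # Spans with a higher key are visited first and win conflicts.
    for span in sorted(spans, key=sort_key, reverse=True):
        tokens = set(range(span.start, span.end))
        excluded = label_to_remove is not None and span.label_ == label_to_remove
        if tokens.isdisjoint(seen):
            if not excluded:
                kept.append(span)
            # Without label_to_remove, every kept span blocks its tokens.
            # With it, only spans carrying the excluded label do.
            if label_to_remove is None or excluded:
                seen.update(tokens)
        elif not excluded:
            discarded.append(span)
    return sorted(kept, key=by_start), sorted(discarded, key=by_start)


# Plain overlap resolution: the longest span wins.
entities = [FakeSpan(0, 3, "ent"), FakeSpan(2, 4, "ent"), FakeSpan(5, 6, "ent")]
kept, discarded = sketch_filter_spans(entities)

# With label_to_remove, only spans overlapping a "pseudo" span are dropped,
# so the two overlapping "cue" spans at the end both survive.
cues = [
    FakeSpan(0, 3, "pseudo"),
    FakeSpan(2, 4, "cue"),
    FakeSpan(5, 7, "cue"),
    FakeSpan(6, 8, "cue"),
]
kept_cues, discarded_cues = sketch_filter_spans(cues, label_to_remove="pseudo")
print(kept, discarded)
print(kept_cues, discarded_cues)
```

In the second call the pseudo-cue itself never appears in either list: it only serves to mask the tokens it covers.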

consume_spans [source]

Consume a list of spans according to a filter.

Warning

This method makes two strong assumptions:

  1. Spans are sorted.
  2. Spans are consumed in sequence and only once.

The second item is problematic for the way we treat long entities, hence the second_chance parameter, which lets entities be seen more than once.

Parameters

PARAMETER DESCRIPTION
spans

List of spans to filter

TYPE: List of spans

filter

Filtering function. Should return True when the item is to be included.

TYPE: Callable

second_chance

Optional list of spans to include again (useful for long entities).

TYPE: List of spans DEFAULT: None

RETURNS DESCRIPTION
matches

List of spans consumed by the filter.

TYPE: List of spans

remainder

List of remaining spans in the original spans parameter.

TYPE: List of spans
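
A minimal sketch of the consumption logic follows, assuming the filter is applied as a scan over the leading spans, which is what the two assumptions in the warning suggest. Spans are represented by hypothetical (start, end) tuples, and sketch_consume_spans and its keep argument are illustrative names, not the library's code.

```python
from typing import Callable, List, Optional, Tuple

Offsets = Tuple[int, int]  # hypothetical stand-in for a Span: (start, end)


def sketch_consume_spans(
    spans: List[Offsets],
    keep: Callable[[Offsets], bool],
    second_chance: Optional[List[Offsets]] = None,
) -> Tuple[List[Offsets], List[Offsets]]:
    # Spans seen earlier are included again if they still pass the filter.
    retained = [s for s in (second_chance or []) if keep(s)]
    # Consume the leading run of spans accepted by the filter.
    cut = 0
    while cut < len(spans) and keep(spans[cut]):
        cut += 1
    return spans[:cut] + retained, spans[cut:]


cues = [(1, 2), (4, 5), (9, 10), (12, 13)]

# First pass: take the cues that end within the first 8 tokens.
matches, remainder = sketch_consume_spans(cues, lambda s: s[1] <= 8)

# Second pass on the remainder: a long entity (6, 11) from the previous
# pass is offered again through second_chance.
matches2, remainder2 = sketch_consume_spans(
    remainder, lambda s: s[0] < 15, second_chance=[(6, 11)]
)
print(matches, remainder)
print(matches2, remainder2)
```

Because the scan stops at the first rejected span, unsorted input would leave matching spans stranded in the remainder, which is why the sorting assumption matters.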

get_spans [source]

Extracts spans with a given label. Prefer passing the label as its hash (an int) rather than as a string, for performance reasons.

Parameters

PARAMETER DESCRIPTION
spans

List of spans to filter.

TYPE: List[Span]

label

Label to filter on.

TYPE: Union[int, str]

RETURNS DESCRIPTION
List[Span]

Filtered spans.
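
The sketch below shows one way the int/str distinction could be handled. In spaCy, Span.label holds the integer hash and Span.label_ the string; FakeSpan, sketch_get_spans and the hash values 101 and 202 are hypothetical and only serve the illustration.

```python
from typing import List, NamedTuple, Union


class FakeSpan(NamedTuple):
    """Hypothetical stand-in for spacy.tokens.Span."""

    start: int
    end: int
    label: int  # integer hash of the label (made-up values here)
    label_: str  # human-readable label


def sketch_get_spans(spans: List[FakeSpan], label: Union[int, str]) -> List[FakeSpan]:
    if isinstance(label, int):
        # Hash comparison: plain integer equality.
        return [span for span in spans if span.label == label]
    # String comparison: relies on the textual label of each span.
    return [span for span in spans if span.label_ == label]


spans = [
    FakeSpan(0, 2, 101, "drug"),
    FakeSpan(3, 4, 202, "date"),
    FakeSpan(6, 8, 101, "drug"),
]

drugs = sketch_get_spans(spans, "drug")
dates = sketch_get_spans(spans, 202)
print(drugs, dates)
```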

span_f1 [source]

Computes the F1 overlap between two spans.

Parameters

PARAMETER DESCRIPTION
a

First span

TYPE: Span

b

Second span

TYPE: Span

RETURNS DESCRIPTION
float

F1 overlap
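
As an illustration, the F1 overlap of two spans can be computed as twice the size of their intersection divided by the sum of their lengths. The sketch below assumes lengths are counted in tokens and represents each span by a hypothetical half-open (start, end) tuple; sketch_span_f1 is not the library function.

```python
from typing import Tuple

Offsets = Tuple[int, int]  # hypothetical stand-in for a Span: (start, end)


def sketch_span_f1(a: Offsets, b: Offsets) -> float:
    # Number of tokens shared by both spans (0 if they are disjoint).
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    total = (a[1] - a[0]) + (b[1] - b[0])
    # Guard against two empty spans.
    return 2 * overlap / total if total else 0.0


partial = sketch_span_f1((0, 4), (2, 6))    # 2 shared tokens out of 4 + 4
identical = sketch_span_f1((3, 7), (3, 7))  # full overlap
disjoint = sketch_span_f1((0, 2), (5, 9))   # no shared token
print(partial, identical, disjoint)
```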

align_spans [source]

Aligns two lists of spans, by matching source spans that overlap target spans. This function is optimized to avoid quadratic complexity.

Parameters

PARAMETER DESCRIPTION
source_spans

List of spans to align.

TYPE: List[Span]

target_spans

List of spans to align the source spans against.

TYPE: List[Span]

sort_by_overlap

Whether to sort the aligned spans by maximum Dice/F1 overlap with the target span.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
List[List[Span]]

Subset of source spans for each target span
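
To make the expected output shape concrete, here is a deliberately naive reference sketch. It compares every source span with every target span, so it is quadratic, unlike the optimized library function. Spans are hypothetical (start, end) tuples, and sketch_align_spans and f1 are illustrative helpers.

```python
from typing import List, Tuple

Offsets = Tuple[int, int]  # hypothetical stand-in for a Span: (start, end)


def f1(a: Offsets, b: Offsets) -> float:
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return 2 * overlap / ((a[1] - a[0]) + (b[1] - b[0]))


def sketch_align_spans(
    source_spans: List[Offsets],
    target_spans: List[Offsets],
    sort_by_overlap: bool = False,
) -> List[List[Offsets]]:
    aligned = []
    for target in target_spans:
        # Two half-open intervals overlap when each starts before the other ends.
        matched = [s for s in source_spans if s[0] < target[1] and target[0] < s[1]]
        if sort_by_overlap:
            # Best-overlapping source span first.
            matched.sort(key=lambda s: f1(s, target), reverse=True)
        aligned.append(matched)
    return aligned


sources = [(0, 2), (1, 6), (7, 9)]
targets = [(1, 5), (8, 10), (11, 12)]

plain = sketch_align_spans(sources, targets)
ranked = sketch_align_spans(sources, targets, sort_by_overlap=True)
print(plain)
print(ranked)
```

The result has one list per target span, in the order of target_spans, and a target with no overlapping source span gets an empty list.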

get_span_group [source]

Get the spans of a span group that are contained inside a doclike object.

Parameters

PARAMETER DESCRIPTION
doclike

Doclike object to act as a mask.

TYPE: Union[Doc, Span]

group

Group name from which to get the spans.

TYPE: str

RETURNS DESCRIPTION
List[Span]

List of spans.
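
The masking behaviour can be sketched as a containment test on token offsets. In the sketch below, a plain dict plays the role of the document's span groups, spans and the mask are hypothetical (start, end) tuples, and a mask of None stands for the whole document. How the library treats a missing group is an assumption (an empty list here).

```python
from typing import Dict, List, Optional, Tuple

Offsets = Tuple[int, int]  # hypothetical stand-in for a Span: (start, end)


def sketch_get_span_group(
    span_groups: Dict[str, List[Offsets]],
    group: str,
    mask: Optional[Offsets] = None,
) -> List[Offsets]:
    spans = span_groups.get(group, [])  # assumed: unknown group yields no spans
    if mask is None:
        # Whole document: nothing to filter out.
        return list(spans)
    # Keep only spans fully contained in the mask.
    return [s for s in spans if s[0] >= mask[0] and s[1] <= mask[1]]


groups = {"dates": [(0, 2), (5, 7), (9, 12)]}

whole_doc = sketch_get_span_group(groups, "dates")
inside = sketch_get_span_group(groups, "dates", mask=(4, 10))
missing = sketch_get_span_group(groups, "drugs", mask=(4, 10))
print(whole_doc, inside, missing)
```

The span (9, 12) starts inside the mask but ends after it, so it is left out under the containment reading used here.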