edsnlp.utils.filter
default_sort_key
Returns the sort key for filtering spans.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| span | Span to sort. |

| RETURNS | DESCRIPTION |
|---|---|
| key | Sort key. |
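As a rough illustration of how such a key can rank spans, here is a minimal sketch on plain tuples. `ToySpan` and `toy_default_sort_key` are hypothetical names, and the key itself (length first, then leftmost start) is an assumption drawn from the `sort_key` description of filter_spans ("the largest, leftmost spans are selected first"), not the library's actual code.

```python
from typing import NamedTuple, Tuple


class ToySpan(NamedTuple):
    """Hypothetical stand-in for a spaCy Span (token offsets, end exclusive)."""

    start: int
    end: int


def toy_default_sort_key(span: ToySpan) -> Tuple[int, int]:
    # Assumption: longer spans rank higher; ties go to the leftmost span.
    return (span.end - span.start, -span.start)


spans = [ToySpan(4, 6), ToySpan(0, 2), ToySpan(1, 5)]

# Highest key first: the 4-token span, then the two 2-token spans, left to right.
ranked = sorted(spans, key=toy_default_sort_key, reverse=True)
print(ranked)
```

Sorting in descending key order puts the span that should win an overlap conflict at the front of the list.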
start_sort_key
Returns the sort key for filtering spans by start order.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| span | Span to sort. |

| RETURNS | DESCRIPTION |
|---|---|
| key | Sort key. |
filter_spans
Re-definition of spaCy's filtering function that can return the discarded spans as well as the filtered ones.
It also accepts a label_to_remove argument, whose main expected use case is filtering out pseudo-cues. If it is set, the results can contain overlapping spans: only spans that overlap a span carrying the excluded label are removed.
It can handle an iterable of tuples instead of an iterable of Spans. The primary use case is the RegexMatcher's ability to return a span's groupdict.
The spaCy documentation states:
Filter a sequence of spans and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with Retokenizer.merge. When spans overlap, the (first) longest span is preferred over shorter spans.
Filtering out spans
When the label_to_remove argument is supplied, it might be tempting to also filter out overlapping spans that do not carry the label to remove.
We keep all other spans, overlapping or not, because in qualifier pipelines the same cue can both precede and follow a marked entity. Every occurrence must therefore be kept.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| spans | Spans to filter. |
| return_discarded | Whether to return discarded spans. |
| label_to_remove | Label to remove. If set, results can contain overlapping spans. |
| sort_key | Key used to sort spans before resolving overlap conflicts. A span with a higher key takes precedence over another span. By default, the largest, leftmost spans are selected first. |

| RETURNS | DESCRIPTION |
|---|---|
| results | Filtered spans. |
| discarded | Discarded spans. |
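The behaviour described above can be sketched on plain tuples. This is an illustrative approximation, not the library's implementation: `ToySpan` and `toy_filter_spans` are hypothetical names, the sort key is assumed to be (length, leftmost start), and the choice to drop spans carrying the excluded label from both returned lists is an assumption made for the sketch.

```python
from typing import List, NamedTuple, Optional, Tuple


class ToySpan(NamedTuple):
    """Hypothetical stand-in for a labelled spaCy Span (end exclusive)."""

    start: int
    end: int
    label: str


def toy_filter_spans(
    spans: List[ToySpan], label_to_remove: Optional[str] = None
) -> Tuple[List[ToySpan], List[ToySpan]]:
    # Assumed precedence: longest span first, then leftmost.
    ordered = sorted(spans, key=lambda s: (s.end - s.start, -s.start), reverse=True)
    kept: List[ToySpan] = []
    discarded: List[ToySpan] = []

    if label_to_remove is None:
        # Plain overlap resolution: a span is kept only if its tokens are free.
        seen = set()
        for span in ordered:
            tokens = set(range(span.start, span.end))
            if tokens.isdisjoint(seen):
                kept.append(span)
                seen |= tokens
            else:
                discarded.append(span)
    else:
        # Only tokens covered by the excluded label can knock out other spans.
        blocked = set()
        for span in ordered:
            if span.label == label_to_remove:
                blocked.update(range(span.start, span.end))
        for span in ordered:
            if span.label == label_to_remove:
                continue  # assumption: excluded spans appear in neither list
            if blocked.isdisjoint(range(span.start, span.end)):
                kept.append(span)
            else:
                discarded.append(span)

    kept.sort(key=lambda s: (s.start, s.end))
    return kept, discarded


spans = [
    ToySpan(0, 3, "cue"),
    ToySpan(1, 2, "cue"),
    ToySpan(5, 7, "cue"),
    ToySpan(6, 7, "pseudo"),
]
plain_kept, plain_discarded = toy_filter_spans(spans)
pseudo_kept, pseudo_discarded = toy_filter_spans(spans, label_to_remove="pseudo")
print(plain_kept, plain_discarded)
print(pseudo_kept, pseudo_discarded)
```

In the second call the two overlapping "cue" spans at the start both survive, while the "cue" span that overlaps the pseudo-cue is discarded, which matches the overlap caveat stated for label_to_remove.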
consume_spans
Consumes a list of spans according to a filter.
Warning
This method makes the strong assumptions that:
- Spans are sorted.
- Spans are consumed in sequence and only once.
The second assumption is problematic for the way we treat long entities, hence the second_chance parameter, which lets entities be seen more than once.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| spans | List of spans to filter. |
| filter | Filtering function. Should return True when the item is to be included. |
| second_chance | Optional list of spans to include again (useful for long entities), by default None. |

| RETURNS | DESCRIPTION |
|---|---|
| matches | List of spans consumed by the filter. |
| remainder | List of remaining spans in the original |
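A minimal sketch of this consumption pattern, under the two assumptions listed in the warning. The names `toy_consume_spans` and `keep` are hypothetical, spans are plain `(start, end)` tuples, and stopping at the first rejected span is an assumption that follows from the "sorted and consumed in sequence" hypothesis rather than a confirmed detail of the library.

```python
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) token offsets


def toy_consume_spans(
    spans: List[Span],
    keep: Callable[[Span], bool],
    second_chance: Optional[List[Span]] = None,
) -> Tuple[List[Span], List[Span]]:
    # Spans given a second chance are re-tested against the filter.
    extra = [s for s in (second_chance or []) if keep(s)]

    # Because spans are sorted, consumption stops at the first rejected span.
    cut = 0
    while cut < len(spans) and keep(spans[cut]):
        cut += 1

    matches = spans[:cut] + extra
    remainder = spans[cut:]
    return matches, remainder


cues = [(0, 1), (3, 4), (8, 9), (12, 13)]

# First window covers tokens [0, 5): the first two cues are consumed.
first, rest = toy_consume_spans(cues, lambda s: s[1] <= 5)

# Second window covers tokens up to 10; (3, 4) is offered again.
second, rest2 = toy_consume_spans(rest, lambda s: s[1] <= 10, second_chance=[(3, 4)])
print(first, rest)
print(second, rest2)
```

Each call hands back the remainder so that the next window only scans spans that have not been consumed yet.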
get_spans
Extracts the spans that carry a given label. Prefer passing the label as its hash rather than as a string, for performance reasons.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| spans | List of spans to filter. |
| label | Label to filter on. |

| RETURNS | DESCRIPTION |
|---|---|
| List[Span] | Filtered spans. |
span_f1
Computes the F1 overlap between two spans.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| a | First span. |
| b | Second span. |

| RETURNS | DESCRIPTION |
|---|---|
| float | F1 overlap. |
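For intuition, the F1 (Dice) overlap of two intervals is twice the size of their intersection divided by the sum of their lengths. The sketch below applies that formula to plain `(start, end)` tuples; `toy_span_f1` is a hypothetical name, and whether the library measures spans in tokens or characters is not stated here, so the unit is left abstract.

```python
from typing import Tuple

Span = Tuple[int, int]  # hypothetical (start, end) offsets, end exclusive


def toy_span_f1(a: Span, b: Span) -> float:
    # Size of the intersection, clamped at zero for disjoint spans.
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    total = (a[1] - a[0]) + (b[1] - b[0])
    # Guard against two empty spans.
    return 2 * overlap / total if total else 0.0


half = toy_span_f1((0, 4), (2, 6))      # 2 shared units out of 4 + 4
same = toy_span_f1((3, 7), (3, 7))      # identical spans
apart = toy_span_f1((0, 2), (5, 8))     # no shared unit
print(half, same, apart)
```

The score is symmetric in its arguments, equals 1 only for identical spans, and is 0 as soon as the spans do not intersect.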
align_spans
Aligns two lists of spans by matching each target span with the source spans that overlap it. This function is optimized to avoid quadratic complexity.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| source_spans | List of spans to align. |
| target_spans | List of spans to align. |
| sort_by_overlap | Whether to sort the aligned spans by maximum Dice/F1 overlap with the target span. |

| RETURNS | DESCRIPTION |
|---|---|
| List[List[Span]] | Subset of |
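One way to avoid comparing every source span with every target span is to sort both lists and sweep through them together. The sketch below shows that idea on plain `(start, end)` tuples. It is an assumption about the general technique, not the library's code: `toy_align_spans` and `overlap_f1` are hypothetical names, and the output shape (one list of overlapping source spans per target, in the order the targets were given) is inferred from the `List[List[Span]]` return type.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) offsets, end exclusive


def overlap_f1(a: Span, b: Span) -> float:
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    total = (a[1] - a[0]) + (b[1] - b[0])
    return 2 * overlap / total if total else 0.0


def toy_align_spans(
    source_spans: List[Span], target_spans: List[Span], sort_by_overlap: bool = False
) -> List[List[Span]]:
    sources = sorted(source_spans)
    # Visit targets in sorted order, but write results at their original index.
    order = sorted(range(len(target_spans)), key=lambda i: target_spans[i])
    aligned: List[List[Span]] = [[] for _ in target_spans]

    lo = 0  # sources before `lo` end before every remaining target starts
    for idx in order:
        t_start, t_end = target_spans[idx]
        while lo < len(sources) and sources[lo][1] <= t_start:
            lo += 1
        i = lo
        # Sources starting at or after the target's end cannot overlap it.
        while i < len(sources) and sources[i][0] < t_end:
            if sources[i][1] > t_start:
                aligned[idx].append(sources[i])
            i += 1
        if sort_by_overlap:
            target = target_spans[idx]
            aligned[idx].sort(key=lambda s: overlap_f1(s, target), reverse=True)
    return aligned


sources = [(0, 2), (1, 6), (7, 9)]
targets = [(5, 9), (0, 3)]
by_position = toy_align_spans(sources, targets)
by_overlap = toy_align_spans(sources, targets, sort_by_overlap=True)
print(by_position)
print(by_overlap)
```

The sweep is close to linear after sorting when spans are short and spread out; a very long source span near the start can still force repeated checks, so this sketch only avoids the quadratic cost in the common case.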
get_span_group
Get the spans of a span group that are contained inside a doclike object.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| doclike | Doclike object to act as a mask. |
| group | Group name from which to get the spans. |

| RETURNS | DESCRIPTION |
|---|---|
| List[Span] | List of spans. |
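The masking idea can be sketched without spaCy: treat the doclike object as a `(start, end)` window and keep only the spans of the named group that lie entirely inside it. `toy_get_span_group` is a hypothetical name, the dictionary stands in for a document's span groups, and reading "contained" as full containment (rather than partial overlap) is an assumption.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) token offsets, end exclusive


def toy_get_span_group(
    window: Span, groups: Dict[str, List[Span]], group: str
) -> List[Span]:
    w_start, w_end = window
    # Keep spans fully inside the window; an unknown group yields no spans.
    return [
        (start, end)
        for (start, end) in groups.get(group, [])
        if start >= w_start and end <= w_end
    ]


groups = {"dates": [(0, 2), (4, 6), (9, 12)]}
inside = toy_get_span_group((3, 10), groups, "dates")
missing = toy_get_span_group((3, 10), groups, "drugs")
print(inside, missing)
```

Here only `(4, 6)` is returned: `(0, 2)` ends before the window starts and `(9, 12)` runs past its end.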