edsnlp.utils.filter
default_sort_key
Returns the sort key for filtering spans.
Parameters
PARAMETER | DESCRIPTION |
---|---|
span | Span to sort. |
RETURNS | DESCRIPTION |
---|---|
key | Sort key. |
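The description of `filter_spans` says that a higher key takes precedence and that, by default, the largest, leftmost spans are selected first. A key with that behaviour could look like the sketch below. `FakeSpan` and `sketch_sort_key` are hypothetical stand-ins rather than edsnlp names, and the key actually returned by `default_sort_key` may differ.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Span: token offsets, end exclusive.
FakeSpan = namedtuple("FakeSpan", ["start", "end"])


def sketch_sort_key(span):
    """Assumed key: longer spans rank higher, ties go to the leftmost span."""
    return (span.end - span.start, -span.start)


spans = [FakeSpan(0, 2), FakeSpan(5, 8), FakeSpan(1, 4)]

# Highest key first, i.e. the order in which spans would win conflicts.
ranked = sorted(spans, key=sketch_sort_key, reverse=True)
print(ranked)
```

Negating the start offset makes an earlier span win the tie between two spans of equal length.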
start_sort_key
Returns the sort key for filtering spans by start order.
Parameters
PARAMETER | DESCRIPTION |
---|---|
span | Span to sort. |
RETURNS | DESCRIPTION |
---|---|
key | Sort key. |
filter_spans
Re-definition of spaCy's filtering function that returns discarded spans as well as filtered ones.
Can also accept a label_to_remove argument, useful for filtering out pseudo-cues. If set, results can contain overlapping spans: only spans overlapping with excluded labels are removed.
It can handle an iterable of tuples instead of an iterable of Spans. The primary use case is the RegexMatcher's ability to return the span's groupdict.
The spaCy documentation states:
Filter a sequence of spans and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with Retokenizer.merge. When spans overlap, the (first) longest span is preferred over shorter spans.
Filtering out spans
If the label_to_remove argument is supplied, it might be tempting to also filter out overlapping spans that do not carry the label to remove.
We keep all other spans, even overlapping ones, because in qualifier pipelines the same cue can both precede and follow a marked entity. Hence we need to keep every occurrence.
Parameters
PARAMETER | DESCRIPTION |
---|---|
spans | Spans to filter. |
return_discarded | Whether to return discarded spans. |
label_to_remove | Label to remove. If set, results can contain overlapping spans. |
sort_key | Key used to sort spans before applying overlap conflict resolution. A span with a higher key takes precedence over another span. By default, the largest, leftmost spans are selected first. |
RETURNS | DESCRIPTION |
---|---|
results | Filtered spans. |
discarded | Discarded spans. |
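The greedy overlap resolution described above can be sketched in plain Python. This is an illustration under stated assumptions, not the edsnlp implementation: `FakeSpan` and `filter_spans_sketch` are hypothetical names, the sort key is the assumed "largest, leftmost first" ordering, and spans carrying the excluded label are assumed to be reported as discarded.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Span: token offsets (end exclusive) and a label.
FakeSpan = namedtuple("FakeSpan", ["start", "end", "label_"])


def filter_spans_sketch(spans, label_to_remove=None):
    """Greedy overlap resolution; returns (results, discarded)."""
    # Assumed default ordering: longest first, then leftmost.
    ranked = sorted(spans, key=lambda s: (s.end - s.start, -s.start), reverse=True)
    seen_tokens = set()
    results, discarded = [], []
    for span in ranked:
        tokens = range(span.start, span.end)
        overlaps = any(tok in seen_tokens for tok in tokens)
        excluded = label_to_remove is not None and span.label_ == label_to_remove
        if overlaps or excluded:
            discarded.append(span)
        else:
            results.append(span)
        # Without label_to_remove, every kept span blocks its tokens;
        # with it, only spans carrying the excluded label do.
        if (label_to_remove is None and not overlaps) or excluded:
            seen_tokens.update(tokens)
    return sorted(results, key=lambda s: s.start), discarded


plain = [FakeSpan(0, 3, "cue"), FakeSpan(2, 4, "cue"), FakeSpan(5, 6, "cue")]
kept, dropped = filter_spans_sketch(plain)

mixed = [
    FakeSpan(0, 3, "pseudo"),
    FakeSpan(2, 4, "cue"),
    FakeSpan(3, 6, "cue"),
    FakeSpan(4, 7, "cue"),
]
kept_cues, dropped_cues = filter_spans_sketch(mixed, label_to_remove="pseudo")
```

In the second call, the two cues at tokens 3 to 6 and 4 to 7 overlap each other but are both kept, because only overlap with the excluded label causes a removal.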
consume_spans
Consume a list of spans, according to a filter.
Warning
This method makes the strong assumptions that:
- Spans are sorted.
- Spans are consumed in sequence and only once.
The second assumption is problematic for the way we treat long entities, hence the second_chance parameter, which lets entities be seen more than once.
Parameters
PARAMETER | DESCRIPTION |
---|---|
spans | List of spans to filter. |
filter | Filtering function. Should return True when the item is to be included. |
second_chance | Optional list of spans to include again (useful for long entities). Defaults to None. |
RETURNS | DESCRIPTION |
---|---|
matches | List of spans consumed by the filter. |
remainder | List of remaining spans in the original |
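One way to read the two assumptions above is that consumption takes the leading run of spans accepted by the filter and leaves everything after the first rejection untouched. The sketch below follows that reading. `consume_spans_sketch` is a hypothetical name, spans are modelled as `(start, end)` tuples, and how second-chance spans are handled (re-tested, then appended) is an assumption.

```python
def consume_spans_sketch(spans, keep, second_chance=None):
    """Split sorted `spans` into the leading run accepted by `keep` and the rest."""
    cut = 0
    while cut < len(spans) and keep(spans[cut]):
        cut += 1
    matches = list(spans[:cut])
    remainder = list(spans[cut:])
    # Assumption: second-chance spans are re-tested and appended when accepted.
    matches.extend(s for s in (second_chance or []) if keep(s))
    return matches, remainder


# Spans as (start, end) token offsets, already sorted.
spans = [(0, 2), (3, 5), (8, 9), (12, 15)]

# First pass: consume every span ending at or before token 6.
matches, remainder = consume_spans_sketch(spans, lambda s: s[1] <= 6)

# Second pass on the remainder, giving (3, 5) a second chance.
again, rest = consume_spans_sketch(
    remainder, lambda s: s[1] <= 10, second_chance=[(3, 5)]
)
```

Because each pass only scans up to the first rejected span, repeated passes over the successive remainders visit each span once, which is why the input must be sorted.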
get_spans
Extracts spans with a given label. Prefer passing the label as its hash (integer ID) rather than as a string, for performance reasons.
Parameters
PARAMETER | DESCRIPTION |
---|---|
spans | List of spans to filter. |
label | Label to filter on. |
RETURNS | DESCRIPTION |
---|---|
List[Span] | Filtered spans. |
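A minimal sketch of label filtering that accepts either form of label is shown below. `FakeSpan`, `get_spans_sketch`, and the integer IDs 101 and 202 are made up for illustration. On real spaCy spans, `label` holds the integer hash and `label_` the string.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Span with both label forms.
FakeSpan = namedtuple("FakeSpan", ["start", "end", "label", "label_"])


def get_spans_sketch(spans, label):
    """Return the spans whose label matches an integer ID or a string."""
    if isinstance(label, int):
        # Integer comparison, no string lookup needed.
        return [s for s in spans if s.label == label]
    return [s for s in spans if s.label_ == label]


spans = [
    FakeSpan(0, 2, 101, "drug"),
    FakeSpan(3, 4, 202, "dose"),
    FakeSpan(6, 8, 101, "drug"),
]
by_id = get_spans_sketch(spans, 101)
by_name = get_spans_sketch(spans, "dose")
```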
span_f1
Computes the F1 overlap between two spans.
Parameters
PARAMETER | DESCRIPTION |
---|---|
a | First span. |
b | Second span. |
RETURNS | DESCRIPTION |
---|---|
float | F1 overlap |
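For two token intervals, the F1 (Dice) overlap is twice the number of shared tokens divided by the sum of the two lengths. The sketch below computes it on `(start, end)` tuples with an exclusive end. `span_f1_sketch` is a hypothetical name, and returning 0.0 for two empty spans is an assumption.

```python
def span_f1_sketch(a, b):
    """F1 (Dice) overlap of two (start, end) token intervals, end exclusive."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    total = (a[1] - a[0]) + (b[1] - b[0])
    # Assumption: two empty spans score 0.0 rather than raising.
    return 2 * overlap / total if total else 0.0


partial = span_f1_sketch((0, 4), (2, 6))   # 2 shared tokens, lengths 4 and 4
identical = span_f1_sketch((0, 3), (0, 3))
disjoint = span_f1_sketch((0, 2), (5, 7))
```

Here `partial` is 0.5, `identical` is 1.0 and `disjoint` is 0.0.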
align_spans
Aligns two lists of spans, by matching source spans that overlap target spans. This function is optimized to avoid quadratic complexity.
Parameters
PARAMETER | DESCRIPTION |
---|---|
source_spans | List of spans to align. |
target_spans | List of spans to align. |
sort_by_overlap | Whether to sort the aligned spans by maximum Dice/F1 overlap with the target span. |
RETURNS | DESCRIPTION |
---|---|
List[List[Span]] | Subset of |
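The sketch below shows one way to align spans without comparing every source span to every target span. It uses a token-to-source index, which is a technique chosen for this illustration; how the library avoids the quadratic comparison is not described here. `align_spans_sketch` and `dice` are hypothetical names, and spans are modelled as `(start, end)` tuples with an exclusive end.

```python
from collections import defaultdict


def dice(a, b):
    """Dice/F1 overlap of two (start, end) intervals, end exclusive."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    total = (a[1] - a[0]) + (b[1] - b[0])
    return 2 * overlap / total if total else 0.0


def align_spans_sketch(source_spans, target_spans, sort_by_overlap=False):
    """For each target, list the source spans sharing at least one token with it."""
    # Index every token position covered by a source span.
    by_token = defaultdict(list)
    for idx, (start, end) in enumerate(source_spans):
        for tok in range(start, end):
            by_token[tok].append(idx)

    aligned = []
    for target in target_spans:
        # Collect the distinct sources found under the target's tokens.
        hits = sorted({i for tok in range(*target) for i in by_token.get(tok, ())})
        matched = [source_spans[i] for i in hits]
        if sort_by_overlap:
            matched.sort(key=lambda s: dice(s, target), reverse=True)
        aligned.append(matched)
    return aligned


sources = [(0, 2), (1, 5), (7, 9)]
targets = [(1, 4), (6, 7), (8, 10)]

in_order = align_spans_sketch(sources, targets)
best_first = align_spans_sketch(sources, targets, sort_by_overlap=True)
```

The output has one list per target span, in the order of the targets. A target with no overlapping source gets an empty list.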
get_span_group
Get the spans of a span group that are contained inside a doclike object.
Parameters
PARAMETER | DESCRIPTION |
---|---|
doclike | Doclike object to act as a mask. |
group | Group name from which to get the spans. |
RETURNS | DESCRIPTION |
---|---|
List[Span] | List of spans. |