edsnlp.utils.filter
get_sort_key(span)
Returns the sort key for filtering spans.
PARAMETER | DESCRIPTION |
---|---|
span | Span to sort. |
RETURNS | DESCRIPTION |
---|---|
key | Sort key. |
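To make the ordering concrete, here is a minimal sketch. It assumes the key ranks spans by length and then by earliest start, which is the preference spaCy documents for its own span filtering. `FakeSpan` is a hypothetical stand-in for spaCy's `Span`, and `sketch_sort_key` is an illustration, not the code in edsnlp/utils/filter.py.

```python
from typing import NamedTuple, Tuple


class FakeSpan(NamedTuple):
    """Hypothetical stand-in for a spaCy Span (token offsets, end exclusive)."""
    start: int
    end: int
    label_: str = ""


def sketch_sort_key(span: FakeSpan) -> Tuple[int, int]:
    # Assumed ordering: longer spans rank higher; on ties, the earlier span wins.
    return span.end - span.start, -span.start


spans = [FakeSpan(4, 6), FakeSpan(0, 2), FakeSpan(1, 5)]
# Sorting in reverse puts the longest span first, then the earliest of the short ones.
ranked = sorted(spans, key=sketch_sort_key, reverse=True)
```

Sorting with `reverse=True` lets a greedy filter visit the preferred spans first.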
Source code in edsnlp/utils/filter.py
filter_spans(spans, label_to_remove=None, return_discarded=False)
Re-definition of spaCy's filtering function that returns the discarded spans as well as the filtered ones.
Can also accept a label_to_remove argument, useful for filtering out pseudo-cues. If set, results can contain overlapping spans: only spans overlapping with the excluded label are removed.
The spaCy documentation states:
Filter a sequence of spans and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with Retokenizer.merge. When spans overlap, the (first) longest span is preferred over shorter spans.
Filtering out spans
If the label_to_remove argument is supplied, it might be tempting to also filter out overlapping spans that do not carry the label to remove.
We keep all other, possibly overlapping, spans because in qualifier pipelines the same cue can both precede and follow a marked entity. Hence we need to keep every occurrence.
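The behaviour described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in edsnlp/utils/filter.py: `FakeSpan` and `sketch_filter_spans` are hypothetical names, the labels are made up, and the sketch always returns both lists, as if return_discarded were set.

```python
from typing import List, NamedTuple, Optional, Tuple


class FakeSpan(NamedTuple):
    """Hypothetical stand-in for a spaCy Span (token offsets, end exclusive)."""
    start: int
    end: int
    label_: str


def sketch_filter_spans(
    spans: List[FakeSpan], label_to_remove: Optional[str] = None
) -> Tuple[List[FakeSpan], List[FakeSpan]]:
    """Greedy longest-first filter that also returns the discarded spans."""
    ranked = sorted(spans, key=lambda s: (s.end - s.start, -s.start), reverse=True)
    results, discarded, seen = [], [], set()
    for span in ranked:
        excluded = label_to_remove is not None and span.label_ == label_to_remove
        free = span.start not in seen and span.end - 1 not in seen
        if free:
            if not excluded:
                results.append(span)
            # Without label_to_remove every kept span blocks its tokens;
            # with it, only spans carrying the excluded label do.
            if label_to_remove is None or excluded:
                seen.update(range(span.start, span.end))
        elif not excluded:
            discarded.append(span)
    by_start = lambda s: s.start
    return sorted(results, key=by_start), sorted(discarded, key=by_start)


spans = [
    FakeSpan(0, 3, "pseudo"),      # pseudo-cue covering tokens 0..2
    FakeSpan(0, 1, "negation"),    # cue hidden inside the pseudo-cue
    FakeSpan(5, 7, "negation"),
    FakeSpan(6, 7, "hypothesis"),  # overlaps the previous cue
]
plain_kept, plain_dropped = sketch_filter_spans(spans)
cue_kept, cue_dropped = sketch_filter_spans(spans, label_to_remove="pseudo")
```

With no label to remove, the overlapping `hypothesis` span is discarded in favour of the longer cue. With `label_to_remove="pseudo"`, only the cue under the pseudo-cue is discarded, and the two overlapping cues at the end are both kept.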
PARAMETER | DESCRIPTION |
---|---|
spans | Spans to filter. |
return_discarded | Whether to return discarded spans. |
label_to_remove | Label to remove. If set, results can contain overlapping spans. |
RETURNS | DESCRIPTION |
---|---|
results | Filtered spans. |
discarded | Discarded spans. |
Source code in edsnlp/utils/filter.py
consume_spans(spans, filter, second_chance=None)
Consume a list of spans according to a filter.
Warning
This method makes two strong assumptions:
- Spans are sorted.
- Spans are consumed in sequence and only once.
The second assumption is problematic for the way we treat long entities, hence the second_chance parameter, which lets entities be seen more than once.
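The two assumptions and the role of second_chance can be sketched as follows. This is a simplified illustration, not the library's code: spans are modelled as hypothetical `(start, end)` tuples, and `sketch_consume_spans` and `overlaps` are names introduced here.

```python
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]  # hypothetical stand-in: (start, end) token offsets


def sketch_consume_spans(
    spans: List[Span],
    keep: Callable[[Span], bool],
    second_chance: Optional[List[Span]] = None,
) -> Tuple[List[Span], List[Span]]:
    """Take the leading run of spans accepted by `keep`; the rest is the remainder."""
    # Spans given a second chance are re-tested independently of the sequence.
    extra = [s for s in (second_chance or []) if keep(s)]
    cut = 0
    while cut < len(spans) and keep(spans[cut]):
        cut += 1
    return spans[:cut] + extra, spans[cut:]


def overlaps(lo: int, hi: int) -> Callable[[Span], bool]:
    # Filter accepting any span that overlaps the window [lo, hi).
    return lambda s: s[0] < hi and s[1] > lo


spans = [(0, 2), (3, 7), (8, 9)]  # sorted; (3, 7) straddles both windows
first, rest = sketch_consume_spans(spans, overlaps(0, 5))
second, rest2 = sketch_consume_spans(rest, overlaps(5, 10), second_chance=first)
```

The span `(3, 7)` is consumed by the first window and would otherwise be lost to the second; passing the earlier matches as `second_chance` lets it be matched again.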
PARAMETER | DESCRIPTION |
---|---|
spans | List of spans to filter. |
filter | Filtering function. Should return True when the item is to be included. |
second_chance | Optional list of spans to include again (useful for long entities), by default None. |
RETURNS | DESCRIPTION |
---|---|
matches | List of spans consumed by the filter. |
remainder | List of spans remaining from the original spans list. |
Source code in edsnlp/utils/filter.py
get_spans(spans, label)
Extracts the spans that carry a given label. Prefer passing the label's hash rather than its string form, for performance reasons.
PARAMETER | DESCRIPTION |
---|---|
spans | List of spans to filter. |
label | Label to filter on. |
RETURNS | DESCRIPTION |
---|---|
List[Span] | Filtered spans. |
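As an illustration of filtering by hash or by string, here is a minimal sketch. It relies on the spaCy convention that `Span.label` is the integer hash and `Span.label_` its string form; `FakeSpan`, `sketch_get_spans`, and the hash values 101 and 202 are made up for the example and are not the library's code.

```python
from typing import List, NamedTuple, Union


class FakeSpan(NamedTuple):
    """Hypothetical stand-in for a spaCy Span: `label` is the hash, `label_` the string."""
    start: int
    end: int
    label: int
    label_: str


def sketch_get_spans(spans: List[FakeSpan], label: Union[int, str]) -> List[FakeSpan]:
    # An integer label is compared against the hash, a string against the text form.
    if isinstance(label, int):
        return [s for s in spans if s.label == label]
    return [s for s in spans if s.label_ == label]


spans = [
    FakeSpan(0, 1, 101, "negation"),
    FakeSpan(2, 4, 202, "hypothesis"),
    FakeSpan(5, 6, 101, "negation"),
]
by_hash = sketch_get_spans(spans, 101)
by_name = sketch_get_spans(spans, "hypothesis")
```

On real spaCy spans, comparing integers avoids resolving the label string for every span, which is presumably why the hash is the preferred form.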
Source code in edsnlp/utils/filter.py