edsnlp.utils.filter
default_sort_key(span)
Returns the sort key for filtering spans.
PARAMETER | DESCRIPTION |
---|---|
span | Span to sort. |
RETURNS | DESCRIPTION |
---|---|
key | Sort key. |
Source code in edsnlp/utils/filter.py
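As a rough illustration of what such a key can look like, the sketch below ranks longer spans first and breaks ties in favour of the leftmost span, in line with the default behaviour described for the sort_key parameter of filter_spans. ToySpan and toy_sort_key are hypothetical stand-ins, not library objects, and the exact value returned by default_sort_key is an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Span: token offsets only.
ToySpan = namedtuple("ToySpan", ["start", "end"])


def toy_sort_key(span):
    # Assumed key: (length, -start). A higher key means higher precedence,
    # so longer spans win and, at equal length, the leftmost span wins.
    return span.end - span.start, -span.start


spans = [ToySpan(3, 5), ToySpan(0, 2), ToySpan(1, 6)]

# Highest key first.
ranked = sorted(spans, key=toy_sort_key, reverse=True)
print(ranked)
```

Returning a tuple lets Python's ordinary tuple comparison handle the tie-break without any extra logic.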
start_sort_key(span)
Returns the sort key for filtering spans by start order.
PARAMETER | DESCRIPTION |
---|---|
span | Span to sort. |
RETURNS | DESCRIPTION |
---|---|
key | Sort key. |
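A start-order key can be sketched in the same spirit. ToySpan and toy_start_key below are hypothetical. The value the library actually returns is not shown on this page, so using the bare start offset is an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Span: token offsets only.
ToySpan = namedtuple("ToySpan", ["start", "end"])


def toy_start_key(span):
    # Assumed key: the start offset only, ignoring span length.
    return span.start


spans = [ToySpan(4, 9), ToySpan(0, 1), ToySpan(2, 3)]

# Order spans by where they begin in the document.
by_start = sorted(spans, key=toy_start_key)
print(by_start)
```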
filter_spans(spans, label_to_remove=None, return_discarded=False, sort_key=default_sort_key)
Re-definition of spaCy's filtering function that returns discarded spans as well as filtered ones.
Can also accept a label_to_remove argument, useful for filtering out pseudo-cues. If set, results can contain overlapping spans: only spans overlapping with excluded labels are removed. The main expected use case is for pseudo-cues.
It can also handle an iterable of tuples instead of an iterable of Spans. The primary use case is the RegexMatcher's ability to return the span's groupdict.
The spaCy documentation states:
Filter a sequence of spans and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with Retokenizer.merge. When spans overlap, the (first) longest span is preferred over shorter spans.
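The greedy strategy quoted above, extended to also report discarded spans, can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the library's implementation: ToySpan and toy_filter_spans are hypothetical names, and the ranking (longest first, then leftmost) follows the default described for sort_key.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Span.
ToySpan = namedtuple("ToySpan", ["start", "end", "label"])


def toy_filter_spans(spans, return_discarded=False):
    # Highest key first: longest span, then leftmost.
    ordered = sorted(
        spans, key=lambda s: (s.end - s.start, -s.start), reverse=True
    )
    kept, discarded, seen = [], [], set()
    for span in ordered:
        tokens = set(range(span.start, span.end))
        if tokens.isdisjoint(seen):
            # No token already claimed: keep the span and claim its tokens.
            kept.append(span)
            seen.update(tokens)
        else:
            discarded.append(span)
    kept.sort(key=lambda s: s.start)
    discarded.sort(key=lambda s: s.start)
    return (kept, discarded) if return_discarded else kept


spans = [
    ToySpan(0, 3, "A"),
    ToySpan(2, 4, "B"),  # overlaps the longer span above
    ToySpan(5, 6, "C"),
    ToySpan(0, 3, "A"),  # exact duplicate
]
kept, discarded = toy_filter_spans(spans, return_discarded=True)
print(kept, discarded)
```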
Filtering out spans
If the label_to_remove argument is supplied, it might be tempting to also filter out overlapping spans that do not carry the label to remove.
We keep all other, possibly overlapping, spans because in qualifier pipelines the same cue can both precede and follow a marked entity. Hence we need to keep every example.
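A minimal sketch of the label_to_remove behaviour described above: spans carrying the excluded label are dropped together with anything that overlaps them, while all other spans are kept even when they overlap one another. ToySpan and toy_remove_label are hypothetical, and the sketch deliberately ignores how the library combines this rule with sort_key precedence.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Span.
ToySpan = namedtuple("ToySpan", ["start", "end", "label"])


def toy_remove_label(spans, label_to_remove):
    # Tokens covered by spans carrying the excluded label.
    blocked = set()
    for span in spans:
        if span.label == label_to_remove:
            blocked.update(range(span.start, span.end))
    # Keep every other span unless it touches a blocked token.
    kept = [
        s
        for s in spans
        if s.label != label_to_remove
        and blocked.isdisjoint(range(s.start, s.end))
    ]
    return sorted(kept, key=lambda s: s.start)


spans = [
    ToySpan(0, 2, "cue"),
    ToySpan(1, 3, "cue"),     # overlaps the first cue, still kept
    ToySpan(5, 8, "pseudo"),  # excluded label
    ToySpan(6, 7, "cue"),     # inside the pseudo-cue, removed
]
kept = toy_remove_label(spans, "pseudo")
print(kept)
```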
PARAMETER | DESCRIPTION |
---|---|
spans | Spans to filter. |
return_discarded | Whether to return discarded spans. |
label_to_remove | Label to remove. If set, results can contain overlapping spans. |
sort_key | Key used to sort spans before applying overlap conflict resolution. A span with a higher key takes precedence over another span. By default, the largest, leftmost spans are selected first. |
RETURNS | DESCRIPTION |
---|---|
results | Filtered spans |
discarded | Discarded spans |
consume_spans(spans, filter, second_chance=None)
Consume a list of spans, according to a filter.
Warning
This method makes the following strong assumptions:
- Spans are sorted.
- Spans are consumed in sequence and only once.
The second assumption is problematic for the way we treat long entities, hence the second_chance parameter, which lets entities be seen more than once.
PARAMETER | DESCRIPTION |
---|---|
spans | List of spans to filter |
filter | Filtering function. Should return True when the item is to be included. |
second_chance | Optional list of spans to include again (useful for long entities), by default None |
RETURNS | DESCRIPTION |
---|---|
matches | List of spans consumed by the filter. |
remainder | List of remaining spans in the original |
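One way to read the contract above is as "take the leading run of spans that pass the filter, then re-test the second-chance spans". The sketch below follows that reading in plain Python. toy_consume is a hypothetical name, spans are modelled as (start, end) tuples, and appending second-chance spans after the regular matches is an assumption.

```python
def toy_consume(spans, keep, second_chance=None):
    # Re-examine previously seen spans that still pass the filter.
    retained = [s for s in (second_chance or []) if keep(s)]
    # Take the leading run of spans accepted by the filter. This relies on
    # the spans being sorted: once one is rejected, the rest are left alone.
    cut = 0
    while cut < len(spans) and keep(spans[cut]):
        cut += 1
    matches = spans[:cut] + retained
    remainder = spans[cut:]
    return matches, remainder


# Spans as (start, end) tuples, sorted by start.
spans = [(0, 1), (2, 3), (8, 9)]
long_entity = [(4, 12)]  # already consumed once, offered again

matches, remainder = toy_consume(
    spans, keep=lambda s: s[0] < 5, second_chance=long_entity
)
print(matches, remainder)
```

The remainder can then be passed to the next call, so each span in the main list is visited only once overall.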
get_spans(spans, label)
Extracts spans with a given label. Prefer passing the label as a hash rather than a string, for performance reasons.
PARAMETER | DESCRIPTION |
---|---|
spans | List of spans to filter. |
label | Label to filter on. |
RETURNS | DESCRIPTION |
---|---|
List[Span] | Filtered spans. |
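The hash-versus-string distinction can be illustrated with a small stand-in. spaCy spans expose the label both as an integer hash (label) and as a string (label_); the ToySpan type, the toy_get_spans function and the hash values below are hypothetical, and dispatching on the argument type is an assumption about how such a helper might work.

```python
from collections import namedtuple

# Hypothetical stand-in: label is an integer hash, label_ its string form.
ToySpan = namedtuple("ToySpan", ["start", "end", "label", "label_"])


def toy_get_spans(spans, label):
    if isinstance(label, int):
        # Integer comparison: no string lookup needed per span.
        return [s for s in spans if s.label == label]
    return [s for s in spans if s.label_ == label]


spans = [
    ToySpan(0, 1, 101, "drug"),
    ToySpan(2, 3, 202, "dose"),
    ToySpan(4, 5, 101, "drug"),
]
by_hash = toy_get_spans(spans, 101)
by_name = toy_get_spans(spans, "drug")
print(by_hash == by_name)
```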
span_f1(a, b)
Computes the F1 overlap between two spans.
PARAMETER | DESCRIPTION |
---|---|
a | First span |
b | Second span |
RETURNS | DESCRIPTION |
---|---|
float | F1 overlap |
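For two token ranges, an F1 (Dice) overlap is commonly computed as twice the number of shared tokens divided by the sum of the two lengths. The sketch below applies that formula to hypothetical ToySpan objects. It assumes half-open [start, end) offsets and is not taken from the library source.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Span, with half-open token offsets.
ToySpan = namedtuple("ToySpan", ["start", "end"])


def toy_span_f1(a, b):
    # Number of tokens shared by both spans (0 when they are disjoint).
    overlap = max(0, min(a.end, b.end) - max(a.start, b.start))
    total = (a.end - a.start) + (b.end - b.start)
    # Guard against two empty spans.
    return 2 * overlap / total if total else 0.0


partial = toy_span_f1(ToySpan(0, 4), ToySpan(2, 6))   # 2 shared tokens out of 4 + 4
same = toy_span_f1(ToySpan(1, 3), ToySpan(1, 3))
apart = toy_span_f1(ToySpan(0, 2), ToySpan(5, 7))
print(partial, same, apart)
```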
align_spans(source, target, sort_by_overlap=False)
Aligns two lists of spans, by matching source spans that overlap target spans. This function is optimized to avoid quadratic complexity.
PARAMETER | DESCRIPTION |
---|---|
source | List of spans to align. |
target | List of spans to align. |
sort_by_overlap | Whether to sort the aligned spans by maximum dice/f1 overlap with the target span. |
RETURNS | DESCRIPTION |
---|---|
List[List[Span]] | Subset of |
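One way to avoid comparing every source span with every target span is to sort both lists and sweep through them together. The sketch below shows such a sweep in plain Python and returns one list of overlapping source spans per target span, in the original target order. ToySpan, toy_f1 and toy_align are hypothetical names, and the sweep is an assumed strategy, not a transcription of the library's algorithm.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Span, with half-open token offsets.
ToySpan = namedtuple("ToySpan", ["start", "end"])


def toy_f1(a, b):
    overlap = max(0, min(a.end, b.end) - max(a.start, b.start))
    return 2 * overlap / ((a.end - a.start) + (b.end - b.start))


def toy_align(source, target, sort_by_overlap=False):
    src = sorted(source, key=lambda s: (s.start, s.end))
    # Visit targets by start offset, but store results at their original index.
    order = sorted(range(len(target)), key=lambda i: target[i].start)
    aligned = [[] for _ in target]
    lo = 0
    for i in order:
        t = target[i]
        # Leading source spans that end before this target starts cannot
        # overlap it, nor any later target, so skip them for good.
        while lo < len(src) and src[lo].end <= t.start:
            lo += 1
        j = lo
        # Only source spans starting before the target ends can overlap it.
        while j < len(src) and src[j].start < t.end:
            if src[j].end > t.start:
                aligned[i].append(src[j])
            j += 1
        if sort_by_overlap:
            aligned[i].sort(key=lambda s: toy_f1(s, t), reverse=True)
    return aligned


source = [ToySpan(0, 2), ToySpan(1, 5), ToySpan(6, 8)]
target = [ToySpan(4, 7), ToySpan(0, 1)]

plain = toy_align(source, target)
ranked = toy_align(source, target, sort_by_overlap=True)
print(plain, ranked)
```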
get_span_group(doclike, group)
Get the spans of a span group that are contained inside a doclike object.
PARAMETER | DESCRIPTION |
---|---|
doclike | Doclike object to act as a mask. |
group | Group name from which to get the spans. |
RETURNS | DESCRIPTION |
---|---|
List[Span] | List of spans. |
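The masking idea can be sketched without spaCy: given a window of token offsets and a mapping from group names to spans, keep only the spans that lie entirely inside the window. ToySpan, toy_get_span_group and the groups dictionary are hypothetical, and reading "contained" as full containment rather than partial overlap is an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Span, with half-open token offsets.
ToySpan = namedtuple("ToySpan", ["start", "end"])


def toy_get_span_group(window, groups, name):
    # Keep spans whose boundaries both fall within the window.
    return [
        s
        for s in groups.get(name, [])
        if s.start >= window.start and s.end <= window.end
    ]


# Stand-in for the span groups attached to a document.
groups = {"dates": [ToySpan(0, 2), ToySpan(4, 6), ToySpan(5, 9)]}
sentence = ToySpan(3, 7)  # the doclike object acting as a mask

inside = toy_get_span_group(sentence, groups, "dates")
print(inside)
```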