edsnlp.utils
colors
CATEGORY20 = ['#1f77b4', '#aec7e8', '#ff7f0e', '#ffbb78', '#2ca02c', '#98df8a', '#d62728', '#ff9896', '#9467bd', '#c5b0d5', '#8c564b', '#c49c94', '#e377c2', '#f7b6d2', '#7f7f7f', '#c7c7c7', '#bcbd22', '#dbdb8d', '#17becf', '#9edae5']
module-attribute
create_colors(labels)
Assign a colour to each label, using the CATEGORY20 palette. The palette is cycled when there are more labels than colours.
PARAMETER | DESCRIPTION |
---|---|
labels | List of labels to colorise in displacy. |

RETURNS | DESCRIPTION |
---|---|
Dict[str, str] | A displacy-compatible colour assignment. |
Source code in edsnlp/utils/colors.py
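The cycling behaviour of create_colors can be illustrated without spaCy. This is a minimal sketch rather than the library's implementation: create_colors_sketch is a hypothetical name, and it assumes labels are paired with palette entries in the order given.

```python
from itertools import cycle
from typing import Dict, List

# Palette copied from the CATEGORY20 module attribute documented above.
CATEGORY20 = [
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c",
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5",
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f",
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5",
]


def create_colors_sketch(labels: List[str]) -> Dict[str, str]:
    """Pair each label with a colour, wrapping around when labels outnumber colours."""
    # cycle() restarts the palette once its 20 entries are exhausted.
    return dict(zip(labels, cycle(CATEGORY20)))


labels = [f"label_{i}" for i in range(22)]
colors = create_colors_sketch(labels)
```

With 22 labels, the 21st and 22nd reuse the first two colours of the palette. A dictionary of this shape can be passed to displacy through the colors key of its options argument.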
deprecation
deprecated_extension(name, new_name)
Source code in edsnlp/utils/deprecation.py
deprecated_getter_factory(name, new_name)
Source code in edsnlp/utils/deprecation.py
deprecation(name, new_name=None)
Source code in edsnlp/utils/deprecation.py
deprecated_factory(name, new_name=None, default_config=None, func=None)
Execute the Language.factory method on a modified factory function. The modification adds a deprecation warning.
PARAMETER | DESCRIPTION |
---|---|
name | The deprecated name for the pipeline. |
new_name | The new name for the pipeline, which should be used, by default None. |
default_config | The configuration that should be passed to Language.factory, by default None. |
func | The function to decorate, by default None. |

RETURNS | DESCRIPTION |
---|---|
Callable | |
Source code in edsnlp/utils/deprecation.py
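The idea of wrapping a factory function so that it emits a deprecation warning can be sketched in plain Python. The example below does not call Language.factory and is not the library's code: deprecated_sketch, create_component and the names "old_name" and "new_name" are all hypothetical.

```python
import functools
import warnings
from typing import Callable, Optional


def deprecated_sketch(name: str, new_name: Optional[str] = None) -> Callable:
    """Build a decorator that warns every time the wrapped function is called."""
    message = f'"{name}" is deprecated.'
    if new_name is not None:
        message += f' Please use "{new_name}" instead.'

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)  # keep the original name and docstring
        def wrapper(*args, **kwargs):
            warnings.warn(message, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical factory function registered under a deprecated name.
@deprecated_sketch("old_name", new_name="new_name")
def create_component(name: str = "component") -> str:
    return f"component:{name}"
```

Calling create_component still returns the component, but a DeprecationWarning pointing to the new name is raised first.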
examples
entity_pattern = re.compile('(<ent[^<>]*>[^<>]+</ent>)')
module-attribute
text_pattern = re.compile('<ent.*>(.+)</ent>')
module-attribute
modifiers_pattern = re.compile('<ent\\s?(.*)>.+</ent>')
module-attribute
Match
Bases: BaseModel
Source code in edsnlp/utils/examples.py
start_char: int = None
class-attribute
end_char: int = None
class-attribute
text: str = None
class-attribute
modifiers: str = None
class-attribute
Modifier
Bases: BaseModel
Source code in edsnlp/utils/examples.py
key: str = None
class-attribute
value: Union[int, float, bool, str] = None
class-attribute
Entity
Bases: BaseModel
Source code in edsnlp/utils/examples.py
start_char: int = None
class-attribute
end_char: int = None
class-attribute
modifiers: List[Modifier] = None
class-attribute
find_matches(example)
Finds entities within the example.
PARAMETER | DESCRIPTION |
---|---|
example | Example to process. |

RETURNS | DESCRIPTION |
---|---|
List[re.Match] | List of matches for entities. |
Source code in edsnlp/utils/examples.py
parse_match(match)
Parse a regex match representing an entity.
PARAMETER | DESCRIPTION |
---|---|
match | Match for an entity. |

RETURNS | DESCRIPTION |
---|---|
Match | Usable representation for the entity match. |
Source code in edsnlp/utils/examples.py
parse_example(example)
Parses an example: finds entities and removes the tags.
PARAMETER | DESCRIPTION |
---|---|
example | Example to process. |

RETURNS | DESCRIPTION |
---|---|
Tuple[str, List[Entity]] | Cleaned text and extracted entities. |
Source code in edsnlp/utils/examples.py
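The three module-level patterns are enough to sketch how a tagged example could be turned into clean text plus character offsets. The code below is an illustration under assumptions, not the library's parse_example: parse_example_sketch is a hypothetical name, entities are returned as plain tuples instead of Entity objects, and the modifiers are left as a raw string.

```python
import re
from typing import List, Tuple

# Patterns as documented for edsnlp.utils.examples.
entity_pattern = re.compile(r"(<ent[^<>]*>[^<>]+</ent>)")
text_pattern = re.compile(r"<ent.*>(.+)</ent>")
modifiers_pattern = re.compile(r"<ent\s?(.*)>.+</ent>")


def parse_example_sketch(example: str) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Strip <ent> tags and record (start_char, end_char, modifiers) per entity."""
    pieces = []    # chunks of the cleaned text
    entities = []  # offsets are relative to the cleaned text
    cursor = 0     # position in the tagged input
    length = 0     # length of the cleaned text built so far

    for match in entity_pattern.finditer(example):
        before = example[cursor:match.start()]
        pieces.append(before)
        length += len(before)

        tagged = match.group()
        text = text_pattern.match(tagged).group(1)
        modifiers = modifiers_pattern.match(tagged).group(1)

        entities.append((length, length + len(text), modifiers))
        pieces.append(text)
        length += len(text)
        cursor = match.end()

    pieces.append(example[cursor:])
    return "".join(pieces), entities


cleaned, entities = parse_example_sketch(
    "No sign of <ent negated=true>fever</ent> today."
)
start, end, modifiers = entities[0]
```

Here cleaned no longer contains any tag, and cleaned[start:end] recovers the entity text.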
filter
default_sort_key(span)
Returns the sort key for filtering spans.
PARAMETER | DESCRIPTION |
---|---|
span | Span to sort. |

RETURNS | DESCRIPTION |
---|---|
key | Sort key. |
Source code in edsnlp/utils/filter.py
start_sort_key(span)
Returns the sort key for filtering spans by start order.
PARAMETER | DESCRIPTION |
---|---|
span | Span to sort. |

RETURNS | DESCRIPTION |
---|---|
key | Sort key. |
Source code in edsnlp/utils/filter.py
filter_spans(spans, label_to_remove=None, return_discarded=False, sort_key=default_sort_key)
Re-definition of spaCy's filtering function that returns discarded spans as well as filtered ones.
It also accepts a label_to_remove argument, useful for filtering out pseudo-cues. If set, results can contain overlapping spans: only spans overlapping with excluded labels are removed. The main expected use case is for pseudo-cues.
It can handle an iterable of tuples instead of an iterable of Spans. The primary use case is the RegexMatcher's ability to return the span's groupdict.
The spaCy documentation states:
Filter a sequence of spans and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with Retokenizer.merge. When spans overlap, the (first) longest span is preferred over shorter spans.
Filtering out spans
If the label_to_remove argument is supplied, it might be tempting to filter overlapping spans that are not part of a label to remove.
The reason we keep all other possibly overlapping labels is that in qualifier pipelines, the same cue can precede and follow a marked entity. Hence we need to keep every example.
PARAMETER | DESCRIPTION |
---|---|
spans | Spans to filter. |
return_discarded | Whether to return discarded spans. |
label_to_remove | Label to remove. If set, results can contain overlapping spans. |
sort_key | Key to sorting spans before applying overlap conflict resolution. A span with a higher key will have precedence over another span. By default, the largest, leftmost spans are selected first. |

RETURNS | DESCRIPTION |
---|---|
results | Filtered spans |
discarded | Discarded spans |
Source code in edsnlp/utils/filter.py
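The overlap resolution described for filter_spans (largest, leftmost spans first, with discarded spans returned alongside the kept ones) can be sketched on plain tuples. This is an assumption-laden illustration: filter_spans_sketch and sort_key_sketch are hypothetical names, spans are (start, end, label) tuples standing in for spaCy Span objects, and label_to_remove is not handled.

```python
from typing import List, Tuple

# Hypothetical stand-in for a spaCy Span: (start token, end token, label).
SpanTuple = Tuple[int, int, str]


def sort_key_sketch(span: SpanTuple) -> Tuple[int, int]:
    """Longer spans get a higher key; ties are broken in favour of the leftmost."""
    start, end, _ = span
    return end - start, -start


def filter_spans_sketch(
    spans: List[SpanTuple],
) -> Tuple[List[SpanTuple], List[SpanTuple]]:
    results, discarded = [], []
    seen_tokens = set()

    # Visit spans from highest to lowest key and keep those not yet covered.
    for span in sorted(spans, key=sort_key_sketch, reverse=True):
        start, end, _ = span
        # Longer spans were handled first, so checking both ends is sufficient.
        if start not in seen_tokens and end - 1 not in seen_tokens:
            results.append(span)
            seen_tokens.update(range(start, end))
        else:
            discarded.append(span)

    return sorted(results), discarded


spans = [(0, 3, "A"), (2, 4, "B"), (5, 6, "C")]
results, discarded = filter_spans_sketch(spans)
```

Span B overlaps the longer span A on token 2, so it ends up in discarded while A and C are kept.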
consume_spans(spans, filter, second_chance=None)
Consume a list of spans, according to a filter.
Warning
This method makes the strong assumptions that:
- Spans are sorted.
- Spans are consumed in sequence and only once.
The second item is problematic for the way we treat long entities, hence the second_chance parameter, which lets entities be seen more than once.
PARAMETER | DESCRIPTION |
---|---|
spans | List of spans to filter |
filter | Filtering function. Should return True when the item is to be included. |
second_chance | Optional list of spans to include again (useful for long entities), by default None |

RETURNS | DESCRIPTION |
---|---|
matches | List of spans consumed by the filter. |
remainder | List of remaining spans in the original |
Source code in edsnlp/utils/filter.py
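One way to read "consumed in sequence and only once" is that the function takes the leading run of items accepted by the filter and hands back the rest for a later call. The sketch below follows that reading, which is an assumption and may differ from the library's code; consume_sketch is a hypothetical name and integers stand in for sorted spans.

```python
from typing import Callable, List, Optional, Tuple


def consume_sketch(
    items: List[int],
    keep: Callable[[int], bool],
    second_chance: Optional[List[int]] = None,
) -> Tuple[List[int], List[int]]:
    """Split sorted items into the leading accepted run and the remainder."""
    # Items given a second chance are re-tested even if already consumed.
    extra = [item for item in (second_chance or []) if keep(item)]

    # Advance while the filter accepts items; stop at the first rejection.
    cut = 0
    while cut < len(items) and keep(items[cut]):
        cut += 1

    return items[:cut] + extra, items[cut:]


items = [1, 2, 3, 7, 8]
first, rest = consume_sketch(items, lambda x: x < 5)
second, rest_after = consume_sketch(rest, lambda x: x < 10, second_chance=[3])
```

The second call only sees the remainder of the first, and the item 3 is seen again because it was passed through second_chance.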
get_spans(spans, label)
Extracts spans with a given label. Prefer using hash label for performance reasons.
PARAMETER | DESCRIPTION |
---|---|
spans | List of spans to filter. |
label | Label to filter on. |

RETURNS | DESCRIPTION |
---|---|
List[Span] | Filtered spans. |
Source code in edsnlp/utils/filter.py
resources
get_verbs(verbs=None, check_contains=True)
Extract verbs from the resources, as a pandas dataframe.
PARAMETER | DESCRIPTION |
---|---|
verbs | List of verbs to keep. Returns all verbs by default. |
check_contains | Whether to check that no verb is missing if a list of verbs was provided. By default True |

RETURNS | DESCRIPTION |
---|---|
pd.DataFrame | DataFrame containing conjugated verbs. |
Source code in edsnlp/utils/resources.py
regex
make_pattern(patterns, with_breaks=False, name=None)
Create OR pattern from a list of patterns.
PARAMETER | DESCRIPTION |
---|---|
patterns | List of patterns to merge. |
with_breaks | Whether to add breaks ( |
name | Name of the group, using regex |

RETURNS | DESCRIPTION |
---|---|
str | Merged pattern. |
Source code in edsnlp/utils/regex.py
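Merging a list of patterns into a single OR pattern can be sketched as follows. Several details here are assumptions rather than documented behaviour: make_pattern_sketch is a hypothetical name, "breaks" are interpreted as \b word boundaries, alternatives are ordered longest first, and unnamed groups are written as non-capturing groups.

```python
import re
from typing import List, Optional


def make_pattern_sketch(
    patterns: List[str],
    with_breaks: bool = False,
    name: Optional[str] = None,
) -> str:
    """Join patterns with | inside a single (optionally named) group."""
    # Longest alternatives first, so a short pattern cannot shadow a longer one.
    ordered = sorted(patterns, key=len, reverse=True)

    prefix = f"(?P<{name}>" if name else "(?:"
    pattern = prefix + "|".join(ordered) + ")"

    if with_breaks:
        # Assumption: "breaks" means word boundaries on both sides.
        pattern = r"\b" + pattern + r"\b"

    return pattern


pattern = make_pattern_sketch(["mal", "maladie"], with_breaks=True, name="cue")
match = re.search(pattern, "une maladie rare")
```

The named group makes the matched alternative retrievable with match.group("cue").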
compile_regex(reg)
This function tries to compile reg using the re module, and falls back to the regex module, which is more permissive.
PARAMETER | DESCRIPTION |
---|---|
reg | |

RETURNS | DESCRIPTION |
---|---|
Union[re.Pattern, regex.Pattern] | |
Source code in edsnlp/utils/regex.py
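The try-then-fall-back logic can be sketched in a few lines. This is not the library's code: compile_regex_sketch is a hypothetical name, and the third-party regex package is imported lazily, so it is only required when the standard re module rejects the pattern.

```python
import re


def compile_regex_sketch(reg: str):
    """Compile with re when possible, otherwise with the regex package."""
    try:
        return re.compile(reg)
    except re.error:
        # Imported here so that the standard-library path has no extra dependency.
        import regex

        return regex.compile(reg)


compiled = compile_regex_sketch(r"\d+")
found = compiled.search("abc 42")
```

A pattern such as one using Unicode property classes (for example \p{L}) is rejected by re and would take the fallback path.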
inclusion
check_inclusion(span, start, end)
Checks whether the span overlaps the boundaries.
PARAMETER | DESCRIPTION |
---|---|
span | Span to check. |
start | Start of the boundary |
end | End of the boundary |

RETURNS | DESCRIPTION |
---|---|
bool | Whether the span overlaps the boundaries. |
Source code in edsnlp/utils/inclusion.py
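An overlap test between a span and a pair of boundaries reduces to an interval comparison. The sketch below assumes half-open token intervals, as used by spaCy spans; check_inclusion_sketch is a hypothetical name and SimpleNamespace objects stand in for real Span instances.

```python
from types import SimpleNamespace


def check_inclusion_sketch(span, start: int, end: int) -> bool:
    """Return True when [span.start, span.end) intersects [start, end)."""
    # Two half-open intervals overlap unless one ends before the other starts.
    return span.start < end and span.end > start


inside = SimpleNamespace(start=3, end=5)
straddling = SimpleNamespace(start=8, end=12)
outside = SimpleNamespace(start=10, end=12)

checks = [
    check_inclusion_sketch(inside, 2, 10),
    check_inclusion_sketch(straddling, 2, 10),
    check_inclusion_sketch(outside, 2, 10),
]
```

A span that merely touches the end boundary, like the last one, is not counted as overlapping under the half-open convention.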
blocs
Utility that extracts code blocs and runs them.
Largely inspired by https://github.com/koaning/mktestdocs
BLOCK_PATTERN = re.compile('((?P<skip><!-- no-check -->)\\s+)?(?P<indent> *)```(?P<title>.*?)\\n(?P<code>.+?)```', flags=re.DOTALL)
module-attribute
OUTPUT_PATTERN = '# Out: '
module-attribute
check_outputs(code)
Looks for output patterns, and modifies the bloc:
- The preceding line becomes v = expr
- The output line becomes an assert statement
PARAMETER | DESCRIPTION |
---|---|
code | Code block |

RETURNS | DESCRIPTION |
---|---|
str | Modified code bloc with assert statements |
Source code in edsnlp/utils/blocs.py
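The rewrite performed by check_outputs can be sketched on a small snippet. The exact form of the generated assertion is an assumption (a comparison on repr(v) is used here), check_outputs_sketch is a hypothetical name, and only single-line expressions and outputs are handled.

```python
OUTPUT_PATTERN = "# Out: "


def check_outputs_sketch(code: str) -> str:
    """Turn 'expr' followed by '# Out: value' into an assignment and an assert."""
    rewritten = []
    for line in code.split("\n"):
        if line.startswith(OUTPUT_PATTERN) and rewritten:
            expected = line[len(OUTPUT_PATTERN):]
            # The preceding expression is captured in a variable...
            rewritten[-1] = "v = " + rewritten[-1]
            # ...and the output comment becomes a check on its representation.
            rewritten.append(f"assert repr(v) == {expected!r}")
        else:
            rewritten.append(line)
    return "\n".join(rewritten)


snippet = "a = 1\na + 1\n# Out: 2"
transformed = check_outputs_sketch(snippet)
```

Executing the transformed snippet then fails loudly whenever the documented output no longer matches the actual value.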
remove_indentation(code, indent)
Remove indentation from a code bloc.
PARAMETER | DESCRIPTION |
---|---|
code | Code bloc |
indent | Level of indentation |

RETURNS | DESCRIPTION |
---|---|
str | Modified code bloc |
Source code in edsnlp/utils/blocs.py
grab_code_blocks(docstring, lang='python')
Given a docstring, grab all the markdown codeblocks found in docstring.
PARAMETER | DESCRIPTION |
---|---|
docstring | Full text. |
lang | Language to execute, by default "python" |

RETURNS | DESCRIPTION |
---|---|
List[str] | Extracted code blocks |
Source code in edsnlp/utils/blocs.py
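BLOCK_PATTERN can be exercised directly to see how fenced blocks are extracted and how the no-check marker skips one. In this sketch the pattern is rebuilt with a FENCE helper only so that the example can itself be displayed inside a fence; grab_code_blocks_sketch is a hypothetical name, and textwrap.dedent is used in place of remove_indentation as a simplifying assumption.

```python
import re
import textwrap
from typing import List

FENCE = "`" * 3  # three backticks, built programmatically

# Same expression as the documented BLOCK_PATTERN, split around the fences.
BLOCK_PATTERN = re.compile(
    r"((?P<skip><!-- no-check -->)\s+)?(?P<indent> *)"
    + FENCE
    + r"(?P<title>.*?)\n(?P<code>.+?)"
    + FENCE,
    flags=re.DOTALL,
)


def grab_code_blocks_sketch(docstring: str, lang: str = "python") -> List[str]:
    """Collect fenced blocks in the given language, ignoring skipped ones."""
    blocks = []
    for match in BLOCK_PATTERN.finditer(docstring):
        if match.group("skip"):
            continue  # block preceded by the no-check marker
        if not match.group("title").strip().startswith(lang):
            continue  # block written in another language
        blocks.append(textwrap.dedent(match.group("code")))
    return blocks


doc = (
    "Intro\n\n"
    + FENCE + "python\nx = 1\n" + FENCE
    + "\n\n<!-- no-check -->\n"
    + FENCE + "python\nraise ValueError\n" + FENCE
    + "\n\n"
    + FENCE + "bash\nls\n" + FENCE + "\n"
)
blocks = grab_code_blocks_sketch(doc)
```

Only the first block is returned: the second is skipped by the marker and the third is not Python.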
printer(code)
Prints a code bloc with lines for easier debugging.
PARAMETER | DESCRIPTION |
---|---|
code | Code bloc. |
Source code in edsnlp/utils/blocs.py
check_docstring(obj, lang='')
Given a function, test the contents of the docstring.
Source code in edsnlp/utils/blocs.py
check_raw_string(raw, lang='python')
Given a raw string, test the contents.
Source code in edsnlp/utils/blocs.py
check_raw_file_full(raw, lang='python')
Source code in edsnlp/utils/blocs.py
check_md_file(path, memory=False)
Given a markdown file, parse the contents for Python code blocs and check that each independent bloc does not cause an error.
PARAMETER | DESCRIPTION |
---|---|
path | Path to the markdown file to execute. |
memory | Whether to keep results from one bloc to the next, by default |
Source code in edsnlp/utils/blocs.py
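The effect of the memory flag can be pictured as a choice between a shared namespace and a fresh one per bloc. The sketch below is a guess at that behaviour for illustration only: run_blocks_sketch is a hypothetical name, and it works on a list of code strings instead of reading a markdown file.

```python
from typing import List


def run_blocks_sketch(blocks: List[str], memory: bool = False) -> None:
    """Execute each block, optionally sharing variables between blocks."""
    shared: dict = {}
    for block in blocks:
        # With memory, every block sees what earlier blocks defined.
        namespace = shared if memory else {}
        exec(block, namespace)


blocks = ["x = 1", "y = x + 1"]

# Runs fine: the second block can read x from the first.
run_blocks_sketch(blocks, memory=True)

# Without memory, the second block cannot see x.
try:
    run_blocks_sketch(blocks, memory=False)
    independent_ok = True
except NameError:
    independent_ok = False
```

Running blocs independently is stricter: each one has to carry its own imports and definitions.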