# edsnlp.pipelines.core.normalizer

## normalizer

### Normalizer

Bases: `object`
Normalisation pipeline. Modifies the `NORM` attribute, acting on four dimensions:

- `lowercase`: using the default `NORM` attribute
- `accents`: deterministic and fixed-length normalisation of accents
- `quotes`: deterministic and fixed-length normalisation of quotation marks
- `pollution`: removal of pollutions
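As an illustrative sketch only (plain strings rather than spaCy tokens, with hypothetical helper names), the four dimensions can be chained like this; the real component writes to each token's `NORM` attribute instead of returning a new string:

```python
import re

# Hypothetical helpers mirroring the four normalisation dimensions.
def lowercase(text: str) -> str:
    return text.lower()

def normalise_accents(text: str) -> str:
    # Same-length translation, as in the Accents component documented below.
    table = str.maketrans("çàáâäèéêëìíîïòóôöùúûü", "caaaaeeeeiiiioooouuuu")
    return text.translate(table)

def normalise_quotes(text: str) -> str:
    return text.replace("“", '"').replace("”", '"').replace("’", "'")

def remove_pollution(text: str) -> str:
    # `bars` pattern from the pollution module.
    return re.sub(r"(?i)([nbw]|_|-|=){5,}", " ", text)

def normalise(text: str) -> str:
    for step in (lowercase, normalise_accents, normalise_quotes, remove_pollution):
        text = step(text)
    return text

print(normalise("État GÉNÉRAL “bon”"))  # etat general "bon"
```

Note that the order matters: case is handled first, so the accent table only needs lowercase characters.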
| Parameter | Type | Description |
|---|---|---|
| `lowercase` | `bool` | Whether to remove case. |
| `accents` | `Optional[Accents]` | Optional component normalising accents. |
| `quotes` | `Optional[Quotes]` | Optional component normalising quotation marks. |
| `pollution` | `Optional[Pollution]` | Optional component tagging pollutions. |
Source code in `edsnlp/pipelines/core/normalizer/normalizer.py`
- `lowercase = lowercase` *instance-attribute*
- `accents = accents` *instance-attribute*
- `quotes = quotes` *instance-attribute*
- `pollution = pollution` *instance-attribute*
#### `__init__(lowercase, accents, quotes, pollution)`

Source code in `edsnlp/pipelines/core/normalizer/normalizer.py`
#### `__call__(doc)`

Apply the normalisation pipeline, one component at a time.

| Parameter | Type | Description |
|---|---|---|
| `doc` | `Doc` | spaCy `Doc` object. |

| Returns | Description |
|---|---|
| `Doc` | `Doc` object with a normalised `NORM` attribute. |

Source code in `edsnlp/pipelines/core/normalizer/normalizer.py`
## factory

- `DEFAULT_CONFIG = dict(accents=True, lowercase=True, quotes=True, pollution=True)` *module-attribute*

### `create_component(nlp, name, accents, lowercase, quotes, pollution)`

Source code in `edsnlp/pipelines/core/normalizer/factory.py`
## pollution

### patterns

- `information = "(?s)(=====+\\s*)?(L\\s*e\\s*s\\sdonnées\\s*administratives,\\s*sociales\\s*|I?nfo\\s*rmation\\s*aux?\\s*patients?|L[’']AP-HP\\s*collecte\\s*vos\\s*données\\s*administratives|L[’']Assistance\\s*Publique\\s*-\\s*Hôpitaux\\s*de\\s*Paris\\s*\\(?AP-HP\\)?\\s*a\\s*créé\\s*une\\s*base\\s*de\\s*données).{,2000}https?:\\/\\/recherche\\.aphp\\.fr\\/eds\\/droit-opposition[\\s\\.]*"` *module-attribute*
- `bars = '(?i)([nbw]|_|-|=){5,}'` *module-attribute*
- `pollution = dict(information=information, bars=bars)` *module-attribute*
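The `bars` pattern can be exercised directly with Python's `re` module; a minimal sketch:

```python
import re

# `bars` pattern from the patterns module above: five or more repeated
# separator-like characters (n, b, w, underscore, hyphen, equals sign).
BARS = r"(?i)([nbw]|_|-|=){5,}"

text = "Conclusion ===== NBNBNBNB suivi en ville"

# Use group(0) for the full run; group(1) would only return the last
# repeated character, since the group matches one character per repetition.
spans = [m.group(0) for m in re.finditer(BARS, text)]
print(spans)  # ['=====', 'NBNBNBNB']
```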
### pollution

#### Pollution

Bases: `BaseComponent`

Tags pollution tokens.

Populates a number of spaCy extensions:

- `Token._.pollution`: indicates whether the token is a pollution
- `Doc._.clean`: lists non-pollution tokens
- `Doc._.clean_`: original text with pollutions removed
- `Doc._.char_clean_span`: method to create a `Span` using character indices extracted from the cleaned text
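The index bookkeeping behind `Doc._.char_clean_span` can be illustrated with a small sketch (a hypothetical helper, not the edsnlp implementation): remove pollution character spans and keep a map from cleaned indices back to original ones.

```python
# Hypothetical sketch: remove pollution character spans from a text while
# keeping a map from cleaned character indices back to the original ones.
def clean_with_mapping(text, pollution_spans):
    polluted = set()
    for start, end in pollution_spans:
        polluted.update(range(start, end))
    clean_chars, mapping = [], []
    for i, char in enumerate(text):
        if i not in polluted:
            clean_chars.append(char)
            mapping.append(i)  # mapping[j] = original index of clean char j
    return "".join(clean_chars), mapping

text = "Patient suivi ===== en ville"
clean, mapping = clean_with_mapping(text, [(14, 20)])
print(clean)        # Patient suivi en ville
print(mapping[14])  # 20: the 'e' of 'en' in the original text
```

A span found at characters `(i, j)` of the cleaned text then corresponds to `(mapping[i], mapping[j - 1] + 1)` in the original document.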
| Parameter | Type | Description |
|---|---|---|
| `nlp` | `Language` | Language pipeline object. |
| `pollution` | `Dict[str, str]` | Dictionary containing regular expressions of pollution. |

Source code in `edsnlp/pipelines/core/normalizer/pollution/pollution.py`
- `nlp = nlp` *instance-attribute*
- `pollution = pollution` *instance-attribute*
- `regex_matcher = RegexMatcher()` *instance-attribute*
##### `__init__(nlp, pollution)`

Source code in `edsnlp/pipelines/core/normalizer/pollution/pollution.py`
##### `build_patterns()`

Builds the patterns for phrase matching.

Source code in `edsnlp/pipelines/core/normalizer/pollution/pollution.py`
##### `process(doc)`

Find pollutions within the document.

| Parameter | Type | Description |
|---|---|---|
| `doc` | `Doc` | spaCy `Doc` object. |

| Returns | Description |
|---|---|
| `pollution` | List of pollution spans. |

Source code in `edsnlp/pipelines/core/normalizer/pollution/pollution.py`
##### `__call__(doc)`

Tags pollutions.

| Parameter | Type | Description |
|---|---|---|
| `doc` | `Doc` | spaCy `Doc` object. |

| Returns | Description |
|---|---|
| `doc` | spaCy `Doc` object, annotated for pollutions. |

Source code in `edsnlp/pipelines/core/normalizer/pollution/pollution.py`
### factory

- `DEFAULT_CONFIG = dict(pollution=None)` *module-attribute*

#### `create_component(nlp, name, pollution)`

Source code in `edsnlp/pipelines/core/normalizer/pollution/factory.py`
## accents

### patterns

- `accents: List[Tuple[str, str]] = [('ç', 'c'), ('àáâä', 'a'), ('èéêë', 'e'), ('ìíîï', 'i'), ('òóôö', 'o'), ('ùúûü', 'u')]` *module-attribute*
### accents

#### Accents

Bases: `object`

Normalises accents, using a same-length strategy.

| Parameter | Type | Description |
|---|---|---|
| `accents` | `List[Tuple[str, str]]` | List of accentuated characters and their transcription. |

Source code in `edsnlp/pipelines/core/normalizer/accents/accents.py`
- `translation_table = str.maketrans(''.join(accent_group for (accent_group, _) in accents), ''.join(rep * len(accent_group) for (accent_group, rep) in accents))` *instance-attribute*
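The same-length strategy can be sketched with plain strings: because every accented character maps to exactly one replacement character, the translated text keeps its length, and character offsets are preserved.

```python
# Pattern list mirroring the `accents` module attribute documented above.
accents = [
    ("ç", "c"),
    ("àáâä", "a"),
    ("èéêë", "e"),
    ("ìíîï", "i"),
    ("òóôö", "o"),
    ("ùúûü", "u"),
]

# Build a 1:1 translation table: every character in a group maps to its
# single unaccented counterpart, so lengths are unchanged.
translation_table = str.maketrans(
    "".join(group for group, _ in accents),
    "".join(rep * len(group) for group, rep in accents),
)

text = "hôpital général, numéro çà"
normalised = text.translate(translation_table)
print(normalised)  # hopital general, numero ca
```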
##### `__init__(accents)`

Source code in `edsnlp/pipelines/core/normalizer/accents/accents.py`
##### `__call__(doc)`

Remove accents from the spaCy `NORM` attribute.

| Parameter | Type | Description |
|---|---|---|
| `doc` | `Doc` | The spaCy `Doc` object. |

| Returns | Description |
|---|---|
| `Doc` | The document, with accents removed in the `NORM` attribute. |

Source code in `edsnlp/pipelines/core/normalizer/accents/accents.py`
### factory

- `DEFAULT_CONFIG = dict(accents=None)` *module-attribute*

#### `create_component(nlp, name, accents)`

Source code in `edsnlp/pipelines/core/normalizer/accents/factory.py`
## lowercase

### factory

#### `remove_lowercase(doc)`

Adds case back on the `NORM` custom attribute (spaCy's default `NORM` is lowercased). Should always be applied first.

| Parameter | Type | Description |
|---|---|---|
| `doc` | `Doc` | The spaCy `Doc` object. |

| Returns | Description |
|---|---|
| `Doc` | The document, with case put back in the `NORM` attribute. |

Source code in `edsnlp/pipelines/core/normalizer/lowercase/factory.py`
## quotes

### quotes

#### Quotes

Bases: `object`

We normalise quotes, following [this source](https://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html).
| Parameter | Type | Description |
|---|---|---|
| `quotes` | `List[Tuple[str, str]]` | List of quotation characters and their transcription. |

Source code in `edsnlp/pipelines/core/normalizer/quotes/quotes.py`
- `translation_table = str.maketrans(''.join(quote_group for (quote_group, _) in quotes), ''.join(rep * len(quote_group) for (quote_group, rep) in quotes))` *instance-attribute*
##### `__init__(quotes)`

Source code in `edsnlp/pipelines/core/normalizer/quotes/quotes.py`
##### `__call__(doc)`

Normalises quotes.

| Parameter | Type | Description |
|---|---|---|
| `doc` | `Doc` | Document to process. |

| Returns | Description |
|---|---|
| `Doc` | Same document, with quotes normalised. |

Source code in `edsnlp/pipelines/core/normalizer/quotes/quotes.py`
### patterns

- `quotes: List[str] = ['"', '〃', 'ײ', '᳓', '″', '״', '‶', '˶', 'ʺ', '“', '”', '˝', '‟']` *module-attribute*
- ``apostrophes: List[str] = ['`', '΄', ''', 'ˈ', 'ˊ', 'ᑊ', 'ˋ', 'ꞌ', 'ᛌ', '𖽒', '𖽑', '‘', '’', 'י', '՚', '‛', '՝', '`', '`', '′', '׳', '´', 'ʹ', '˴', 'ߴ', '‵', 'ߵ', 'ʹ', 'ʻ', 'ʼ', '´', '᾽', 'ʽ', '῾', 'ʾ', '᾿']`` *module-attribute*
- `quotes_and_apostrophes: List[Tuple[str, str]] = [(''.join(quotes), '"'), (''.join(apostrophes), "'")]` *module-attribute*
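The patterns above feed the same fixed-length `str.maketrans` strategy as the accents component: every Unicode quote variant maps to `"` and every apostrophe variant to `'`. A sketch with shortened character lists (the full lists are the module attributes above):

```python
# Shortened versions of the `quotes` and `apostrophes` module attributes.
quotes = ['"', '“', '”', '˝', '‟']
apostrophes = ['`', '‘', '’', '‛', '´']

quotes_and_apostrophes = [
    ("".join(quotes), '"'),
    ("".join(apostrophes), "'"),
]

# 1:1 table: each variant maps to a single ASCII character, preserving
# text length and character offsets.
translation_table = str.maketrans(
    "".join(group for group, _ in quotes_and_apostrophes),
    "".join(rep * len(group) for group, rep in quotes_and_apostrophes),
)

text = "Il dit : “l’état général est bon”"
print(text.translate(translation_table))  # Il dit : "l'état général est bon"
```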
### factory

- `DEFAULT_CONFIG = dict(quotes=None)` *module-attribute*

#### `create_component(nlp, name, quotes)`

Source code in `edsnlp/pipelines/core/normalizer/quotes/factory.py`