
edsnlp.pipelines.core.normalizer.normalizer

Normalizer

Bases: object

Normalisation pipeline. Modifies the NORM attribute, acting on four dimensions:

  • lowercase: using the default NORM
  • accents: deterministic and fixed-length normalisation of accents.
  • quotes: deterministic and fixed-length normalisation of quotation marks.
  • pollution: removal of pollutions.
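A minimal sketch of what "deterministic and fixed-length" can mean in practice: every character is replaced by exactly one character, so offsets in the normalised string still line up with the original text. The mapping tables and the `normalise` helper below are hypothetical illustrations, not the library's actual tables, and pollution removal is left out.

```python
# Hypothetical one-to-one character tables (assumption: not the library's own).
ACCENTS = str.maketrans("àâäçéèêëîïôöùûü", "aaaceeeeiioouuu")
QUOTES = str.maketrans({"«": '"', "»": '"', "“": '"', "”": '"', "‘": "'", "’": "'"})


def normalise(text: str, lowercase: bool = True) -> str:
    """Return a normalised copy of `text` with the same length as the input."""
    norm = text.lower() if lowercase else text
    norm = norm.translate(ACCENTS)  # one accented char -> one plain char
    norm = norm.translate(QUOTES)   # one quotation mark -> one ASCII quote
    return norm


sample = "Le patient a été «admis» à l’hôpital"
result = normalise(sample)

# Character positions are preserved because every substitution is 1-to-1.
assert len(result) == len(sample)
print(result)
```

Keeping the length fixed means a match found on the normalised text can be mapped back to the original characters without an alignment table.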
PARAMETER DESCRIPTION
lowercase

Whether to remove case, that is, keep the lowercased NORM.

TYPE: bool

accents

Optional Accents object.

TYPE: Optional[Accents]

quotes

Optional Quotes object.

TYPE: Optional[Quotes]

pollution

Optional Pollution object.

TYPE: Optional[Pollution]

Source code in edsnlp/pipelines/core/normalizer/normalizer.py
class Normalizer(object):
    """
    Normalisation pipeline. Modifies the `NORM` attribute,
    acting on four dimensions:

    - `lowercase`: using the default `NORM`
    - `accents`: deterministic and fixed-length normalisation of accents.
    - `quotes`: deterministic and fixed-length normalisation of quotation marks.
    - `pollution`: removal of pollutions.

    Parameters
    ----------
    lowercase : bool
        Whether to remove case.
    accents : Optional[Accents]
        Optional `Accents` object.
    quotes : Optional[Quotes]
        Optional `Quotes` object.
    pollution : Optional[Pollution]
        Optional `Pollution` object.
    """

    def __init__(
        self,
        lowercase: bool,
        accents: Optional[Accents],
        quotes: Optional[Quotes],
        pollution: Optional[Pollution],
    ):
        self.lowercase = lowercase
        self.accents = accents
        self.quotes = quotes
        self.pollution = pollution

    def __call__(self, doc: Doc) -> Doc:
        """
        Apply the normalisation pipeline, one component at a time.

        Parameters
        ----------
        doc : Doc
            spaCy `Doc` object

        Returns
        -------
        Doc
            Doc object with `NORM` attribute modified
        """
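        # The default `NORM` is already lowercased, so when `lowercase` is
        # False the case normalisation has to be undone first.
        if not self.lowercase:
            remove_lowercase(doc)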
        if self.accents is not None:
            self.accents(doc)
        if self.quotes is not None:
            self.quotes(doc)
        if self.pollution is not None:
            self.pollution(doc)

        return doc

lowercase = lowercase instance-attribute

accents = accents instance-attribute

quotes = quotes instance-attribute

pollution = pollution instance-attribute

__init__(lowercase, accents, quotes, pollution)

Source code in edsnlp/pipelines/core/normalizer/normalizer.py
def __init__(
    self,
    lowercase: bool,
    accents: Optional[Accents],
    quotes: Optional[Quotes],
    pollution: Optional[Pollution],
):
    self.lowercase = lowercase
    self.accents = accents
    self.quotes = quotes
    self.pollution = pollution

__call__(doc)

Apply the normalisation pipeline, one component at a time.

PARAMETER DESCRIPTION
doc

spaCy Doc object

TYPE: Doc

RETURNS DESCRIPTION
Doc

Doc object with NORM attribute modified
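The "one component at a time" behaviour can be sketched without spaCy. Everything below (`ToyToken`, `ToyNormalizer`, `restore_case`, `strip_accents`) is a hypothetical stand-in written for illustration only; it assumes, as the description above suggests, that the default norm is the lowercased text and that each optional component is a callable that edits the norm in place.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyToken:
    """Stand-in for a spaCy token: `norm_` starts as the lowercased text."""
    text: str
    norm_: str = ""

    def __post_init__(self) -> None:
        self.norm_ = self.text.lower()


ToyDoc = List[ToyToken]
Step = Callable[[ToyDoc], ToyDoc]


def restore_case(doc: ToyDoc) -> ToyDoc:
    """Undo the default lowercasing by copying the raw text into the norm."""
    for token in doc:
        token.norm_ = token.text
    return doc


def strip_accents(doc: ToyDoc) -> ToyDoc:
    """Toy accent component with a tiny, hypothetical mapping table."""
    table = str.maketrans("éèà", "eea")
    for token in doc:
        token.norm_ = token.norm_.translate(table)
    return doc


class ToyNormalizer:
    def __init__(
        self,
        lowercase: bool,
        accents: Optional[Step] = None,
        quotes: Optional[Step] = None,
        pollution: Optional[Step] = None,
    ) -> None:
        self.lowercase = lowercase
        self.steps = (accents, quotes, pollution)

    def __call__(self, doc: ToyDoc) -> ToyDoc:
        if not self.lowercase:
            restore_case(doc)
        # Apply each configured component in order; None means "skip".
        for step in self.steps:
            if step is not None:
                step(doc)
        return doc


lowered = ToyNormalizer(lowercase=True, accents=strip_accents)([ToyToken("Été")])
cased = ToyNormalizer(lowercase=False, accents=strip_accents)([ToyToken("Été")])
print(lowered[0].norm_, cased[0].norm_)
```

Passing `None` for a component simply skips that step, which mirrors the `is not None` checks in `__call__`.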

Source code in edsnlp/pipelines/core/normalizer/normalizer.py
def __call__(self, doc: Doc) -> Doc:
    """
    Apply the normalisation pipeline, one component at a time.

    Parameters
    ----------
    doc : Doc
        spaCy `Doc` object

    Returns
    -------
    Doc
        Doc object with `NORM` attribute modified
    """
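    # The default `NORM` is already lowercased, so when `lowercase` is
    # False the case normalisation has to be undone first.
    if not self.lowercase:
        remove_lowercase(doc)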
    if self.accents is not None:
        self.accents(doc)
    if self.quotes is not None:
        self.quotes(doc)
    if self.pollution is not None:
        self.pollution(doc)

    return doc