edsnlp.pipelines.core.terminology.factory

create_component(nlp, label, terms, name='eds.terminology', attr='TEXT', regex=None, ignore_excluded=False, ignore_space_tokens=False, term_matcher='exact', term_matcher_config={})

Provides a terminology matching component.

The terminology matching component differs from the simple matcher component in that the regex and terms keys are used as spaCy's kb_id. All matched entities have the same label, defined in the top-level constructor (argument label).
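
To make the distinction concrete, here is a minimal standard-library sketch of the idea. It is not EDS-NLP's implementation: it only illustrates that every match carries the same label, while the dictionary key under which a pattern was declared is kept as the `kb_id`. The `Match` class, the `match_terminology` function and the sample patterns are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Match:
    start: int  # character offset where the match begins
    end: int  # character offset just past the match
    label: str  # shared by every match
    kb_id: str  # key of the pattern that produced the match


def match_terminology(
    text: str,
    label: str,
    terms: Optional[Dict[str, List[str]]] = None,
    regex: Optional[Dict[str, List[str]]] = None,
) -> List[Match]:
    """Return matches sorted by position (illustrative only)."""
    # Mirror the factory's requirement: at least one pattern source.
    if terms is None and regex is None:
        raise ValueError("Provide at least one of `terms` or `regex`.")

    matches = []
    # Terms are literal strings, so escape them before searching.
    for kb_id, expressions in (terms or {}).items():
        for expression in expressions:
            for hit in re.finditer(re.escape(expression), text):
                matches.append(Match(hit.start(), hit.end(), label, kb_id))
    # Regular expressions are used as written.
    for kb_id, patterns in (regex or {}).items():
        for pattern in patterns:
            for hit in re.finditer(pattern, text):
                matches.append(Match(hit.start(), hit.end(), label, kb_id))
    return sorted(matches, key=lambda match: (match.start, match.end))


text = "Patient with covid19 and suspected diabetes."
found = match_terminology(
    text,
    label="disease",
    terms={"covid": ["covid19", "coronavirus"]},
    regex={"diabetes": [r"diabet(?:es|ic)"]},
)
for match in found:
    print(text[match.start:match.end], match.label, match.kb_id)
```

Both matches share the label `disease`; only the `kb_id` (`covid` or `diabetes`) tells them apart.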

PARAMETER DESCRIPTION
nlp

The spaCy object.

TYPE: Language

name

The name of the component.

TYPE: str DEFAULT: 'eds.terminology'

label

Top-level label, assigned to every matched entity.

TYPE: str

terms

A dictionary of terms.

TYPE: Optional[Patterns]

regex

A dictionary of regular expressions.

TYPE: Optional[Patterns] DEFAULT: None

attr

The default attribute to use for matching. Can be overridden using the terms and regex configurations.

TYPE: str DEFAULT: 'TEXT'

ignore_excluded

Whether to skip excluded tokens (requires an upstream pipeline to mark excluded tokens).

TYPE: bool DEFAULT: False

ignore_space_tokens

Whether to skip space tokens during matching.

TYPE: bool DEFAULT: False

term_matcher

The matcher to use for phrase matching. One of exact or simstring.

TYPE: TerminologyTermMatcher DEFAULT: 'exact'

term_matcher_config

Parameters of the matcher class

TYPE: Dict[str, Any] DEFAULT: {}
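
The parameters above map onto the `config` argument of spaCy's `nlp.add_pipe`; `nlp` and `name` are supplied by spaCy itself. The following is a sketch that assumes `spacy` and `edsnlp` are installed. The pattern lists, the `disease` label and the `build_pipeline` helper are illustrative choices, not part of the library.

```python
# Illustrative patterns: each key becomes the kb_id of the entities it matches.
terms = {
    "covid": ["coronavirus", "covid19"],
    "flu": ["grippe"],
}
regex = {
    "covid": [r"covid[-\s]?19"],
}

# Keyword arguments of the factory, minus `nlp` and `name`.
config = dict(
    label="disease",  # shared by every matched entity
    terms=terms,
    regex=regex,
    attr="LOWER",  # match on the lowercased token text
    ignore_excluded=False,
    ignore_space_tokens=False,
    term_matcher="exact",
)


def build_pipeline():
    """Hypothetical helper: build a blank French pipeline with the component."""
    # Imported here so the config above can be inspected without the libraries.
    import spacy
    import edsnlp  # noqa: F401  (assumed to make the eds.* factories available)

    nlp = spacy.blank("fr")
    nlp.add_pipe("eds.terminology", config=config)
    return nlp
```

With such a pipeline, matched entities would be read from `doc.ents`, where `ent.label_` holds the shared label and `ent.kb_id_` the key of the pattern that matched.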

Source code in edsnlp/pipelines/core/terminology/factory.py
@Language.factory(
    "eds.terminology",
    default_config=DEFAULT_CONFIG,
    assigns=["doc.ents", "doc.spans"],
)
def create_component(
    nlp: Language,
    label: str,
    terms: Optional[Dict[str, Union[str, List[str]]]],
    name: str = "eds.terminology",
    attr: Union[str, Dict[str, str]] = "TEXT",
    regex: Optional[Dict[str, Union[str, List[str]]]] = None,
    ignore_excluded: bool = False,
    ignore_space_tokens: bool = False,
    term_matcher: TerminologyTermMatcher = "exact",
    term_matcher_config: Dict[str, Any] = {},
):
    """
    Provides a terminology matching component.

    The terminology matching component differs from the simple matcher component in that
    the `regex` and `terms` keys are used as spaCy's `kb_id`. All matched entities
    have the same label, defined in the top-level constructor (argument `label`).

    Parameters
    ----------
    nlp : Language
        The spaCy object.
    name: str
        The name of the component.
    label : str
        Top-level label, assigned to every matched entity.
    terms : Optional[Patterns]
        A dictionary of terms.
    regex : Optional[Patterns]
        A dictionary of regular expressions.
    attr : str
        The default attribute to use for matching.
        Can be overridden using the `terms` and `regex` configurations.
    ignore_excluded : bool
        Whether to skip excluded tokens (requires an upstream
        pipeline to mark excluded tokens).
    ignore_space_tokens: bool
        Whether to skip space tokens during matching.
    term_matcher : TerminologyTermMatcher
        The matcher to use for phrase matching.
        One of `exact` or `simstring`.
    term_matcher_config : Dict[str, Any]
        Parameters of the matcher class
    """
    # At least one source of patterns is required.
    assert not (
        terms is None and regex is None
    ), "Provide at least one of `terms` or `regex`."

    return TerminologyMatcher(
        nlp,
        label=label,
        terms=terms or dict(),
        attr=attr,
        regex=regex or dict(),
        ignore_excluded=ignore_excluded,
        ignore_space_tokens=ignore_space_tokens,
        term_matcher=term_matcher,
        term_matcher_config=term_matcher_config,
    )