
edsnlp.matchers

phrase


EDSPhraseMatcher

Bases: PhraseMatcher

PhraseMatcher that allows skipping excluded tokens. Adapted from https://github.com/explosion/spaCy/blob/master/spacy/matcher/phrasematcher.pyx

PARAMETER DESCRIPTION
vocab

spaCy vocabulary to match on.

TYPE: Vocab

attr

Default attribute to match on, by default "TEXT". Can be overridden in the add method. To match on a custom attribute, prepend the attribute name with _.

TYPE: str

ignore_excluded

Whether to ignore excluded tokens, by default True

TYPE: bool, optional

ignore_space_tokens

Whether to exclude tokens that have a "SPACE" tag, by default False

TYPE: bool, optional
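A minimal usage sketch, assuming edsnlp's custom token extensions are registered (see `set_extensions` below); the pipeline, label and terms are illustrative:

```python
import spacy

from edsnlp.matchers.phrase import EDSPhraseMatcher

nlp = spacy.blank("fr")

matcher = EDSPhraseMatcher(nlp.vocab, attr="TEXT")
matcher.build_patterns(nlp, {"fever": ["fièvre", "hyperthermie"]})

doc = nlp("Le patient présente une fièvre aiguë.")
matches = list(matcher(doc, as_spans=True))  # labelled Span objects
```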

vocab = <attribute 'vocab' of 'spacy.matcher.phrasematcher.PhraseMatcher' objects>
__call__() method descriptor

Find all sequences matching the supplied patterns on the Doc.

doclike (Doc or Span): The document to match over.
as_spans (bool): Return Span objects with labels instead of (match_id, start, end) tuples.

RETURNS (list): A list of (match_id, start, end) tuples describing the matches. A match tuple describes a span doc[start:end]. The match_id is an integer. If as_spans is set to True, a list of Span objects is returned.

DOCS: https://spacy.io/api/phrasematcher#call

__contains__() method descriptor

Check whether the matcher contains rules for a match ID.

key (str): The match ID.

RETURNS (bool): Whether the matcher contains rules for this match ID.

DOCS: https://spacy.io/api/phrasematcher#contains


__init__() method descriptor

Initialize the PhraseMatcher.

vocab (Vocab): The shared vocabulary.
attr (int / str): Token attribute to match on.
validate (bool): Perform additional validation when patterns are added.

DOCS: https://spacy.io/api/phrasematcher#init


__len__() method descriptor

Get the number of match IDs added to the matcher.

RETURNS (int): The number of rules.

DOCS: https://spacy.io/api/phrasematcher#len


_convert_to_array() method descriptor

PhraseMatcher._convert_to_array(self, Doc doc)

add() method descriptor

PhraseMatcher.add(self, key, docs, *_docs, on_match=None)

Add a match-rule to the phrase-matcher. A match-rule consists of an ID key, an on_match callback, and one or more patterns.

    Since spaCy v2.2.2, PhraseMatcher.add takes a list of patterns as the
    second argument, with the on_match callback as an optional keyword
    argument.

    key (str): The match ID.
    docs (list): List of `Doc` objects representing match patterns.
    on_match (callable): Callback executed on match.
    *_docs (Doc): For backwards compatibility: list of patterns to add
        as variable arguments. Will be ignored if a list of patterns is
        provided as the second argument.

    DOCS: https://spacy.io/api/phrasematcher#add
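
Following the spaCy v2.2.2+ calling convention described above, a short sketch (label and texts illustrative, reusing `nlp` and `matcher` from the example above):

```python
patterns = [nlp.make_doc(text) for text in ["fièvre", "hyperthermie"]]
matcher.add("fever", patterns)
```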
build_patterns() method descriptor

Build patterns and add them for matching. Helper function for pipelines using this matcher.

PARAMETER DESCRIPTION
nlp

The instance of the spaCy language class.

TYPE: Language

terms

Dictionary of label/terms, or label/dictionary of terms/attribute.

TYPE: Patterns
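For instance, with a plain label-to-terms mapping (labels and terms illustrative, reusing `nlp` and the matcher from the sketch above):

```python
matcher.build_patterns(
    nlp,
    {
        "covid": ["covid", "coronavirus"],
        "fever": ["fièvre", "hyperthermie"],
    },
)
```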

pipe() method descriptor

PhraseMatcher.pipe(self, stream, batch_size=1000, return_matches=False, as_tuples=False)

Match a stream of documents, yielding them in turn. Deprecated as of spaCy v3.0.

remove() method descriptor

PhraseMatcher.remove(self, key)

Remove a rule from the matcher by match ID. A KeyError is raised if the key does not exist.

    key (str): The match ID.

    DOCS: https://spacy.io/api/phrasematcher#remove
set_extensions() classmethod

get_normalized_variant() builtin

regex

RegexMatcher

Bases: object

Simple RegExp matcher.

PARAMETER DESCRIPTION
alignment_mode

How spans should be aligned with tokens. Possible values are "strict" (character indices must be aligned with token boundaries), "contract" (span of all tokens completely within the character span), "expand" (span of all tokens at least partially covered by the character span). Defaults to "expand".

TYPE: str

attr

Default attribute to match on, by default "TEXT". Can be overridden in the add method.

TYPE: str

ignore_excluded

Whether to skip exclusions

TYPE: bool
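A minimal, self-contained sketch (pattern, label and text are illustrative):

```python
import spacy

from edsnlp.matchers.regex import RegexMatcher

nlp = spacy.blank("fr")
doc = nlp("Consultation du 03/10/2022.")

matcher = RegexMatcher(attr="TEXT", alignment_mode="expand")
matcher.add("date", [r"\d{2}/\d{2}/\d{4}"])

for span in matcher(doc, as_spans=True):
    print(span.label_, span.text)  # prints the label and the matched text
```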

Source code in edsnlp/matchers/regex.py
class RegexMatcher(object):
    """
    Simple RegExp matcher.

    Parameters
    ----------
    alignment_mode : str
        How spans should be aligned with tokens.
        Possible values are `"strict"` (character indices must be aligned
        with token boundaries), `"contract"` (span of all tokens completely
        within the character span), `"expand"` (span of all tokens at least
        partially covered by the character span).
        Defaults to `"expand"`.
    attr : str
        Default attribute to match on, by default "TEXT".
        Can be overridden in the `add` method.
    ignore_excluded : bool
        Whether to skip exclusions
    """

    def __init__(
        self,
        alignment_mode: str = "expand",
        attr: str = "TEXT",
        ignore_excluded: bool = False,
    ):
        self.alignment_mode = alignment_mode
        self.regex = []

        self.default_attr = attr

        self.ignore_excluded = ignore_excluded

    def build_patterns(self, regex: Patterns):
        """
        Build patterns and add them for matching.
        Helper function for pipelines using this matcher.

        Parameters
        ----------
        regex : Patterns
            Dictionary of label/terms, or label/dictionary of terms/attribute.
        """
        if not regex:
            regex = dict()

        for key, patterns in regex.items():
            if isinstance(patterns, dict):
                attr = patterns.get("attr")
                alignment_mode = patterns.get("alignment_mode")
                patterns = patterns.get("regex")
            else:
                attr = None
                alignment_mode = None

            if isinstance(patterns, str):
                patterns = [patterns]

            self.add(
                key=key, patterns=patterns, attr=attr, alignment_mode=alignment_mode
            )

    def add(
        self,
        key: str,
        patterns: List[str],
        attr: Optional[str] = None,
        ignore_excluded: Optional[bool] = None,
        alignment_mode: Optional[str] = None,
    ):
        """
        Add a pattern to the registry.

        Parameters
        ----------
        key : str
            Key of the new/updated pattern.
        patterns : List[str]
            List of patterns to add.
        attr : str, optional
            Attribute to use for matching.
            By default uses the `default_attr` attribute
        ignore_excluded : bool, optional
            Whether to skip excluded tokens during matching.
        alignment_mode : str, optional
            Overwrite alignment mode.
        """

        if attr is None:
            attr = self.default_attr

        if ignore_excluded is None:
            ignore_excluded = self.ignore_excluded

        if alignment_mode is None:
            alignment_mode = self.alignment_mode

        patterns = [compile_regex(pattern) for pattern in patterns]

        self.regex.append((key, patterns, attr, ignore_excluded, alignment_mode))

    def remove(
        self,
        key: str,
    ):
        """
        Remove a pattern from the registry.

        Parameters
        ----------
        key : str
            Key of the pattern to remove.

        Raises
        ------
        ValueError
            If the key is not present in the registered patterns.
        """
        n = len(self.regex)
        self.regex = [(k, p, a, i, am) for k, p, a, i, am in self.regex if k != key]
        if len(self.regex) == n:
            raise ValueError(f"`{key}` is not referenced in the matcher")

    def __len__(self):
        return len(set([regex[0] for regex in self.regex]))

    def match(
        self,
        doclike: Union[Doc, Span],
    ) -> Tuple[Span, re.Match]:
        """
        Iterates over the matches.

        Parameters
        ----------
        doclike:
            spaCy Doc or Span object to match on.

        Yields
        -------
        span:
            A match.
        """

        for key, patterns, attr, ignore_excluded, alignment_mode in self.regex:
            text = get_text(doclike, attr, ignore_excluded)

            for pattern in patterns:
                for match in pattern.finditer(text):
                    logger.trace(f"Matched a regex from {key}: {repr(match.group())}")

                    span = create_span(
                        doclike=doclike,
                        start_char=match.start(),
                        end_char=match.end(),
                        key=key,
                        attr=attr,
                        alignment_mode=alignment_mode,
                        ignore_excluded=ignore_excluded,
                    )

                    if span is None:
                        continue

                    yield span, match

    def __call__(
        self,
        doclike: Union[Doc, Span],
        as_spans=False,
        return_groupdict=False,
    ) -> Union[Span, Tuple[Span, Dict[str, Any]]]:
        """
        Performs matching. Yields matches.

        Parameters
        ----------
        doclike:
            spaCy Doc or Span object.
        as_spans:
            Returns matches as spans.

        Yields
        ------
        span:
            A match.
        groupdict:
            Additional information coming from the named patterns
            in the regular expression.
        """
        for span, match in self.match(doclike):
            if not as_spans:
                offset = doclike[0].i
                span = (span.label, span.start - offset, span.end - offset)
            if return_groupdict:
                yield span, match.groupdict()
            else:
                yield span
alignment_mode = alignment_mode instance-attribute
regex = [] instance-attribute
default_attr = attr instance-attribute
ignore_excluded = ignore_excluded instance-attribute
__init__(alignment_mode='expand', attr='TEXT', ignore_excluded=False)
Source code in edsnlp/matchers/regex.py
def __init__(
    self,
    alignment_mode: str = "expand",
    attr: str = "TEXT",
    ignore_excluded: bool = False,
):
    self.alignment_mode = alignment_mode
    self.regex = []

    self.default_attr = attr

    self.ignore_excluded = ignore_excluded
build_patterns(regex)

Build patterns and add them for matching. Helper function for pipelines using this matcher.

PARAMETER DESCRIPTION
regex

Dictionary of label/terms, or label/dictionary of terms/attribute.

TYPE: Patterns

Source code in edsnlp/matchers/regex.py
def build_patterns(self, regex: Patterns):
    """
    Build patterns and add them for matching.
    Helper function for pipelines using this matcher.

    Parameters
    ----------
    regex : Patterns
        Dictionary of label/terms, or label/dictionary of terms/attribute.
    """
    if not regex:
        regex = dict()

    for key, patterns in regex.items():
        if isinstance(patterns, dict):
            attr = patterns.get("attr")
            alignment_mode = patterns.get("alignment_mode")
            patterns = patterns.get("regex")
        else:
            attr = None
            alignment_mode = None

        if isinstance(patterns, str):
            patterns = [patterns]

        self.add(
            key=key, patterns=patterns, attr=attr, alignment_mode=alignment_mode
        )
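
As the source above shows, each value may be a single string, a list of strings, or a dictionary carrying `regex`, `attr` and `alignment_mode` keys; for example (labels and patterns illustrative):

```python
matcher.build_patterns(
    {
        # a single string is wrapped into a list
        "date": r"\d{2}/\d{2}/\d{4}",
        # a dictionary can override attr and alignment_mode per label
        "year": {
            "regex": [r"\b\d{4}\b"],
            "attr": "TEXT",
            "alignment_mode": "strict",
        },
    }
)
```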
add(key, patterns, attr=None, ignore_excluded=None, alignment_mode=None)

Add a pattern to the registry.

PARAMETER DESCRIPTION
key

Key of the new/updated pattern.

TYPE: str

patterns

List of patterns to add.

TYPE: List[str]

attr

Attribute to use for matching. By default uses the default_attr attribute

TYPE: str, optional DEFAULT: None

ignore_excluded

Whether to skip excluded tokens during matching.

TYPE: bool, optional DEFAULT: None

alignment_mode

Overwrite alignment mode.

TYPE: str, optional DEFAULT: None

Source code in edsnlp/matchers/regex.py
def add(
    self,
    key: str,
    patterns: List[str],
    attr: Optional[str] = None,
    ignore_excluded: Optional[bool] = None,
    alignment_mode: Optional[str] = None,
):
    """
    Add a pattern to the registry.

    Parameters
    ----------
    key : str
        Key of the new/updated pattern.
    patterns : List[str]
        List of patterns to add.
    attr : str, optional
        Attribute to use for matching.
        By default uses the `default_attr` attribute
    ignore_excluded : bool, optional
        Whether to skip excluded tokens during matching.
    alignment_mode : str, optional
        Overwrite alignment mode.
    """

    if attr is None:
        attr = self.default_attr

    if ignore_excluded is None:
        ignore_excluded = self.ignore_excluded

    if alignment_mode is None:
        alignment_mode = self.alignment_mode

    patterns = [compile_regex(pattern) for pattern in patterns]

    self.regex.append((key, patterns, attr, ignore_excluded, alignment_mode))
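
Per-call arguments override the matcher-level defaults; a sketch with an illustrative pattern:

```python
matcher.add(
    "measurement",
    [r"\d+ ?cm"],
    attr="LOWER",             # overrides default_attr for this key only
    alignment_mode="strict",  # overrides the matcher-level alignment mode
)
```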
remove(key)

Remove a pattern from the registry.

PARAMETER DESCRIPTION
key

Key of the pattern to remove.

TYPE: str

RAISES DESCRIPTION
ValueError

If the key is not present in the registered patterns.

Source code in edsnlp/matchers/regex.py
def remove(
    self,
    key: str,
):
    """
    Remove a pattern from the registry.

    Parameters
    ----------
    key : str
        Key of the pattern to remove.

    Raises
    ------
    ValueError
        If the key is not present in the registered patterns.
    """
    n = len(self.regex)
    self.regex = [(k, p, a, i, am) for k, p, a, i, am in self.regex if k != key]
    if len(self.regex) == n:
        raise ValueError(f"`{key}` is not referenced in the matcher")
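
For example, reusing the matcher built above:

```python
matcher.remove("date")      # drops every pattern registered under "date"
# matcher.remove("unknown") # would raise ValueError
```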
__len__()
Source code in edsnlp/matchers/regex.py
def __len__(self):
    return len(set([regex[0] for regex in self.regex]))
match(doclike)

Iterates over the matches.

PARAMETER DESCRIPTION
doclike

spaCy Doc or Span object to match on.

TYPE: Union[Doc, Span]

YIELDS DESCRIPTION
span

A match.

Source code in edsnlp/matchers/regex.py
def match(
    self,
    doclike: Union[Doc, Span],
) -> Tuple[Span, re.Match]:
    """
    Iterates over the matches.

    Parameters
    ----------
    doclike:
        spaCy Doc or Span object to match on.

    Yields
    -------
    span:
        A match.
    """

    for key, patterns, attr, ignore_excluded, alignment_mode in self.regex:
        text = get_text(doclike, attr, ignore_excluded)

        for pattern in patterns:
            for match in pattern.finditer(text):
                logger.trace(f"Matched a regex from {key}: {repr(match.group())}")

                span = create_span(
                    doclike=doclike,
                    start_char=match.start(),
                    end_char=match.end(),
                    key=key,
                    attr=attr,
                    alignment_mode=alignment_mode,
                    ignore_excluded=ignore_excluded,
                )

                if span is None:
                    continue

                yield span, match
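
A sketch of consuming the generator, which yields (span, re.Match) pairs (reusing `doc` and the matcher from above):

```python
for span, match in matcher.match(doc):
    print(span.label_, span.text, match.group())
```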
__call__(doclike, as_spans=False, return_groupdict=False)

Performs matching. Yields matches.

PARAMETER DESCRIPTION
doclike

spaCy Doc or Span object.

TYPE: Union[Doc, Span]

as_spans

Returns matches as spans.

DEFAULT: False

YIELDS DESCRIPTION
span

A match.

groupdict

Additional information coming from the named patterns in the regular expression.

Source code in edsnlp/matchers/regex.py
def __call__(
    self,
    doclike: Union[Doc, Span],
    as_spans=False,
    return_groupdict=False,
) -> Union[Span, Tuple[Span, Dict[str, Any]]]:
    """
    Performs matching. Yields matches.

    Parameters
    ----------
    doclike:
        spaCy Doc or Span object.
    as_spans:
        Returns matches as spans.

    Yields
    ------
    span:
        A match.
    groupdict:
        Additional information coming from the named patterns
        in the regular expression.
    """
    for span, match in self.match(doclike):
        if not as_spans:
            offset = doclike[0].i
            span = (span.label, span.start - offset, span.end - offset)
        if return_groupdict:
            yield span, match.groupdict()
        else:
            yield span
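
Named groups in a pattern are surfaced through `return_groupdict`; a sketch with an illustrative dose pattern (`doc` as above):

```python
matcher = RegexMatcher()
matcher.add("dose", [r"(?P<value>\d+) ?(?P<unit>mg|g)"])

for span, groupdict in matcher(doc, as_spans=True, return_groupdict=True):
    print(span.text, groupdict["value"], groupdict["unit"])
```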

get_first_included(doclike)

Source code in edsnlp/matchers/regex.py
@lru_cache(32)
def get_first_included(doclike: Union[Doc, Span]) -> Token:
    for token in doclike:
        if not token._.excluded:
            return token
    raise IndexError("The provided Span does not include any token")

create_span(doclike, start_char, end_char, key, attr, alignment_mode, ignore_excluded)

spaCy only allows the strict alignment mode for char_span on Span objects. This function works around that limitation.

PARAMETER DESCRIPTION
doclike

Doc or Span.

TYPE: Union[Doc, Span]

start_char

Character index within the Doc-like object.

TYPE: int

end_char

Character index of the end, within the Doc-like object.

TYPE: int

key

The key used to match.

TYPE: str

attr

The attribute used for matching (e.g. "TEXT" or "NORM").

TYPE: str

alignment_mode

The alignment mode.

TYPE: str

ignore_excluded

Whether to skip excluded tokens.

TYPE: bool

RETURNS DESCRIPTION
span

A span matched on the Doc-like object.

Source code in edsnlp/matchers/regex.py
def create_span(
    doclike: Union[Doc, Span],
    start_char: int,
    end_char: int,
    key: str,
    attr: str,
    alignment_mode: str,
    ignore_excluded: bool,
) -> Span:
    """
    spaCy only allows strict alignment mode for char_span on Spans.
    This method circumvents this.

    Parameters
    ----------
    doclike : Union[Doc, Span]
        `Doc` or `Span`.
    start_char : int
        Character index within the Doc-like object.
    end_char : int
        Character index of the end, within the Doc-like object.
    key : str
        The key used to match.
    attr : str
        The attribute used for matching.
    alignment_mode : str
        The alignment mode.
    ignore_excluded : bool
        Whether to skip excluded tokens.

    Returns
    -------
    span:
        A span matched on the Doc-like object.
    """

    doc = doclike if isinstance(doclike, Doc) else doclike.doc

    # Handle the simple case immediately
    if attr in {"TEXT", "LOWER"} and not ignore_excluded:
        off = doclike[0].idx
        return doc.char_span(
            start_char + off,
            end_char + off,
            label=key,
            alignment_mode=alignment_mode,
        )

    # If doclike is a Span, we need to get the clean
    # index of the first included token
    if ignore_excluded:
        original, clean = alignment(
            doc=doc,
            attr=attr,
            ignore_excluded=ignore_excluded,
        )

        first_included = get_first_included(doclike)
        i = bisect_left(original, first_included.idx)
        first = clean[i]

    else:
        first = doclike[0].idx

    start_char = (
        first
        + start_char
        + offset(
            doc,
            attr=attr,
            ignore_excluded=ignore_excluded,
            index=first + start_char,
        )
    )

    end_char = (
        first
        + end_char
        + offset(
            doc,
            attr=attr,
            ignore_excluded=ignore_excluded,
            index=first + end_char,
        )
    )

    span = doc.char_span(
        start_char,
        end_char,
        label=key,
        alignment_mode=alignment_mode,
    )

    return span
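
A sketch of a direct call, with character indices relative to the Doc-like object (`doc` and the argument values are illustrative):

```python
span = create_span(
    doclike=doc,
    start_char=0,
    end_char=10,
    key="example",  # becomes the span label
    attr="NORM",
    alignment_mode="expand",
    ignore_excluded=True,
)
# char_span may return None when the characters cannot be aligned to tokens
```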

utils

ListOrStr = Union[List[str], str] module-attribute

DictOrPattern = Union[Dict[str, ListOrStr], ListOrStr] module-attribute

Patterns = Dict[str, DictOrPattern] module-attribute

ATTRIBUTES = {'LOWER': 'lower_', 'TEXT': 'text', 'NORM': 'norm_', 'SHAPE': 'shape_'} module-attribute

offset

token_length(token, custom, attr)
Source code in edsnlp/matchers/utils/offset.py
10
11
12
13
14
15
def token_length(token: Token, custom: bool, attr: str):
    if custom:
        text = getattr(token._, attr)
    else:
        text = getattr(token, attr)
    return len(text)
alignment(doc, attr='TEXT', ignore_excluded=True)

Align different representations of a Doc or Span object.

PARAMETER DESCRIPTION
doc

spaCy Doc or Span object

TYPE: Doc

attr

Attribute to use, by default "TEXT"

TYPE: str, optional DEFAULT: 'TEXT'

ignore_excluded

Whether to remove excluded tokens, by default True

TYPE: bool, optional DEFAULT: True

RETURNS DESCRIPTION
Tuple[List[int], List[int]]

An alignment tuple: original and clean lists.

Source code in edsnlp/matchers/utils/offset.py
@lru_cache(maxsize=32)
def alignment(
    doc: Doc,
    attr: str = "TEXT",
    ignore_excluded: bool = True,
) -> Tuple[List[int], List[int]]:
    """
    Align different representations of a `Doc` or `Span` object.

    Parameters
    ----------
    doc : Doc
        spaCy `Doc` or `Span` object
    attr : str, optional
        Attribute to use, by default `"TEXT"`
    ignore_excluded : bool, optional
        Whether to remove excluded tokens, by default True

    Returns
    -------
    Tuple[List[int], List[int]]
        An alignment tuple: original and clean lists.
    """
    assert isinstance(doc, Doc)

    attr = attr.upper()
    attr = ATTRIBUTES.get(attr, attr)

    custom = attr.startswith("_")

    if custom:
        attr = attr[1:].lower()

    # Define the length function
    length = partial(token_length, custom=custom, attr=attr)

    original = []
    clean = []

    cursor = 0

    for token in doc:

        if not ignore_excluded or not token._.excluded:

            # The token is not excluded, we add its extremities to the list
            original.append(token.idx)

            # We add the cursor
            clean.append(cursor)
            cursor += length(token)

            if token.whitespace_:
                cursor += 1

    return original, clean
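
A sketch of how the two lists relate (actual values depend on the document; `doc` as above):

```python
original, clean = alignment(doc, attr="NORM", ignore_excluded=True)

# original[i]: character index of the i-th kept token in the raw text
# clean[i]:    character index of the same token in the clean text
```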
offset(doc, attr, ignore_excluded, index)

Compute the offset between the original text and a given representation (defined by the pair attr, ignore_excluded).

The alignment itself is computed with alignment.

PARAMETER DESCRIPTION
doc

The spaCy Doc object

TYPE: Doc

attr

The attribute used by the RegexMatcher (e.g. NORM)

TYPE: str

ignore_excluded

Whether the RegexMatcher ignores excluded tokens.

TYPE: bool

index

The index in the pre-processed text.

TYPE: int

RETURNS DESCRIPTION
int

The offset. To get the character index in the original document, just do: original = index + offset(doc, attr, ignore_excluded, index)

Source code in edsnlp/matchers/utils/offset.py
def offset(
    doc: Doc,
    attr: str,
    ignore_excluded: bool,
    index: int,
) -> int:
    """
    Compute offset between the original text and a given representation
    (defined by the pair `attr`, `ignore_excluded`).

    The alignment itself is computed with
    [`alignment`][edsnlp.matchers.utils.offset.alignment].

    Parameters
    ----------
    doc : Doc
        The spaCy `Doc` object
    attr : str
        The attribute used by the [`RegexMatcher`][edsnlp.matchers.regex.RegexMatcher]
        (e.g. `NORM`)
    ignore_excluded : bool
        Whether the RegexMatcher ignores excluded tokens.
    index : int
        The index in the pre-processed text.

    Returns
    -------
    int
        The offset. To get the character index in the original document,
        just do: `#!python original = index + offset(doc, attr, ignore_excluded, index)`
    """
    original, clean = alignment(
        doc=doc,
        attr=attr,
        ignore_excluded=ignore_excluded,
    )

    # We use bisect to efficiently find the correct rightmost-lower index
    i = bisect_left(clean, index)
    i = min(i, len(original) - 1)

    return original[i] - clean[i]
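
Following the recipe from the docstring, a sketch mapping an index in the pre-processed text back to the original text (`doc` and the index are illustrative):

```python
clean_index = 42  # an index in the pre-processed text
original_index = clean_index + offset(
    doc,
    attr="NORM",
    ignore_excluded=True,
    index=clean_index,
)
```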

text

get_text(doclike, attr, ignore_excluded)

Get text using a custom attribute, possibly ignoring excluded tokens.

PARAMETER DESCRIPTION
doclike

Doc or Span to get text from.

TYPE: Union[Doc, Span]

attr

Attribute to use.

TYPE: str

ignore_excluded

Whether to skip excluded tokens, by default False

TYPE: bool

RETURNS DESCRIPTION
str

Extracted text.

Source code in edsnlp/matchers/utils/text.py
@lru_cache(32)
def get_text(
    doclike: Union[Doc, Span],
    attr: str,
    ignore_excluded: bool,
) -> str:
    """
    Get text using a custom attribute, possibly ignoring excluded tokens.

    Parameters
    ----------
    doclike : Union[Doc, Span]
        Doc or Span to get text from.
    attr : str
        Attribute to use.
    ignore_excluded : bool
        Whether to skip excluded tokens, by default False

    Returns
    -------
    str
        Extracted text.
    """

    attr = attr.upper()

    if not ignore_excluded:
        if attr == "TEXT":
            return doclike.text
        elif attr == "LOWER":
            return doclike.text.lower()
        else:
            tokens = doclike
    else:
        tokens = [t for t in doclike if not t._.excluded]

    attr = ATTRIBUTES.get(attr, attr)

    if attr.startswith("_"):
        attr = attr[1:].lower()
        return "".join([getattr(t._, attr) + t.whitespace_ for t in tokens])
    else:
        return "".join([getattr(t, attr) + t.whitespace_ for t in tokens])