
edsnlp.utils

inclusion

check_inclusion(span, start, end)

Checks whether the span overlaps the boundaries.

PARAMETER DESCRIPTION
span

Span to check.

TYPE: Span

start

Start of the boundary

TYPE: int

end

End of the boundary

TYPE: int

RETURNS DESCRIPTION
bool

Whether the span overlaps the boundaries.

Source code in edsnlp/utils/inclusion.py
def check_inclusion(span: Span, start: int, end: int) -> bool:
    """
    Checks whether the span overlaps the boundaries.

    Parameters
    ----------
    span : Span
        Span to check.
    start : int
        Start of the boundary
    end : int
        End of the boundary

    Returns
    -------
    bool
        Whether the span overlaps the boundaries.
    """

    if span.start >= end or span.end <= start:
        return False
    return True
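The boundary test can be reproduced with plain token offsets. The sketch below (bare integers standing in for the `Span.start`/`Span.end` attributes) mirrors the logic above:

```python
# Stand-in for check_inclusion using raw token offsets: a span
# [span_start, span_end) overlaps the boundary [start, end) unless
# it lies entirely before or entirely after it.
def overlaps(span_start: int, span_end: int, start: int, end: int) -> bool:
    return not (span_start >= end or span_end <= start)

print(overlaps(2, 5, 3, 7))  # partial overlap -> True
print(overlaps(0, 3, 3, 7))  # touching but disjoint -> False
```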

filter

default_sort_key(span)

Returns the sort key for filtering spans.

PARAMETER DESCRIPTION
span

Span to sort.

TYPE: Span

RETURNS DESCRIPTION
key

Sort key.

TYPE: Tuple[int, int]

Source code in edsnlp/utils/filter.py
def default_sort_key(span: Span) -> Tuple[int, int]:
    """
    Returns the sort key for filtering spans.

    Parameters
    ----------
    span : Span
        Span to sort.

    Returns
    -------
    key : Tuple[int, int]
        Sort key.
    """
    return span.end - span.start, -span.start
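On `(start, end)` pairs, the same key sorts longer spans first and, among equal lengths, leftmost spans first (the start is negated because sorting is done in reverse). A sketch with tuples standing in for spans:

```python
# Same key as default_sort_key, computed on (start, end) tuples:
# longer spans first, then leftmost among equal lengths.
def sort_key(span):
    start, end = span
    return end - start, -start

spans = [(0, 2), (2, 4), (1, 4)]
print(sorted(spans, key=sort_key, reverse=True))  # [(1, 4), (0, 2), (2, 4)]
```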

filter_spans(spans, label_to_remove=None, return_discarded=False, sort_key=default_sort_key)

Redefinition of spaCy's filtering function, which returns discarded spans as well as filtered ones.

Can also accept a label_to_remove argument, useful for filtering out pseudo cues. If set, results can contain overlapping spans: only spans overlapping with excluded labels are removed. The main expected use case is for pseudo-cues.

The spaCy documentation states:

Filter a sequence of spans and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with Retokenizer.merge. When spans overlap, the (first) longest span is preferred over shorter spans.

Filtering out spans

If the label_to_remove argument is supplied, it might be tempting to filter overlapping spans that are not part of a label to remove.

The reason we keep all other possibly overlapping labels is that in qualifier pipelines, the same cue can precede and follow a marked entity. Hence we need to keep every example.

PARAMETER DESCRIPTION
spans

Spans to filter.

TYPE: List[Span]

return_discarded

Whether to return discarded spans.

TYPE: bool DEFAULT: False

label_to_remove

Label to remove. If set, results can contain overlapping spans.

TYPE: str, optional DEFAULT: None

sort_key

Key for sorting spans before applying overlap conflict resolution. A span with a higher key takes precedence over other spans. By default, the largest, leftmost spans are selected first.

TYPE: Callable[[Span], Any], optional DEFAULT: default_sort_key

RETURNS DESCRIPTION
results

Filtered spans

TYPE: List[Span]

discarded

Discarded spans

TYPE: List[Span], optional

Source code in edsnlp/utils/filter.py
def filter_spans(
    spans: Iterable[Union["Span", Tuple["Span", Any]]],
    label_to_remove: Optional[str] = None,
    return_discarded: bool = False,
    sort_key: Callable[[Span], Any] = default_sort_key,
) -> Union[List["Span"], Tuple[List["Span"], List["Span"]]]:
    """
    Redefinition of spaCy's filtering function, which returns discarded spans
    as well as filtered ones.

    Can also accept a `label_to_remove` argument, useful for filtering out
    pseudo cues. If set, `results` can contain overlapping spans: only
    spans overlapping with excluded labels are removed. The main expected
    use case is for pseudo-cues.

    !!! note ""

        The **spaCy documentation states**:

        > Filter a sequence of spans and remove duplicates or overlaps.
        > Useful for creating named entities (where one token can only
        > be part of one entity) or when merging spans with
        > `Retokenizer.merge`. When spans overlap, the (first)
        > longest span is preferred over shorter spans.

    !!! danger "Filtering out spans"

        If the `label_to_remove` argument is supplied, it might be tempting to
        filter overlapping spans that are not part of a label to remove.

        The reason we keep all other possibly overlapping labels is that in qualifier
        pipelines, the same cue can precede **and** follow a marked entity.
        Hence we need to keep every example.

    Parameters
    ----------
    spans : List[Span]
        Spans to filter.
    return_discarded : bool
        Whether to return discarded spans.
    label_to_remove : str, optional
        Label to remove. If set, results can contain overlapping spans.
    sort_key : Callable[[Span], Any], optional
        Key for sorting spans before applying overlap conflict resolution.
        A span with a higher key will have precedence over another span.
        By default, the largest, leftmost spans are selected first.

    Returns
    -------
    results : List[Span]
        Filtered spans
    discarded : List[Span], optional
        Discarded spans
    """
    sorted_spans = sorted(spans, key=sort_key, reverse=True)
    result = []
    discarded = []
    seen_tokens = set()
    for span in sorted_spans:
        # Check for end - 1 here because boundaries are inclusive
        if span.start not in seen_tokens and span.end - 1 not in seen_tokens:
            if label_to_remove is None or span.label_ != label_to_remove:
                result.append(span)
            if label_to_remove is None or span.label_ == label_to_remove:
                seen_tokens.update(range(span.start, span.end))
        elif label_to_remove is None or span.label_ != label_to_remove:
            discarded.append(span)

    result = sorted(result, key=lambda span: span.start)
    discarded = sorted(discarded, key=lambda span: span.start)

    if return_discarded:
        return result, discarded

    return result
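The conflict-resolution core can be demonstrated without spaCy. In the sketch below, a namedtuple stands in for a span, exposing only the attributes the function reads (`start`, `end`, `label_`), and the `label_to_remove` branch is omitted for brevity:

```python
from collections import namedtuple

# Hypothetical stand-in for spaCy spans.
S = namedtuple("S", "start end label_")

def filter_overlaps(spans):
    # Same algorithm as filter_spans without label_to_remove:
    # longest (then leftmost) spans claim their tokens first.
    key = lambda s: (s.end - s.start, -s.start)
    result, discarded, seen = [], [], set()
    for span in sorted(spans, key=key, reverse=True):
        if span.start not in seen and span.end - 1 not in seen:
            result.append(span)
            seen.update(range(span.start, span.end))
        else:
            discarded.append(span)
    return (
        sorted(result, key=lambda s: s.start),
        sorted(discarded, key=lambda s: s.start),
    )

spans = [S(0, 3, "ent"), S(1, 2, "ent"), S(4, 5, "ent")]
kept, dropped = filter_overlaps(spans)
print([(s.start, s.end) for s in kept])    # [(0, 3), (4, 5)]
print([(s.start, s.end) for s in dropped]) # [(1, 2)]
```

The span `(1, 2)` is discarded because the longer span `(0, 3)` already claimed tokens 0 through 2.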

consume_spans(spans, filter, second_chance=None)

Consume a list of spans according to a filter.

Warning

This method makes the hard hypothesis that:

  1. Spans are sorted.
  2. Spans are consumed in sequence and only once.

The second item is problematic for the way we treat long entities, hence the second_chance parameter, which lets entities be seen more than once.

PARAMETER DESCRIPTION
spans

List of spans to filter

TYPE: List of spans

filter

Filtering function. Should return True when the item is to be included.

TYPE: Callable

second_chance

Optional list of spans to include again (useful for long entities), by default None

TYPE: List of spans, optional DEFAULT: None

RETURNS DESCRIPTION
matches

List of spans consumed by the filter.

TYPE: List of spans

remainder

List of remaining spans in the original spans parameter.

TYPE: List of spans

Source code in edsnlp/utils/filter.py
def consume_spans(
    spans: List[Span],
    filter: Callable,
    second_chance: Optional[List[Span]] = None,
) -> Tuple[List[Span], List[Span]]:
    """
    Consume a list of spans according to a filter.

    !!! warning
        This method makes the hard hypothesis that:

        1. Spans are sorted.
        2. Spans are consumed in sequence and only once.

        The second item is problematic for the way we treat long entities,
        hence the `second_chance` parameter, which lets entities be seen
        more than once.

    Parameters
    ----------
    spans : List of spans
        List of spans to filter
    filter : Callable
        Filtering function. Should return True when the item is to be included.
    second_chance : List of spans, optional
        Optional list of spans to include again (useful for long entities),
        by default None

    Returns
    -------
    matches : List of spans
        List of spans consumed by the filter.
    remainder : List of spans
        List of remaining spans in the original `spans` parameter.
    """

    if not second_chance:
        second_chance = []
    else:
        second_chance = [m for m in second_chance if filter(m)]

    if not spans:
        return second_chance, []

    for i, span in enumerate(spans):
        if not filter(span):
            break
        else:
            i += 1

    matches = spans[:i]
    remainder = spans[i:]

    matches.extend(second_chance)

    return matches, remainder
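With integers standing in for sorted spans, the consumption semantics look like this (note that items are consumed from the front of the list until the filter first rejects one):

```python
# Sketch of the consume_spans control flow, with integers as spans.
def consume(items, accept, second_chance=None):
    second_chance = [m for m in (second_chance or []) if accept(m)]
    if not items:
        return second_chance, []
    i = 0
    for i, item in enumerate(items):
        if not accept(item):
            break
        else:
            i += 1
    return items[:i] + second_chance, items[i:]

matches, remainder = consume([1, 2, 5, 6], lambda x: x < 3)
print(matches, remainder)  # [1, 2] [5, 6]
```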

get_spans(spans, label)

Extracts spans with a given label. Prefer using the hashed label for performance reasons.

PARAMETER DESCRIPTION
spans

List of spans to filter.

TYPE: List[Span]

label

Label to filter on.

TYPE: Union[int, str]

RETURNS DESCRIPTION
List[Span]

Filtered spans.

Source code in edsnlp/utils/filter.py
def get_spans(spans: List[Span], label: Union[int, str]) -> List[Span]:
    """
    Extracts spans with a given label.
    Prefer using the hashed label for performance reasons.

    Parameters
    ----------
    spans : List[Span]
        List of spans to filter.
    label : Union[int, str]
        Label to filter on.

    Returns
    -------
    List[Span]
        Filtered spans.
    """
    if isinstance(label, int):
        return [span for span in spans if span.label == label]
    else:
        return [span for span in spans if span.label_ == label]
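A small sketch of the two dispatch paths, with a namedtuple standing in for spaCy spans (which carry both the hashed label `label` and its string form `label_`):

```python
from collections import namedtuple

# Hypothetical stand-in for spaCy spans.
S = namedtuple("S", "label label_")

def get_spans(spans, label):
    # Integer labels compare against the hash, strings against label_.
    if isinstance(label, int):
        return [s for s in spans if s.label == label]
    return [s for s in spans if s.label_ == label]

spans = [S(1, "date"), S(2, "drug"), S(1, "date")]
print(len(get_spans(spans, "date")))  # 2
print(len(get_spans(spans, 2)))       # 1
```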

resources

get_verbs(verbs=None, check_contains=True)

Extract verbs from the resources, as a pandas dataframe.

PARAMETER DESCRIPTION
verbs

List of verbs to keep. Returns all verbs by default.

TYPE: List[str], optional DEFAULT: None

check_contains

Whether to check that no verb is missing if a list of verbs was provided. By default True

TYPE: bool, optional DEFAULT: True

RETURNS DESCRIPTION
pd.DataFrame

DataFrame containing conjugated verbs.

Source code in edsnlp/utils/resources.py
def get_verbs(
    verbs: Optional[List[str]] = None, check_contains: bool = True
) -> pd.DataFrame:
    """
    Extract verbs from the resources, as a pandas dataframe.

    Parameters
    ----------
    verbs : List[str], optional
        List of verbs to keep. Returns all verbs by default.
    check_contains : bool, optional
        Whether to check that no verb is missing if a list of verbs was provided.
        By default True

    Returns
    -------
    pd.DataFrame
        DataFrame containing conjugated verbs.
    """

    conjugated_verbs = pd.read_csv(BASE_DIR / "resources" / "verbs.csv")

    if not verbs:
        return conjugated_verbs

    verbs = set(verbs)

    selected_verbs = conjugated_verbs[conjugated_verbs.verb.isin(verbs)]

    if check_contains:
        assert len(verbs) == selected_verbs.verb.nunique(), "Some verbs are missing !"

    return selected_verbs

examples

entity_pattern = re.compile('(<ent[^<>]*>[^<>]+</ent>)') module-attribute

text_pattern = re.compile('<ent.*>(.+)</ent>') module-attribute

modifiers_pattern = re.compile('<ent\\s?(.*)>.+</ent>') module-attribute

Match

Bases: BaseModel

Source code in edsnlp/utils/examples.py
class Match(BaseModel):
    start_char: int
    end_char: int
    text: str
    modifiers: str

Modifier

Bases: BaseModel

Source code in edsnlp/utils/examples.py
class Modifier(BaseModel):
    key: str
    value: Union[int, float, bool, str]

Entity

Bases: BaseModel

Source code in edsnlp/utils/examples.py
class Entity(BaseModel):
    start_char: int
    end_char: int
    modifiers: List[Modifier]

find_matches(example)

Finds entities within the example.

PARAMETER DESCRIPTION
example

Example to process.

TYPE: str

RETURNS DESCRIPTION
List[re.Match]

List of matches for entities.

Source code in edsnlp/utils/examples.py
def find_matches(example: str) -> List[re.Match]:
    """
    Finds entities within the example.

    Parameters
    ----------
    example : str
        Example to process.

    Returns
    -------
    List[re.Match]
        List of matches for entities.
    """
    return list(entity_pattern.finditer(example))

parse_match(match)

Parse a regex match representing an entity.

PARAMETER DESCRIPTION
match

Match for an entity.

TYPE: re.Match

RETURNS DESCRIPTION
Match

Usable representation for the entity match.

Source code in edsnlp/utils/examples.py
def parse_match(match: re.Match) -> Match:
    """
    Parse a regex match representing an entity.

    Parameters
    ----------
    match : re.Match
        Match for an entity.

    Returns
    -------
    Match
        Usable representation for the entity match.
    """

    lexical_variant = match.group()
    start_char = match.start()
    end_char = match.end()

    text = text_pattern.findall(lexical_variant)[0]
    modifiers = modifiers_pattern.findall(lexical_variant)[0]

    m = Match(start_char=start_char, end_char=end_char, text=text, modifiers=modifiers)

    return m

parse_example(example)

Parses an example: finds entities and removes the tags.

PARAMETER DESCRIPTION
example

Example to process.

TYPE: str

RETURNS DESCRIPTION
Tuple[str, List[Entity]]

Cleaned text and extracted entities.

Source code in edsnlp/utils/examples.py
def parse_example(example: str) -> Tuple[str, List[Entity]]:
    """
    Parses an example: finds entities and removes the tags.

    Parameters
    ----------
    example : str
        Example to process.

    Returns
    -------
    Tuple[str, List[Entity]]
        Cleaned text and extracted entities.
    """

    matches = [parse_match(match) for match in find_matches(example=example)]
    text = ""
    entities = []

    cursor = 0

    for match in matches:

        text += example[cursor : match.start_char]
        start_char = len(text)
        text += match.text
        end_char = len(text)
        modifiers = [m.split("=") for m in match.modifiers.split()]

        cursor = match.end_char

        entity = Entity(
            start_char=start_char,
            end_char=end_char,
            modifiers=[Modifier(key=k, value=v) for k, v in modifiers],
        )

        entities.append(entity)

    text += example[cursor:]

    return text, entities
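The whole pipeline can be reproduced end to end without pydantic. The sketch below uses the module-level patterns shown above and plain tuples in place of the Entity model; the annotated sentence is an illustrative example, not from the library's test suite:

```python
import re

# Same module-level patterns as edsnlp/utils/examples.py.
entity_pattern = re.compile("(<ent[^<>]*>[^<>]+</ent>)")
text_pattern = re.compile("<ent.*>(.+)</ent>")
modifiers_pattern = re.compile("<ent\\s?(.*)>.+</ent>")

def parse_example(example):
    # Strip <ent ...> tags and recover character offsets in the clean text.
    text, entities, cursor = "", [], 0
    for match in entity_pattern.finditer(example):
        variant = match.group()
        text += example[cursor : match.start()]
        start_char = len(text)
        text += text_pattern.findall(variant)[0]
        end_char = len(text)
        modifiers = dict(
            m.split("=") for m in modifiers_pattern.findall(variant)[0].split()
        )
        cursor = match.end()
        entities.append((start_char, end_char, modifiers))
    text += example[cursor:]
    return text, entities

clean, ents = parse_example("Pas de <ent negated=true>fracture</ent> visible.")
print(clean)  # Pas de fracture visible.
print(ents)   # [(7, 15, {'negated': 'true'})]
```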

deprecation

deprecated_extension(name, new_name)

Source code in edsnlp/utils/deprecation.py
def deprecated_extension(name: str, new_name: str) -> None:
    msg = (
        f'The extension "{name}" is deprecated and will be '
        "removed in a future version. "
        f'Please use "{new_name}" instead.'
    )

    logger.warning(msg)

deprecated_getter_factory(name, new_name)

Source code in edsnlp/utils/deprecation.py
def deprecated_getter_factory(name: str, new_name: str) -> Callable:
    def getter(toklike: Union[Token, Span, Doc]) -> Any:

        n = f"{type(toklike).__name__}._.{name}"
        nn = f"{type(toklike).__name__}._.{new_name}"

        deprecated_extension(n, nn)

        return getattr(toklike._, new_name)

    return getter

deprecation(name, new_name=None)

Source code in edsnlp/utils/deprecation.py
def deprecation(name: str, new_name: Optional[str] = None):

    new_name = new_name or f"eds.{name}"

    msg = (
        f'Calling "{name}" directly is deprecated and '
        "will be removed in a future version. "
        f'Please use "{new_name}" instead.'
    )

    logger.warning(msg)
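The message construction can be sketched in isolation (the `"matcher"` pipe name below is purely illustrative):

```python
# Rebuilds the warning string emitted by `deprecation`, without loguru.
def deprecation_message(name, new_name=None):
    new_name = new_name or f"eds.{name}"
    return (
        f'Calling "{name}" directly is deprecated and '
        "will be removed in a future version. "
        f'Please use "{new_name}" instead.'
    )

print(deprecation_message("matcher"))
```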

deprecated_factory(name, new_name=None, default_config=None, func=None)

Execute the Language.factory method on a modified factory function. The modification adds a deprecation warning.

PARAMETER DESCRIPTION
name

The deprecated name for the pipeline

TYPE: str

new_name

The new name for the pipeline, which should be used, by default None

TYPE: Optional[str], optional DEFAULT: None

default_config

The configuration that should be passed to Language.factory, by default None

TYPE: Optional[Dict[str, Any]], optional DEFAULT: None

func

The function to decorate, by default None

TYPE: Optional[Callable], optional DEFAULT: None

RETURNS DESCRIPTION
Callable
Source code in edsnlp/utils/deprecation.py
def deprecated_factory(
    name: str,
    new_name: Optional[str] = None,
    default_config: Optional[Dict[str, Any]] = None,
    func: Optional[Callable] = None,
) -> Callable:
    """
    Execute the Language.factory method on a modified factory function.
    The modification adds a deprecation warning.

    Parameters
    ----------
    name : str
        The deprecated name for the pipeline
    new_name : Optional[str], optional
        The new name for the pipeline, which should be used, by default None
    default_config : Optional[Dict[str, Any]], optional
        The configuration that should be passed to Language.factory, by default None
    func : Optional[Callable], optional
        The function to decorate, by default None

    Returns
    -------
    Callable
    """

    if default_config is None:
        default_config = dict()

    wrapper = Language.factory(name, default_config=default_config)

    def wrap(factory):

        # Define decorator
        # We use micheles' decorator package to keep the same signature
        # See https://github.com/micheles/decorator/
        @decorator
        def decorate(
            f,
            *args,
            **kwargs,
        ):
            deprecation(name, new_name)
            return f(
                *args,
                **kwargs,
            )

        decorated = decorate(factory)

        wrapper(decorated)

        return factory

    if func is not None:
        return wrap(func)

    return wrap

regex

make_pattern(patterns, with_breaks=False, name=None)

Create OR pattern from a list of patterns.

PARAMETER DESCRIPTION
patterns

List of patterns to merge.

TYPE: List[str]

with_breaks

Whether to add word boundaries (\b) on each side, by default False

TYPE: bool, optional DEFAULT: False

name

Name of the group, using regex ?P<> directive.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
str

Merged pattern.

Source code in edsnlp/utils/regex.py
def make_pattern(
    patterns: List[str],
    with_breaks: bool = False,
    name: Optional[str] = None,
) -> str:
    r"""
    Create OR pattern from a list of patterns.

    Parameters
    ----------
    patterns : List[str]
        List of patterns to merge.
    with_breaks : bool, optional
        Whether to add word boundaries (`\b`) on each side, by default False
    name : str, optional
        Name of the group, using regex `?P<>` directive.

    Returns
    -------
    str
        Merged pattern.
    """

    if name:
        prefix = f"(?P<{name}>"
    else:
        prefix = "("

    # Sorting by length might be more efficient
    patterns.sort(key=len, reverse=True)

    pattern = prefix + "|".join(patterns) + ")"

    if with_breaks:
        pattern = r"\b" + pattern + r"\b"

    return pattern
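A standalone sketch of the same construction (sorting alternatives longest-first also ensures the longest lexical variant wins in `re` alternation); the negation cues are illustrative:

```python
import re

# Same construction as make_pattern: longest alternatives first, an
# optional named group, and optional \b word boundaries.
def make_pattern(patterns, with_breaks=False, name=None):
    prefix = f"(?P<{name}>" if name else "("
    patterns = sorted(patterns, key=len, reverse=True)
    pattern = prefix + "|".join(patterns) + ")"
    if with_breaks:
        pattern = r"\b" + pattern + r"\b"
    return pattern

pattern = make_pattern(["non", "aucun"], with_breaks=True, name="neg")
print(pattern)  # \b(?P<neg>aucun|non)\b
print(re.search(pattern, "aucun signe").group("neg"))  # aucun
```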

compile_regex(reg)

This function tries to compile reg using the re module, and falls back to the more permissive regex module.

PARAMETER DESCRIPTION
reg

RETURNS DESCRIPTION
Union[re.Pattern, regex.Pattern]
Source code in edsnlp/utils/regex.py
def compile_regex(reg):
    """
    This function tries to compile `reg` using the `re` module, and
    falls back to the more permissive `regex` module.

    Parameters
    ----------
    reg: str

    Returns
    -------
    Union[re.Pattern, regex.Pattern]
    """
    try:
        return re.compile(reg)
    except re.error:
        try:
            return regex.compile(reg)
        except regex.error:
            raise Exception("Could not compile: {}".format(repr(reg)))
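The try/except structure can be sketched with the permissive engine injected as a parameter, so the example runs without the `regex` package installed (the function name and parameter are hypothetical, for illustration only):

```python
import re

# Sketch of the same fallback idea: try the stdlib engine first, then
# a more permissive compiler if one is supplied.
def compile_with_fallback(reg, permissive_compile=None):
    try:
        return re.compile(reg)
    except re.error:
        if permissive_compile is not None:
            return permissive_compile(reg)
        raise Exception("Could not compile: {}".format(repr(reg)))

print(compile_with_fallback(r"\d+").findall("a1b22"))  # ['1', '22']
```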