Dates

The eds.dates matcher detects and normalizes dates within a medical document. It relies on simple regular expressions to extract date mentions.

Scope

The eds.dates pipeline finds both absolute dates (e.g. 23/08/2021) and relative dates (e.g. hier, la semaine dernière). It also handles mentions of duration.

Type      Example
absolute  3 mai, 03/05/2020
relative  hier, la semaine dernière
duration  pendant quatre jours
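
To make the three categories concrete, here is a minimal standalone sketch of regex-based detection using only Python's standard library. The patterns and the find_date_mentions helper are hypothetical and deliberately tiny; they are not the patterns shipped with eds.dates, which cover many more phrasings.

```python
import re

# Hypothetical, heavily simplified patterns (not the ones used by eds.dates).
MONTHS = (
    r"(?:janvier|février|mars|avril|mai|juin|juillet|août|"
    r"septembre|octobre|novembre|décembre)"
)
UNITS = r"(?:jours?|semaines?|mois|ans?)"
PATTERNS = {
    "absolute": re.compile(
        rf"\b\d{{1,2}}/\d{{1,2}}/\d{{4}}\b|\b\d{{1,2}}\s+{MONTHS}(?:\s+\d{{4}})?\b",
        flags=re.IGNORECASE,
    ),
    "relative": re.compile(
        rf"\bhier\b|\bla semaine dernière\b|\bil y a (?:une|un|\d+) {UNITS}\b",
        flags=re.IGNORECASE,
    ),
    "duration": re.compile(
        rf"\bpendant (?:une|un|deux|trois|quatre|\d+) {UNITS}\b",
        flags=re.IGNORECASE,
    ),
}


def find_date_mentions(text):
    """Return (start, end, type, matched text) tuples sorted by position."""
    mentions = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            mentions.append((match.start(), match.end(), kind, match.group()))
    return sorted(mentions)


text = "Vu le 3 mai puis le 03/05/2020, revu hier, traité pendant quatre jours."
for start, end, kind, mention in find_date_mentions(text):
    print(kind, "->", mention)
```

The component itself goes further than this sketch: it also normalizes each match into a structured value, as shown in the Usage section.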

See the tutorial for a presentation of a full pipeline featuring the eds.dates component.

Usage

import edsnlp, edsnlp.pipes as eds
import datetime
import pytz

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.dates())

text = (
    "Le patient est admis le 23 août 2021 pour une douleur à l'estomac. "
    "Il lui était arrivé la même chose il y a un an pendant une semaine. "
    "Il a été diagnostiqué en mai 1995."
)

doc = nlp(text)

dates = doc.spans["dates"]
dates
# Out: [23 août 2021, il y a un an, mai 1995]

dates[0]._.date.to_datetime()
# Out: 2021-08-23T00:00:00+02:00

dates[1]._.date.to_datetime()
# Out: None

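# pytz zones must be attached with localize(); passing them as tzinfo=
# to the datetime constructor yields a historical LMT offset instead of +02:00.
note_datetime = pytz.timezone("Europe/Paris").localize(datetime.datetime(2021, 8, 27))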
doc._.note_datetime = note_datetime

dates[1]._.date.to_datetime()
# Out: 2020-08-27T00:00:00+02:00

date_2_output = dates[2]._.date.to_datetime(
    note_datetime=note_datetime,
    infer_from_context=True,
    tz="Europe/Paris",
    default_day=15,
)
date_2_output
# Out: 1995-05-15T00:00:00+02:00

doc.spans["durations"]
# Out: [pendant une semaine]
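
The example above shows two behaviours worth spelling out: a relative date cannot be resolved without a reference date, and a partial date needs a default for its missing fields. The sketch below reproduces that logic with the standard library only. The helpers resolve_years_ago and complete_partial_date are hypothetical names introduced for illustration, and a fixed +02:00 offset stands in for the Europe/Paris time zone; this is not the library's implementation.

```python
import datetime

# Simplification: a fixed +02:00 offset stands in for Europe/Paris summer time.
PARIS_SUMMER = datetime.timezone(datetime.timedelta(hours=2))


def resolve_years_ago(years, note_datetime=None):
    """Resolve "il y a N an(s)" against the note date (hypothetical helper)."""
    if note_datetime is None:
        # No reference point: the relative date cannot be anchored.
        return None
    try:
        return note_datetime.replace(year=note_datetime.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to the 28th.
        return note_datetime.replace(year=note_datetime.year - years, day=28)


def complete_partial_date(year, month, day=None, default_day=15, tz=PARIS_SUMMER):
    """Fill in a missing day, as for "mai 1995" (hypothetical helper)."""
    chosen_day = default_day if day is None else day
    return datetime.datetime(year, month, chosen_day, tzinfo=tz)


note_datetime = datetime.datetime(2021, 8, 27, tzinfo=PARIS_SUMMER)
print(resolve_years_ago(1))                 # None
print(resolve_years_ago(1, note_datetime))  # 2020-08-27 00:00:00+02:00
print(complete_partial_date(1995, 5))       # 1995-05-15 00:00:00+02:00
```

Choosing the middle of the month as the default day bounds the error on a month-only date to about two weeks in either direction.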

Example on a collection of documents stored in the OMOP schema:

import edsnlp, edsnlp.pipes as eds

# A DataFrame with columns "note_id", "note_text" and optionally "note_datetime"
my_omop_df = ...
nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.dates(as_ents=True))
docs = edsnlp.data.from_pandas(my_omop_df)
docs = docs.map_pipeline(nlp)
docs = docs.to_pandas(
    converter="ents",
    span_attributes=["date.datetime"],
)
print(docs)
# note_id  start  end label lexical_variant span_type datetime
# ...

Extensions

The eds.dates pipeline declares two extensions on the Span object:

  • the span._.date attribute of a date contains a parsed version of the date.
  • the span._.duration attribute of a duration contains a parsed version of the duration.

As with other components, you can use the span._.value attribute to get either the parsed date or the duration depending on the span.

Parameters

PARAMETER DESCRIPTION
nlp

The pipeline object

TYPE: PipelineProtocol

name

Name of the pipeline component

TYPE: Optional[str]

absolute

List of regular expressions for absolute dates.

TYPE: Union[List[str], str] DEFAULT: None

relative

List of regular expressions for relative dates (e.g. hier, la semaine prochaine).

TYPE: Union[List[str], str] DEFAULT: None

duration

List of regular expressions for durations (e.g. pendant trois mois).

TYPE: Union[List[str], str] DEFAULT: None

false_positive

List of regular expressions for false positives (e.g. phone numbers).

TYPE: Union[List[str], str] DEFAULT: None
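
As an illustration of why false-positive patterns matter, the following standalone sketch discards date candidates that overlap a phone number. Both regular expressions and the find_dates helper are hypothetical and much cruder than anything a real configuration would use; the sketch only shows the filtering idea, not how the component applies false_positive internally.

```python
import re

# Hypothetical patterns, for illustration only.
DATE = re.compile(r"\b\d{2}[/.]\d{2}[/.]\d{2,4}\b")
PHONE = re.compile(r"\b0\d(?:[ .]\d{2}){4}\b")  # French-style phone number


def overlaps(a, b):
    """True if the half-open character ranges a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def find_dates(text):
    """Return date candidates that do not overlap a false-positive match."""
    blocked = [match.span() for match in PHONE.finditer(text)]
    return [
        match.group()
        for match in DATE.finditer(text)
        if not any(overlaps(match.span(), span) for span in blocked)
    ]


text = "Tél : 01.23.45.67.89, consultation du 12.03.2021"
print([m.group() for m in DATE.finditer(text)])  # ['01.23.45', '12.03.2021']
print(find_dates(text))                          # ['12.03.2021']
```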

span_getter

Where to look for dates in the doc. By default, the whole doc is searched. Combine this with the merge_mode argument to control how matches are reconciled with the spans returned by span_getter.

TYPE: SpanGetterArg DEFAULT: None

merge_mode

How to merge matched dates with the spans from span_getter, if given:

  • intersect: return only the matches that fall in the span_getter spans
  • align: if a date overlaps a span from span_getter (e.g. a date extracted by a machine learning model), return the span_getter span instead, and assign all the parsed information (._.date / ._.duration) to it. Otherwise don't return the date.

TYPE: Literal['intersect', 'align'] DEFAULT: intersect
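
The difference between the two modes can be sketched on plain character offsets. In this hypothetical example, matches are (start, end, parsed value) tuples, candidates are the (start, end) spans that span_getter would return, and "fall in" is read as full containment, which is an assumption. The merge function is illustrative and does not reproduce the component's code.

```python
def overlaps(a, b):
    """True if the half-open ranges a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def merge(matches, candidates, mode="intersect"):
    """Reconcile regex matches with candidate spans (hypothetical helper)."""
    if mode == "intersect":
        # Keep each match that lies entirely inside some candidate span.
        return [
            m for m in matches
            if any(c[0] <= m[0] and m[1] <= c[1] for c in candidates)
        ]
    if mode == "align":
        # Return the candidate span itself, carrying the parsed value of the
        # first match that overlaps it; matches without a candidate are dropped.
        aligned = []
        for c in candidates:
            for m in matches:
                if overlaps(c, m):
                    aligned.append((c[0], c[1], m[2]))
                    break
        return aligned
    raise ValueError(f"Unknown merge mode: {mode!r}")


matches = [(24, 36, "2021-08-23"), (100, 108, "1995-05")]
print(merge(matches, [(20, 40)], mode="intersect"))  # [(24, 36, '2021-08-23')]
print(merge(matches, [(21, 36)], mode="align"))      # [(21, 36, '2021-08-23')]
```

With align, the boundaries come from the candidate span (here 21 to 36) while the parsed value comes from the regex match, which is useful when another component has already decided where the date mention starts and ends.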

on_ents_only

Deprecated, use span_getter and merge_mode instead. Whether to look for dates in the whole document or in specific sentences:

  • If True: Only look in the sentences of each entity in doc.ents
  • If False: Look in the whole document
  • If given a string key or list of strings: Only look in the sentences of each entity in doc.spans[key]

TYPE: Union[bool, str, Iterable[str]] DEFAULT: None

detect_periods

Whether to detect periods (experimental)

TYPE: bool DEFAULT: False

detect_time

Whether to detect time inside dates

DEFAULT: True

period_proximity_threshold

Max number of words between two dates to extract a period.

TYPE: int DEFAULT: 3
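
One plausible reading of this threshold, sketched on token indices: two consecutive dates form a period when at most period_proximity_threshold tokens separate them. The pair_periods helper and the whitespace tokenization are assumptions made for illustration; the component's actual pairing rules may differ, for instance by requiring specific cue words.

```python
def pair_periods(date_spans, threshold=3):
    """Pair consecutive dates separated by at most `threshold` tokens.

    date_spans: sorted (start_token, end_token) pairs, end exclusive.
    Returns (start_token, end_token) spans covering both dates.
    Hypothetical helper, not the component's implementation.
    """
    periods = []
    for (start_1, end_1), (start_2, end_2) in zip(date_spans, date_spans[1:]):
        gap = start_2 - end_1  # number of tokens between the two dates
        if gap <= threshold:
            periods.append((start_1, end_2))
    return periods


tokens = (
    "hospitalisé du 5 mai au 12 mai puis revu "
    "en consultation externe le 3 juin"
).split()
date_spans = [(2, 4), (5, 7), (13, 15)]  # "5 mai", "12 mai", "3 juin"

for start, end in pair_periods(date_spans, threshold=3):
    print(" ".join(tokens[start:end]))  # 5 mai au 12 mai
```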

as_ents

Deprecated, use span_setter instead. Whether to treat dates as entities

TYPE: bool DEFAULT: False

attr

spaCy attribute to use

TYPE: str DEFAULT: LOWER

date_label

Label to use for dates

TYPE: str DEFAULT: date

duration_label

Label to use for durations

TYPE: str DEFAULT: duration

period_label

Label to use for periods

TYPE: str DEFAULT: period

span_setter

How to set matches in the doc.

TYPE: SpanSetterArg DEFAULT: {'dates': ['date'], 'durations': ['duration'], ...

explain

Whether to keep track of regex cues for each entity.

TYPE: bool DEFAULT: False

Authors and citation

The eds.dates pipeline was developed by AP-HP's Data Science team.