Dates

The eds.dates matcher detects and normalizes dates within a medical document. It relies on simple regular expressions to extract date mentions.

Scope

The eds.dates pipeline finds both absolute (e.g. 23/08/2021) and relative (e.g. hier, la semaine dernière) dates. It also handles mentions of duration.

Type	Example
`absolute`	`3 mai`, `03/05/2020`
`relative`	`hier`, `la semaine dernière`
`duration`	`pendant quatre jours`

See the tutorial for a presentation of a full pipeline featuring the eds.dates component.
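To make the three mention types concrete, here is a minimal standard-library sketch of regex-based detection. The patterns and the `find_mentions` helper are hypothetical and heavily simplified; they are not the expressions shipped with eds.dates, which cover far more variants.

```python
import re

# Hypothetical, simplified patterns: NOT the ones used by eds.dates.
MONTHS = (
    "janvier|février|mars|avril|mai|juin|juillet|août|"
    "septembre|octobre|novembre|décembre"
)
UNITS = r"(?:jours?|semaines?|mois|ans?)"

PATTERNS = {
    "absolute": re.compile(
        rf"\b\d{{1,2}}/\d{{1,2}}/\d{{4}}\b"
        rf"|\b\d{{1,2}}\s+(?:{MONTHS})(?:\s+\d{{4}})?\b",
        re.IGNORECASE,
    ),
    "relative": re.compile(
        rf"\bhier\b|\bla semaine dernière\b|\bil y a \w+ {UNITS}\b",
        re.IGNORECASE,
    ),
    "duration": re.compile(rf"\bpendant \w+ {UNITS}\b", re.IGNORECASE),
}


def find_mentions(text):
    """Return (start, end, type, matched text) tuples sorted by position."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), kind, match.group()))
    return sorted(found)


text = "Vu le 03/05/2020, puis hier, traité pendant quatre jours."
mentions = find_mentions(text)
for start, end, kind, value in mentions:
    print(kind, repr(value))
# absolute '03/05/2020'
# relative 'hier'
# duration 'pendant quatre jours'
```

This sketch only locates mentions; normalizing them into structured dates or durations is a separate step.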

Usage

import edsnlp, edsnlp.pipes as eds
import datetime
import pytz

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.dates())

text = (
    "Le patient est admis le 23 août 2021 pour une douleur à l'estomac. "
    "Il lui était arrivé la même chose il y a un an pendant une semaine. "
    "Il a été diagnostiqué en mai 1995."
)

doc = nlp(text)

dates = doc.spans["dates"]
dates
# Out: [23 août 2021, il y a un an, mai 1995]

dates[0]._.date.to_datetime()
# Out: 2021-08-23T00:00:00+02:00

dates[1]._.date.to_datetime()
# Out: None

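# localize() applies the zone's actual UTC offset; passing a pytz zone as tzinfo= would use local mean time
note_datetime = pytz.timezone("Europe/Paris").localize(datetime.datetime(2021, 8, 27))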
doc._.note_datetime = note_datetime

dates[1]._.date.to_datetime()
# Out: 2020-08-27T00:00:00+02:00

date_2_output = dates[2]._.date.to_datetime(
    note_datetime=note_datetime,
    infer_from_context=True,
    tz="Europe/Paris",
    default_day=15,
)
date_2_output
# Out: 1995-05-15T00:00:00+02:00

doc.spans["durations"]
# Out: [pendant une semaine]
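The two resolution steps in the example above (anchoring a relative date to the note date, and filling in a missing day) can be sketched with the standard library alone. The helpers `resolve_relative` and `complete_partial` are hypothetical names used for illustration and do not reflect how eds.dates is implemented; the fixed `+02:00` offset is an assumption standing in for the Europe/Paris summer time.

```python
import datetime

# Assumption: fixed +02:00 offset, standing in for Europe/Paris in summer.
PARIS_SUMMER = datetime.timezone(datetime.timedelta(hours=2))


def resolve_relative(note_datetime, years=0, weeks=0, days=0):
    """Shift the note date back by an offset ("il y a un an" -> years=1)."""
    if note_datetime is None:
        # Without an anchor date, a relative mention cannot be resolved.
        return None
    shifted = note_datetime - datetime.timedelta(weeks=weeks, days=days)
    try:
        return shifted.replace(year=shifted.year - years)
    except ValueError:
        # 29 February shifted into a non-leap year: fall back to the 28th.
        return shifted.replace(year=shifted.year - years, day=28)


def complete_partial(year, month, day=None, default_day=15, tz=None):
    """Build a datetime from an incomplete date, filling in a default day."""
    return datetime.datetime(year, month, day or default_day, tzinfo=tz)


note = datetime.datetime(2021, 8, 27, tzinfo=PARIS_SUMMER)

print(resolve_relative(None, years=1))
# None
print(resolve_relative(note, years=1).isoformat())
# 2020-08-27T00:00:00+02:00
print(complete_partial(1995, 5, tz=PARIS_SUMMER).isoformat())
# 1995-05-15T00:00:00+02:00
```

The first call mirrors the `None` result obtained before `note_datetime` is set on the document.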

Example on a collection of documents stored in the OMOP schema:

import edsnlp, edsnlp.pipes as eds

# with cols "note_id", "note_text" and optionally "note_datetime"
my_omop_df = ...
nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.dates(as_ents=True))
docs = edsnlp.data.from_pandas(my_omop_df)
docs = docs.map_pipeline(nlp)
docs = docs.to_pandas(
    converter="ents",
    span_attributes=["date.datetime"],
)
print(docs)
# note_id  start  end label lexical_variant span_type datetime
# ...

Extensions

The eds.dates pipeline declares two extensions on the Span object:

the span._.date attribute of a date contains a parsed version of the date.
the span._.duration attribute of a duration contains a parsed version of the duration.

As with other components, you can use the span._.value attribute to get either the parsed date or the duration depending on the span.

Parameters

PARAMETER	DESCRIPTION
`nlp`	The pipeline object TYPE: `PipelineProtocol`
`name`	Name of the pipeline component TYPE: `Optional[str]`
`absolute`	List of regular expressions for absolute dates. TYPE: `Union[List[str], str]` DEFAULT: `None`
`relative`	List of regular expressions for relative dates (e.g. `hier`, `la semaine prochaine`). TYPE: `Union[List[str], str]` DEFAULT: `None`
`duration`	List of regular expressions for durations (e.g. `pendant trois mois`). TYPE: `Union[List[str], str]` DEFAULT: `None`
`false_positive`	List of regular expressions for false positives (e.g. phone numbers). TYPE: `Union[List[str], str]` DEFAULT: `None`
`span_getter`	Where to look for dates in the doc. By default, look in the whole doc. You can combine this with the `merge_mode` argument for interesting results. TYPE: `SpanGetterArg` DEFAULT: `None`
`merge_mode`	How to merge matched dates with the spans from `span_getter`, if given. `intersect`: return only the matches that fall within the `span_getter` spans. `align`: if a date overlaps a span from `span_getter` (e.g. a date extracted by a machine learning model), return the `span_getter` span instead and assign all the parsed information (`._.date` / `._.duration`) to it; otherwise, do not return the date. TYPE: `Literal['intersect', 'align']` DEFAULT: `intersect`
`on_ents_only`	Deprecated, use `span_getter` and `merge_mode` instead. Whether to look for dates in the whole document or in specific sentences. If `True`: only look in the sentences of each entity in `doc.ents`. If `False`: look in the whole document. If given a string `key` or a list of strings: only look in the sentences of each entity in `doc.spans[key]`. TYPE: `Union[bool, str, Iterable[str]]` DEFAULT: `None`
`detect_periods`	Whether to detect periods (experimental) TYPE: `bool` DEFAULT: `False`
`detect_time`	Whether to detect time inside dates DEFAULT: `True`
`period_proximity_threshold`	Max number of words between two dates to extract a period. TYPE: `int` DEFAULT: `3`
`as_ents`	Deprecated, use `span_setter` instead. Whether to treat dates as entities. TYPE: `bool` DEFAULT: `False`
`attr`	spaCy attribute to use TYPE: `str` DEFAULT: `LOWER`
`date_label`	Label to use for dates TYPE: `str` DEFAULT: `date`
`duration_label`	Label to use for durations TYPE: `str` DEFAULT: `duration`
`period_label`	Label to use for periods TYPE: `str` DEFAULT: `period`
`span_setter`	How to set matches in the doc. TYPE: `SpanSetterArg` DEFAULT: `{'dates': ['date'], 'durations': ['duration'], ...`
`explain`	Whether to keep track of regex cues for each entity. TYPE: `bool` DEFAULT: `False`
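
The difference between the two `merge_mode` values described in the table can be illustrated on plain `(start, end)` offsets. This is a standalone sketch, not the component's code: the `merge` function is hypothetical, offsets are end-exclusive, and "fall in" is assumed to mean full containment.

```python
def merge(matches, reference_spans, merge_mode="intersect"):
    """Combine regex matches with reference spans.

    Both arguments are lists of (start, end) offsets, end exclusive.
    """
    if merge_mode == "intersect":
        # Keep a match only if it lies entirely inside some reference span
        # (assumed reading of "fall in").
        return [
            m
            for m in matches
            if any(r[0] <= m[0] and m[1] <= r[1] for r in reference_spans)
        ]
    if merge_mode == "align":
        # Return each reference span that overlaps at least one match,
        # in place of the match itself; matches with no overlap are dropped.
        return [
            r
            for r in reference_spans
            if any(m[0] < r[1] and r[0] < m[1] for m in matches)
        ]
    raise ValueError(f"Unknown merge_mode: {merge_mode!r}")


matches = [(5, 8), (20, 22)]
reference_spans = [(4, 10), (21, 30)]

print(merge(matches, reference_spans, "intersect"))
# [(5, 8)]
print(merge(matches, reference_spans, "align"))
# [(4, 10), (21, 30)]
```

In the real component, `align` also copies the parsed information onto the returned span; the sketch leaves that part out.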

Authors and citation

The eds.dates pipeline was developed by AP-HP's Data Science team.