edsnlp.pipelines.misc.dates

dates

Dates

Bases: BaseComponent

Tags and normalizes dates, using the open-source dateparser library.

The pipeline uses spaCy's filter_spans function to filter out false positives and to introduce a hierarchy between patterns. For instance, in case of ambiguity, the pipeline treats a match as a date without a year rather than as a date without a day.
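
The overlap resolution described above can be sketched in plain Python. This is a simplified stand-in for spaCy's `filter_spans`, not the library's code: spans are hypothetical `(start, end, label)` tuples, and the sketch assumes candidates reach the filter in the order their patterns were registered, so that a tie between equally long spans goes to the earlier pattern.

```python
from typing import List, Tuple

Candidate = Tuple[int, int, str]  # (start token, end token exclusive, label)


def keep_longest_first(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the longest span among overlapping ones; ties go to the earliest."""
    # sorted() is stable, so equally ranked candidates keep their input order.
    ranked = sorted(candidates, key=lambda c: (c[1] - c[0], -c[0]), reverse=True)
    kept, seen = [], set()
    for start, end, label in ranked:
        if start not in seen and end - 1 not in seen:
            kept.append((start, end, label))
            seen.update(range(start, end))
    return sorted(kept)


candidates = [
    (0, 5, "false_positive"),  # e.g. a phone number such as "01 02 03 04 05"
    (0, 2, "no_year"),         # "01 02" read as a day and a month
    (7, 9, "no_year"),         # same tokens, two competing readings
    (7, 9, "no_day"),
]
kept = keep_longest_first(candidates)
# False positives win the overlap first, then get dropped from the result.
dates = [c for c in kept if c[2] != "false_positive"]
print(dates)  # [(7, 9, 'no_year')]
```

The longer false-positive span shadows the date-like fragment inside it, and is only discarded after the overlap has been resolved.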

PARAMETER DESCRIPTION
nlp

Language pipeline object

TYPE: spacy.language.Language

absolute

List of regular expressions for absolute dates.

TYPE: Union[List[str], str]

full

List of regular expressions for full dates in YYYY-MM-DD format.

TYPE: Union[List[str], str]

relative

List of regular expressions for relative dates (e.g. hier, la semaine prochaine).

TYPE: Union[List[str], str]

no_year

List of regular expressions for dates that do not display a year.

TYPE: Union[List[str], str]

no_day

List of regular expressions for dates that do not display a day.

TYPE: Union[List[str], str]

year_only

List of regular expressions for dates that only display a year.

TYPE: Union[List[str], str]

current

List of regular expressions for dates that relate to the current month, week, year, etc.

TYPE: Union[List[str], str]

false_positive

List of regular expressions for false positives (e.g. phone numbers).

TYPE: Union[List[str], str]

on_ents_only

Whether to look for dates in the whole document or in specific sentences:

  • If True: Only look in the sentences of each entity in doc.ents
  • If False: Look in the whole document
  • If given a string key or list of string: Only look in the sentences of each entity in doc.spans[key]

TYPE: Union[bool, str, List[str]]
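
The three accepted forms of on_ents_only can be summarised with a small dispatch sketch. The helper `span_sources` is hypothetical and only returns a description of where the component would look; it mirrors the truthiness test used in the source, so an empty string or list behaves like False.

```python
from typing import List, Optional, Union


def span_sources(on_ents_only: Union[bool, str, List[str]]) -> Optional[List[str]]:
    """Describe which span containers restrict the search (None: whole document)."""
    if not on_ents_only:
        return None
    if isinstance(on_ents_only, bool):
        return ["doc.ents"]
    keys = [on_ents_only] if isinstance(on_ents_only, str) else list(on_ents_only)
    return [f"doc.spans[{key!r}]" for key in keys]


print(span_sources(False))       # None
print(span_sources(True))        # ['doc.ents']
print(span_sources("drugs"))     # ["doc.spans['drugs']"]
print(span_sources(["a", "b"]))  # ["doc.spans['a']", "doc.spans['b']"]
```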

Source code in edsnlp/pipelines/misc/dates/dates.py
class Dates(BaseComponent):
    """
    Tags and normalizes dates, using the open-source `dateparser` library.

    The pipeline uses spaCy's `filter_spans` function.
    It filters out false positives, and introduces a hierarchy between patterns.
    For instance, in case of ambiguity, the pipeline will decide that a date is a
    date without a year rather than a date without a day.

    Parameters
    ----------
    nlp : spacy.language.Language
        Language pipeline object
    absolute : Union[List[str], str]
        List of regular expressions for absolute dates.
    full : Union[List[str], str]
        List of regular expressions for full dates in YYYY-MM-DD format.
    relative : Union[List[str], str]
        List of regular expressions for relative dates
        (eg `hier`, `la semaine prochaine`).
    no_year : Union[List[str], str]
        List of regular expressions for dates that do not display a year.
    no_day : Union[List[str], str]
        List of regular expressions for dates that do not display a day.
    year_only : Union[List[str], str]
        List of regular expressions for dates that only display a year.
    current : Union[List[str], str]
        List of regular expressions for dates that relate to
        the current month, week, year, etc.
    false_positive : Union[List[str], str]
        List of regular expressions for false positive (eg phone numbers, etc).
    on_ents_only : Union[bool, str, List[str]]
        Whether to look for dates in the whole document or in specific sentences:

        - If `True`: Only look in the sentences of each entity in doc.ents
        - If `False`: Look in the whole document
        - If given a string `key` or list of string: Only look in the sentences of
          each entity in `#!python doc.spans[key]`
    """

    # noinspection PyProtectedMember
    def __init__(
        self,
        nlp: Language,
        absolute: Optional[Union[List[str], str]],
        full: Optional[Union[List[str], str]],
        relative: Optional[Union[List[str], str]],
        no_year: Optional[Union[List[str], str]],
        no_day: Optional[Union[List[str], str]],
        year_only: Optional[Union[List[str], str]],
        current: Optional[Union[List[str], str]],
        false_positive: Optional[Union[List[str], str]],
        on_ents_only: Union[bool, str, List[str]],
        attr: str,
    ):

        self.nlp = nlp

        if no_year is None:
            no_year = patterns.no_year_pattern
        if year_only is None:
            year_only = patterns.full_year_pattern
        if no_day is None:
            no_day = patterns.no_day_pattern
        if absolute is None:
            absolute = patterns.absolute_date_pattern
        if relative is None:
            relative = patterns.relative_date_pattern
        if full is None:
            full = patterns.full_date_pattern
        if current is None:
            current = patterns.current_pattern
        if false_positive is None:
            false_positive = patterns.false_positive_pattern

        if isinstance(absolute, str):
            absolute = [absolute]
        if isinstance(relative, str):
            relative = [relative]
        if isinstance(no_year, str):
            no_year = [no_year]
        if isinstance(no_day, str):
            no_day = [no_day]
        if isinstance(year_only, str):
            year_only = [year_only]
        if isinstance(full, str):
            full = [full]
        if isinstance(current, str):
            current = [current]
        if isinstance(false_positive, str):
            false_positive = [false_positive]

        self.on_ents_only = on_ents_only
        self.regex_matcher = RegexMatcher(attr=attr, alignment_mode="strict")

        self.regex_matcher.add("false_positive", false_positive)
        self.regex_matcher.add("full_date", full)
        self.regex_matcher.add("absolute", absolute)
        self.regex_matcher.add("relative", relative)
        self.regex_matcher.add("no_year", no_year)
        self.regex_matcher.add("no_day", no_day)
        self.regex_matcher.add("year_only", year_only)
        self.regex_matcher.add("current", current)

        self.parser = date_parser
        self.set_extensions()

    @staticmethod
    def set_extensions() -> None:

        if not Doc.has_extension("note_datetime"):
            Doc.set_extension("note_datetime", default=None)

        if not Span.has_extension("parsed_date"):
            Span.set_extension("parsed_date", default=None)

        if not Span.has_extension("parsed_delta"):
            Span.set_extension("parsed_delta", default=None)

        if not Span.has_extension("date"):
            Span.set_extension("date", getter=date_getter)

    def process(self, doc: Doc) -> List[Span]:
        """
        Find dates in doc.

        Parameters
        ----------
        doc:
            spaCy Doc object

        Returns
        -------
        dates:
            list of date spans
        """

        if self.on_ents_only:

            if isinstance(self.on_ents_only, bool):
                ents = doc.ents
            else:
                if isinstance(self.on_ents_only, str):
                    self.on_ents_only = [self.on_ents_only]
                ents = []
                for key in self.on_ents_only:
                    ents.extend(list(doc.spans[key]))

            dates = []
            for sent in set([ent.sent for ent in ents]):
                dates = chain(
                    dates,
                    self.regex_matcher(
                        sent,
                        as_spans=True,
                        # return_groupdict=True,
                    ),
                )

        else:
            dates = self.regex_matcher(
                doc,
                as_spans=True,
                # return_groupdict=True,
            )

        # dates = apply_groupdict(dates)

        dates = filter_spans(dates)
        dates = [date for date in dates if date.label_ != "false_positive"]

        return dates

    def get_date(self, date: Span) -> Optional[datetime]:
        """
        Get normalised date using `dateparser`.

        Parameters
        ----------
        date : Span
            Date span.

        Returns
        -------
        Optional[datetime]
            If a date is recognised, returns a Python `datetime` object.
            Returns `None` otherwise.
        """

        text_date = date.text

        if date.label_ == "no_day":
            text_date = "01/" + re.sub(r"[\.\/\s]", "/", text_date)

        elif date.label_ == "full_date":
            text_date = re.sub(r"[\.\/\s]", "-", text_date)

            try:
                return datetime.strptime(text_date, "%Y-%m-%d")
            except ValueError:
                try:
                    return datetime.strptime(text_date, "%Y-%d-%m")
                except ValueError:
                    return None

        # text_date = re.sub(r"\.", "-", text_date)

        return self.parser(text_date)

    def __call__(self, doc: Doc) -> Doc:
        """
        Tags dates.

        Parameters
        ----------
        doc:
            spaCy Doc object

        Returns
        -------
        doc:
            spaCy Doc object, annotated for dates
        """
        dates = self.process(doc)

        for date in dates:
            d = self.get_date(date)

            if d is None:
                date._.parsed_date = None
            else:
                date._.parsed_date = d
                date._.parsed_delta = d - datetime.now() + timedelta(seconds=10)

        doc.spans["dates"] = dates

        return doc
process(doc)

Find dates in doc.

PARAMETER DESCRIPTION
doc

spaCy Doc object

TYPE: Doc

RETURNS DESCRIPTION
dates

list of date spans

Source code in edsnlp/pipelines/misc/dates/dates.py
def process(self, doc: Doc) -> List[Span]:
    """
    Find dates in doc.

    Parameters
    ----------
    doc:
        spaCy Doc object

    Returns
    -------
    dates:
        list of date spans
    """

    if self.on_ents_only:

        if isinstance(self.on_ents_only, bool):
            ents = doc.ents
        else:
            if isinstance(self.on_ents_only, str):
                self.on_ents_only = [self.on_ents_only]
            ents = []
            for key in self.on_ents_only:
                ents.extend(list(doc.spans[key]))

        dates = []
        for sent in set([ent.sent for ent in ents]):
            dates = chain(
                dates,
                self.regex_matcher(
                    sent,
                    as_spans=True,
                    # return_groupdict=True,
                ),
            )

    else:
        dates = self.regex_matcher(
            doc,
            as_spans=True,
            # return_groupdict=True,
        )

    # dates = apply_groupdict(dates)

    dates = filter_spans(dates)
    dates = [date for date in dates if date.label_ != "false_positive"]

    return dates
get_date(date)

Get normalised date using dateparser.

PARAMETER DESCRIPTION
date

Date span.

TYPE: Span

RETURNS DESCRIPTION
Optional[datetime]

If a date is recognised, returns a Python datetime object. Returns None otherwise.
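
For spans labelled full_date, the method bypasses dateparser and relies on `datetime.strptime`. The following standalone sketch reproduces that branch; the function name `parse_full_date` is hypothetical and the example inputs are illustrative.

```python
import re
from datetime import datetime
from typing import Optional


def parse_full_date(text: str) -> Optional[datetime]:
    """Parse a year-first date, trying year-month-day before year-day-month."""
    # Unify the separators (dot, slash, whitespace) into hyphens.
    normalized = re.sub(r"[./\s]", "-", text)
    for fmt in ("%Y-%m-%d", "%Y-%d-%m"):
        try:
            return datetime.strptime(normalized, fmt)
        except ValueError:
            continue
    return None


print(parse_full_date("2021/03/04"))  # 2021-03-04 00:00:00
print(parse_full_date("2021.25.12"))  # 2021-12-25 00:00:00 (25 is not a valid month)
print(parse_full_date("2021-13-13"))  # None
```

Because year-month-day is tried first, an ambiguous string such as 2021-03-04 is always read as 4 March.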

Source code in edsnlp/pipelines/misc/dates/dates.py
def get_date(self, date: Span) -> Optional[datetime]:
    """
    Get normalised date using `dateparser`.

    Parameters
    ----------
    date : Span
        Date span.

    Returns
    -------
    Optional[datetime]
        If a date is recognised, returns a Python `datetime` object.
        Returns `None` otherwise.
    """

    text_date = date.text

    if date.label_ == "no_day":
        text_date = "01/" + re.sub(r"[\.\/\s]", "/", text_date)

    elif date.label_ == "full_date":
        text_date = re.sub(r"[\.\/\s]", "-", text_date)

        try:
            return datetime.strptime(text_date, "%Y-%m-%d")
        except ValueError:
            try:
                return datetime.strptime(text_date, "%Y-%d-%m")
            except ValueError:
                return None

    # text_date = re.sub(r"\.", "-", text_date)

    return self.parser(text_date)
__call__(doc)

Tags dates.

PARAMETER DESCRIPTION
doc

spaCy Doc object

TYPE: Doc

RETURNS DESCRIPTION
doc

spaCy Doc object, annotated for dates
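
Besides the parsed date, the component stores a parsed_delta that is padded by ten seconds. One plausible reading, sketched below with made-up timestamps, is that a relative date is resolved against the clock a few milliseconds before the delta is computed, so the raw difference falls just short of a whole number of days. The assumption that the parser resolves relative expressions against the current time is not confirmed by this page.

```python
from datetime import datetime, timedelta

note_datetime = datetime(2021, 3, 4)            # note stamped at midnight
parsed_at = datetime(2022, 6, 15, 10, 0, 0)     # clock when "demain" was resolved
parsed = parsed_at + timedelta(days=1)          # "demain" (tomorrow)
now = parsed_at + timedelta(milliseconds=5)     # clock when the delta is computed

raw_delta = parsed - now                        # one day minus 5 ms
padded_delta = raw_delta + timedelta(seconds=10)

print((note_datetime + raw_delta).date())       # 2021-03-04, one day short
print((note_datetime + padded_delta).date())    # 2021-03-05
```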

Source code in edsnlp/pipelines/misc/dates/dates.py
def __call__(self, doc: Doc) -> Doc:
    """
    Tags dates.

    Parameters
    ----------
    doc:
        spaCy Doc object

    Returns
    -------
    doc:
        spaCy Doc object, annotated for dates
    """
    dates = self.process(doc)

    for date in dates:
        d = self.get_date(date)

        if d is None:
            date._.parsed_date = None
        else:
            date._.parsed_date = d
            date._.parsed_delta = d - datetime.now() + timedelta(seconds=10)

    doc.spans["dates"] = dates

    return doc

td2str(td)

Transforms a timedelta object to a string representation.

PARAMETER DESCRIPTION
td

The timedelta object to represent.

TYPE: timedelta

RETURNS DESCRIPTION
str

Usable representation for the timedelta object.
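
A few sample calls make the output format concrete. The function body is copied from the source listing so that the snippet runs on its own; the inputs are illustrative.

```python
from datetime import timedelta


def td2str(td: timedelta) -> str:
    seconds = td.total_seconds()
    days = int(seconds / 3600 / 24)  # int() truncates towards zero
    return f"TD{days:+d}"


print(td2str(timedelta(days=-3)))          # TD-3
print(td2str(timedelta(days=2, hours=5)))  # TD+2
print(td2str(timedelta(hours=-5)))         # TD+0
```

Since the day count is truncated rather than rounded, any offset shorter than a full day is rendered as TD+0, whichever direction it points in.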

Source code in edsnlp/pipelines/misc/dates/dates.py
def td2str(td: timedelta):
    """
    Transforms a timedelta object to a string representation.

    Parameters
    ----------
    td : timedelta
        The timedelta object to represent.

    Returns
    -------
    str
        Usable representation for the timedelta object.
    """
    seconds = td.total_seconds()
    days = int(seconds / 3600 / 24)
    return f"TD{days:+d}"

date_getter(date)

Getter for dates. Uses the information from note_datetime.

PARAMETER DESCRIPTION
date

Date detected by the pipeline.

TYPE: Span

RETURNS DESCRIPTION
str

Normalized date.
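
The normalization rules can be exercised without spaCy through a standalone sketch. The function `normalize` is hypothetical: it takes the label, parsed date, delta and note date as plain arguments instead of reading them from a Span, and otherwise follows the branches of the source listing (with an extra guard for a missing delta).

```python
from datetime import datetime, timedelta
from typing import Optional


def normalize(
    label: str,
    parsed: Optional[datetime],
    delta: Optional[timedelta] = None,
    note_datetime: Optional[datetime] = None,
) -> str:
    if parsed is None:
        return "????-??-??"  # the date could not be interpreted
    if label in {"absolute", "full_date", "no_day"}:
        return parsed.strftime("%Y-%m-%d")
    if label == "no_year":
        # Borrow the year from the note when it is known.
        year = note_datetime.strftime("%Y") if note_datetime else "????"
        return f"{year}-{parsed:%m-%d}"
    if note_datetime and delta is not None:
        # Relative dates are re-anchored on the note's own date.
        return (note_datetime + delta).strftime("%Y-%m-%d")
    days = int((parsed - datetime.now()).total_seconds() / 3600 / 24)
    return f"TD{days:+d}"


note = datetime(2019, 6, 1)
print(normalize("absolute", datetime(2021, 3, 4)))                     # 2021-03-04
print(normalize("no_year", datetime(2022, 3, 4)))                      # ????-03-04
print(normalize("no_year", datetime(2022, 3, 4), note_datetime=note))  # 2019-03-04
print(normalize("relative", datetime(2000, 1, 1), timedelta(days=-1), note))  # 2019-05-31
```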

Source code in edsnlp/pipelines/misc/dates/dates.py
def date_getter(date: Span) -> str:
    """
    Getter for dates. Uses the information from `note_datetime`.

    Parameters
    ----------
    date : Span
        Date detected by the pipeline.

    Returns
    -------
    str
        Normalized date.
    """

    d = date._.parsed_date

    if d is None:
        # dateparser could not interpret the date.
        return "????-??-??"

    delta = date._.parsed_delta
    note_datetime = date.doc._.note_datetime

    if date.label_ in {"absolute", "full_date", "no_day"}:
        normalized = d.strftime("%Y-%m-%d")
    elif date.label_ == "no_year":
        if note_datetime:
            year = note_datetime.strftime("%Y")
        else:
            year = "????"
        normalized = d.strftime(f"{year}-%m-%d")
    else:
        if note_datetime:
            # We need to adjust the timedelta, since most dates are set at 00h00.
            # The slightest difference leads to a day difference.
            d = note_datetime + delta
            normalized = d.strftime("%Y-%m-%d")
        else:
            normalized = td2str(d - datetime.now())

    return normalized

date_parser(text_date)

Function to parse dates. It first tries all available parsers ('timestamp', 'custom-formats', 'absolute-time') except 'relative-time'. If no date is found, it retries with 'relative-time'.

When just the year is identified, it returns a datetime object with month and day equal to 1.
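
The two-pass control flow can be illustrated without dateparser. Everything in this sketch is a stand-in: `strict_parse` and `relative_parse` are hypothetical toy parsers, the tuple they return imitates the `date_obj` and `period` fields read by date_parser, and the year-only branch assumes the underlying parser fills missing fields from the current date.

```python
import re
from datetime import datetime, timedelta
from typing import Optional, Tuple

Parsed = Tuple[Optional[datetime], str]  # (date or None, period)


def strict_parse(text: str, now: datetime) -> Parsed:
    """Stand-in for the first pass: explicit formats only."""
    try:
        return datetime.strptime(text, "%d/%m/%Y"), "day"
    except ValueError:
        pass
    if re.fullmatch(r"\d{4}", text):
        # Assumed behaviour: missing month and day come from the current date.
        return now.replace(year=int(text)), "year"
    return None, "day"


def relative_parse(text: str, now: datetime) -> Parsed:
    """Stand-in for the second pass: a tiny table of relative expressions."""
    offsets = {"hier": -1, "demain": 1}
    if text in offsets:
        return now + timedelta(days=offsets[text]), "day"
    return None, "day"


def parse(text: str, now: datetime) -> Optional[datetime]:
    date, period = strict_parse(text, now)
    if date is not None:
        # A bare year is pinned to 1 January of that year.
        return datetime(date.year, 1, 1) if period == "year" else date
    return relative_parse(text, now)[0]


now = datetime(2022, 6, 15)
print(parse("04/03/2021", now))  # 2021-03-04 00:00:00
print(parse("2021", now))        # 2021-01-01 00:00:00
print(parse("hier", now))        # 2022-06-14 00:00:00
print(parse("bientôt", now))     # None
```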

PARAMETER DESCRIPTION
text_date

TYPE: str

RETURNS DESCRIPTION
datetime
Source code in edsnlp/pipelines/misc/dates/dates.py
def date_parser(text_date: str) -> datetime:
    """
    Function to parse dates. It first tries all available parsers
    ('timestamp', 'custom-formats', 'absolute-time') except 'relative-time'.
    If no date is found, it retries with 'relative-time'.

    When just the year is identified, it returns a datetime object with
    month and day equal to 1.


    Parameters
    ----------
    text_date : str

    Returns
    -------
    datetime
    """

    parsed_date = parser1.get_date_data(text_date)
    if parsed_date.date_obj:
        if parsed_date.period == "year":
            return datetime(year=parsed_date.date_obj.year, month=1, day=1)
        else:
            return parsed_date.date_obj
    else:
        parsed_date2 = parser2.get_date_data(text_date)
        return parsed_date2.date_obj

parse_groupdict(day=None, month=None, year=None, hour=None, minute=None, second=None, **kwargs)

Parse date groupdict.

PARAMETER DESCRIPTION
day

String representation of the day, by default None

TYPE: str, optional DEFAULT: None

month

String representation of the month, by default None

TYPE: str, optional DEFAULT: None

year

String representation of the year, by default None

TYPE: str, optional DEFAULT: None

hour

String representation of the hour, by default None

TYPE: str, optional DEFAULT: None

minute

String representation of the minute, by default None

TYPE: str, optional DEFAULT: None

second

String representation of the second, by default None

TYPE: str, optional DEFAULT: None

RETURNS DESCRIPTION
Dict[str, int]

Parsed groupdict.
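
The input this function expects is the kind of dictionary produced by a regular expression with named groups. The sketch below is a simplified variant: `parse_groups` is hypothetical and converts every known field with `int`, whereas the library uses dedicated `day2int` and `month2int` converters that are not shown on this page. The pattern is also made up for illustration.

```python
import re
from typing import Dict, Optional, Union

FIELDS = ("day", "month", "year", "hour", "minute", "second")


def parse_groups(**groups: Optional[str]) -> Dict[str, Union[int, str, None]]:
    result: Dict[str, Union[int, str, None]] = {}
    for name in FIELDS:
        value = groups.pop(name, None)
        if value is not None:  # unmatched optional groups are skipped
            result[name] = int(value)
    result.update(groups)      # unknown keys pass through unchanged
    return result


pattern = re.compile(
    r"(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4})(?: à (?P<hour>\d{1,2})h)?"
)
print(parse_groups(**pattern.match("04/03/2021").groupdict()))
# {'day': 4, 'month': 3, 'year': 2021}
print(parse_groups(**pattern.match("04/03/2021 à 14h").groupdict()))
# {'day': 4, 'month': 3, 'year': 2021, 'hour': 14}
```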

Source code in edsnlp/pipelines/misc/dates/dates.py
def parse_groupdict(
    day: str = None,
    month: str = None,
    year: str = None,
    hour: str = None,
    minute: str = None,
    second: str = None,
    **kwargs: Dict[str, str],
) -> Dict[str, int]:
    """
    Parse date groupdict.

    Parameters
    ----------
    day : str, optional
        String representation of the day, by default None
    month : str, optional
        String representation of the month, by default None
    year : str, optional
        String representation of the year, by default None
    hour : str, optional
        String representation of the hour, by default None
    minute : str, optional
        String representation of the minute, by default None
    second : str, optional
        String representation of the second, by default None

    Returns
    -------
    Dict[str, int]
        Parsed groupdict.
    """

    result = dict()

    if day is not None:
        result["day"] = day2int(day)

    if month is not None:
        result["month"] = month2int(month)

    if year is not None:
        result["year"] = str2int(year)

    if hour is not None:
        result["hour"] = str2int(hour)

    if minute is not None:
        result["minute"] = str2int(minute)

    if second is not None:
        result["second"] = str2int(second)

    result.update(**kwargs)

    return result

parsing

str2int(time)

Converts a string to an integer. Returns None if the string cannot be converted.

PARAMETER DESCRIPTION
time

String representation

TYPE: str

RETURNS DESCRIPTION
int

Integer conversion.

Source code in edsnlp/pipelines/misc/dates/parsing.py
def str2int(time: str) -> int:
    """
    Converts a string to an integer. Returns `None` if the string cannot be converted.

    Parameters
    ----------
    time : str
        String representation

    Returns
    -------
    int
        Integer conversion.
    """
    try:
        return int(time)
    except ValueError:
        return None

time2int_factory(patterns)

Factory for a time2int conversion function.

PARAMETER DESCRIPTION
patterns

Dictionary of conversion/pattern.

TYPE: Dict[str, int]

RETURNS DESCRIPTION
Callable[[str], int]

String to integer function.
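
A usage sketch may help show what the factory produces. The function below is a condensed, self-contained rewrite of the source listing, and the month patterns passed to it are hypothetical; the library's own pattern tables are not shown on this page.

```python
import re
from typing import Callable, Dict, Optional


def time2int_factory(patterns: Dict[str, int]) -> Callable[[str], Optional[int]]:
    def time2int(time: str) -> Optional[int]:
        # Plain numbers are converted directly.
        try:
            return int(time)
        except ValueError:
            pass
        # Otherwise, return the value of the first pattern matching the whole string.
        for pattern, value in patterns.items():
            if re.match(f"^{pattern}$", time):
                return value
        return None

    return time2int


# Hypothetical pattern table for a few French month names.
month2int = time2int_factory(
    {r"janv(ier|\.)?": 1, r"f[ée]v(rier|\.)?": 2, r"mars": 3}
)
print(month2int("02"))         # 2
print(month2int("février"))    # 2
print(month2int("fev."))       # 2
print(month2int("marsupial"))  # None
```

Anchoring each pattern with `^` and `$` means that partial matches such as marsupial are rejected.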

Source code in edsnlp/pipelines/misc/dates/parsing.py
def time2int_factory(patterns: Dict[str, int]) -> Callable[[str], int]:
    """
    Factory for a `time2int` conversion function.

    Parameters
    ----------
    patterns : Dict[str, int]
        Dictionary of conversion/pattern.

    Returns
    -------
    Callable[[str], int]
        String to integer function.
    """

    def time2int(time: str) -> int:
        """
        Converts a string representation to the proper integer,
        iterating over a dictionary of pattern/conversion.

        Parameters
        ----------
        time : str
            String representation

        Returns
        -------
        int
            Integer conversion
        """
        m = str2int(time)

        if m is not None:
            return m

        for pattern, key in patterns.items():
            if re.match(f"^{pattern}$", time):
                m = key
                break

        return m

    return time2int