Measurements

The eds.measurements matcher detects and normalizes numerical measurements within a medical document.

Warning

The measurements pipeline is still in active development and has not been rigorously validated. If you come across a measurement expression that goes undetected, please file an issue!

Scope

The eds.measurements matcher can extract simple measurements (e.g. 3cm). It can also detect elliptic enumerations of measurements of the same type (e.g. 32, 33 et 34kg) and split them into individual measurements.

The normalized value can then be accessed via the span._.{measure_name} attribute (for instance span._.size or span._.weight) and converted on the fly to a desired unit. As with other components, the span._.value extension can also be used to access the normalized value of any measurement span.
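
To make the idea of splitting an elliptic enumeration concrete, here is a minimal standalone sketch. It does not use edsnlp and is not how the matcher is implemented: the regular expressions, the short unit list and the `split_enumeration` helper are all assumptions made for illustration. It only shows how a trailing unit can be shared across several enumerated numbers.

```python
import re

# Illustrative sketch only: this is NOT the edsnlp implementation.
NUMBER = r"\d+(?:[.,]\d+)?"          # "32", "1,2" or "2.4"
SEPARATOR = r"(?:\s*,\s+|\s+et\s+)"  # ", " or " et " between enumerated numbers
UNIT = r"(?:kg|cm|mm|ml|m)"          # small hypothetical unit list
ENUMERATION = re.compile(rf"({NUMBER}(?:{SEPARATOR}{NUMBER})*)\s*({UNIT})\b")


def split_enumeration(text):
    """Return one (value, unit) pair per number, sharing the trailing unit."""
    pairs = []
    for match in ENUMERATION.finditer(text):
        numbers, unit = match.groups()
        for raw in re.findall(NUMBER, numbers):
            # A comma directly between digits is treated as a decimal separator.
            pairs.append((float(raw.replace(",", ".")), unit))
    return pairs


print(split_enumeration("32, 33 et 34kg"))
# [(32.0, 'kg'), (33.0, 'kg'), (34.0, 'kg')]
```

In this sketch, a comma followed by whitespace separates two numbers, while a comma directly between digits is read as a decimal mark, so "1,2 et 2.4mm" yields two values in millimeters.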

The current matcher annotates the following measurements out of the box:

Measurement name    Example
size                1m50, 1.50m
weight              12kg, 1kg300
bmi                 BMI: 24, 24 kg.m-2
volume              2 cac, 8ml

Examples

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    "eds.measurements",
    config=dict(
        measurements=["size", "weight", "bmi"],
        extract_ranges=True,
    ),
)

text = """
Le patient est admis hier, fait 1m78 pour 76kg.
Les deux nodules bénins sont larges de 1,2 et 2.4mm.
BMI: 24.

Le nodule fait entre 1 et 1.5 cm
"""

doc = nlp(text)

measurements = doc.spans["measurements"]

measurements
# Out: [1m78, 76kg, 1,2, 2.4mm, 24, entre 1 et 1.5 cm]

measurements[0]
# Out: 1m78

str(measurements[0]._.size), str(measurements[0]._.value)
# Out: ('1.78 m', '1.78 m')

measurements[0]._.value.cm
# Out: 178.0

measurements[2]
# Out: 1,2

str(measurements[2]._.value)
# Out: '1.2 mm'

str(measurements[2]._.value.mm)
# Out: '1.2'

measurements[4]
# Out: 24

str(measurements[4]._.value)
# Out: '24 kg_per_m2'

str(measurements[4]._.value.kg_per_m2)
# Out: '24'

str(measurements[5]._.value)
# Out: '1-1.5 cm'

To extract all sizes in centimeters, and average range measurements, you can use the following snippet:

sizes = [
    sum(item.cm for item in m._.value) / len(m._.value)
    for m in doc.spans["measurements"]
    if m.label_ == "size"
]
sizes
# Out: [178.0, 0.12, 0.24, 1.25]

Customization

You can declare custom measurements by altering the patterns:

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    "eds.measurements",
    config=dict(
        measurements={
            "my_custom_surface_measurement": {
                # This measurement unit is homogeneous to square meters
                "unit": "m2",
                # Handle cases like "surface: 1.8" (implied m2),
                # vs "surface: 50" (implied cm2)
                "unitless_patterns": [
                    {
                        "terms": ["surface", "aire"],
                        "ranges": [
                            {"unit": "m2", "min": 0, "max": 9},
                            {"unit": "cm2", "min": 10, "max": 100},
                        ],
                    }
                ],
            },
        }
    ),
)
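
The "ranges" entries above decide which unit to assume when a number follows a trigger term without an explicit unit. The following standalone sketch illustrates that selection logic. It is not the edsnlp implementation: the `resolve_unit` helper is hypothetical, and treating both bounds as inclusive is an assumption.

```python
# Illustrative sketch only, not the edsnlp implementation.
RANGES = [
    {"unit": "m2", "min": 0, "max": 9},
    {"unit": "cm2", "min": 10, "max": 100},
]


def resolve_unit(value, ranges):
    """Return the unit of the first range containing `value`, or None."""
    for rng in ranges:
        # A missing bound is treated as unbounded on that side.
        lower = rng.get("min", float("-inf"))
        upper = rng.get("max", float("inf"))
        # Assumption: both bounds are inclusive.
        if lower <= value <= upper:
            return rng["unit"]
    return None


print(resolve_unit(1.8, RANGES))  # m2
print(resolve_unit(50, RANGES))   # cm2
```

Under these assumptions, "surface: 1.8" would be read as square meters and "surface: 50" as square centimeters, while a value outside every range gets no unit.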

Extensions

The eds.measurements pipeline declares its extensions dynamically, depending on the measurements parameter: each measurement gets its own extension, and is assigned to a different span group.

Parameters

PARAMETER DESCRIPTION
nlp

The pipeline object

TYPE: PipelineProtocol

name

The name of the component.

TYPE: str

measurements

A mapping from measure names to MsrConfig. Each measure's configuration has the following shape:

{
  # the unit (e.g. "kg"),
  "unit": str,
  "unitless_patterns": List[{
    # preceding trigger terms
    "terms": List[str],
    # unitless ranges -> unit patterns
    "ranges": List[
      {"min": int, "max": int, "unit": str},
      {"min": int, "unit": str},
      ...,
    ],
    ...
  }]
}

TYPE: Union[str, List[Union[str, MsrConfig]], Dict[str, MsrConfig]] DEFAULT: ['weight', 'size', 'bmi', 'volume']

number_terms

A mapping of numbers to their lexical variants

DEFAULT: {'0.125': ['⅛'], '0.16666666': ['⅙'], '0.2': ['...

stopwords

A list of stopwords that do not matter when placed between a unitless trigger and a number

DEFAULT: ['par', 'sur', 'de', 'a', ',', 'et']

unit_divisors

A list of terms used to divide two units (like: m / s)

DEFAULT: ['/', 'par']

attr

Whether to match on the text ('TEXT') or on the normalized text ('NORM')

TYPE: str DEFAULT: NORM

ignore_excluded

Whether to exclude pollution patterns when matching in the text

TYPE: bool DEFAULT: True

compose_units

Whether to compose units (like "m/s" or "m.s-1")

DEFAULT: True

extract_ranges

Whether to extract ranges (like "entre 1 et 2 cm")

DEFAULT: False

range_patterns

A list of "{FROM} xx {TO} yy" patterns to match range measurements

DEFAULT: [('De', 'à'), ('De', 'a'), ('de', 'à'), ('de', ...
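
As a rough illustration of what a "{FROM} xx {TO} yy" pattern means, the standalone sketch below matches a range with plain regular expressions. It is not the edsnlp implementation: the `find_range` helper is hypothetical, and the pattern list is a small illustrative subset, not the component's default value.

```python
import re

# Illustrative sketch only, not the edsnlp implementation.
# Hypothetical subset of (FROM, TO) pairs, chosen for the example.
RANGE_PATTERNS = [("de", "à"), ("entre", "et")]
NUMBER = r"\d+(?:[.,]\d+)?"


def find_range(text):
    """Return (low, high, unit) for the first "{FROM} xx {TO} yy unit" found."""
    for start, end in RANGE_PATTERNS:
        pattern = (
            rf"\b{re.escape(start)}\s+({NUMBER})\s+"
            rf"{re.escape(end)}\s+({NUMBER})\s*([a-zA-Z]+)"
        )
        match = re.search(pattern, text)
        if match:
            low, high, unit = match.groups()
            return (
                float(low.replace(",", ".")),
                float(high.replace(",", ".")),
                unit,
            )
    return None


print(find_range("Le nodule fait entre 1 et 1.5 cm"))
# (1.0, 1.5, 'cm')
```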

after_snippet_limit

Maximum distance, in words, after a number within which a part of the measurement can still be linked to it

DEFAULT: 6

before_snippet_limit

Maximum distance, in words, before a number within which a part of the measurement can still be linked to it

DEFAULT: 10

span_setter

How to set the spans in the document. By default, each measurement will be assigned to its own span group (using either the "name" field of the config, or the key if you passed a dict), and to the "measurements" group.

DEFAULT: None

span_getter

Where to look for measurements in the doc. By default, look in the whole doc. You can combine this with the merge_mode argument for interesting results.

TYPE: SpanGetterArg DEFAULT: None

merge_mode

How to merge matches with the spans from span_getter, if given:

  • intersect: return only the matches that fall in the span_getter spans
  • align: if a match overlaps a span from span_getter (e.g. a match extracted by a machine learning model), return the span_getter span instead, and assign all the parsed information (e.g. ._.value) to it. Otherwise, don't return the match.

TYPE: Literal['intersect', 'align'] DEFAULT: intersect
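
The difference between the two merge modes can be sketched with plain (start, end) offsets standing in for spans. This is not the edsnlp implementation: the `merge` helper is hypothetical, and reading "fall in" as full containment is an assumption.

```python
# Illustrative sketch only, not the edsnlp implementation.
# Spans are modelled as (start, end) offsets, with `end` exclusive.


def merge(matches, getter_spans, mode="intersect"):
    if mode == "intersect":
        # Keep the matches that lie entirely inside one span_getter span.
        return [
            m
            for m in matches
            if any(s[0] <= m[0] and m[1] <= s[1] for s in getter_spans)
        ]
    if mode == "align":
        # Replace each match by the first span_getter span it overlaps;
        # matches that overlap nothing are dropped.
        aligned = []
        for m in matches:
            for s in getter_spans:
                if m[0] < s[1] and s[0] < m[1]:
                    aligned.append(s)
                    break
        return aligned
    raise ValueError(f"Unknown merge mode: {mode}")


matches = [(5, 9), (20, 24)]
getter_spans = [(4, 12)]
print(merge(matches, getter_spans, "intersect"))  # [(5, 9)]
print(merge(matches, getter_spans, "align"))      # [(4, 12)]
```

In the real component, the parsed measurement would also be attached to the returned span in align mode. The sketch only shows which spans are returned.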

Authors and citation

The eds.measurements pipeline was developed by AP-HP's Data Science team.