Measurements

The eds.measurements matcher detects and normalizes numerical measurements within a medical document.

Warning

The measurements pipeline is still in active development and has not been rigorously validated. If you come across a measurement expression that goes undetected, please file an issue!

Scope

The eds.measurements matcher can extract simple measurements (e.g. 3cm). It can also detect elliptic enumerations of measurements of the same type (e.g. 32, 33 et 34kg) and split them into separate measurements.
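To make the idea of splitting an elliptic enumeration concrete, here is a minimal standalone sketch. It is not the matcher's implementation: the `ENUMERATION` pattern, the short unit list and the `split_enumeration` helper are all hypothetical and only cover the simple case of numbers separated by "," or "et" and followed by one shared unit.

```python
import re

# Hypothetical pattern (assumption, not the library's): one or more numbers
# separated by "," or "et", followed by a single shared unit.
ENUMERATION = re.compile(
    r"(?P<numbers>\d+(?:[.,]\d+)?(?:\s*(?:,|et)\s*\d+(?:[.,]\d+)?)*)"
    r"\s*(?P<unit>kg|cm|mm|m)\b"
)
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def split_enumeration(text):
    """Return one (value, unit) pair per number sharing the trailing unit."""
    match = ENUMERATION.search(text)
    if match is None:
        return []
    unit = match.group("unit")
    return [
        # A comma directly between digits is read as a decimal separator.
        (float(raw.replace(",", ".")), unit)
        for raw in NUMBER.findall(match.group("numbers"))
    ]


print(split_enumeration("32, 33 et 34kg"))
```

Each number in the enumeration inherits the unit that appears only once at the end, which is what allows the three weights to be reported as three separate measurements.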

The normalized value can then be accessed via the span._.{measure_name} attribute, for instance span._.size or span._.weight, and converted on the fly to a desired unit. As with other components, the span._.value extension can also be used to access the normalized value of any measurement span.
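The on-the-fly conversion can be pictured with a small toy class. This is only a sketch of the general technique (attribute access mapped to a unit conversion): the `SimpleSize` class and the `PER_METRE` table are hypothetical names, and the object actually returned by the matcher may be built quite differently.

```python
# Hypothetical table (assumption): how many of each unit fit in one metre.
PER_METRE = {"m": 1, "cm": 100, "mm": 1000}


class SimpleSize:
    """Toy stand-in for a normalized size value."""

    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

    def __getattr__(self, target_unit):
        # Only called when normal attribute lookup fails, so `value` and
        # `unit` are untouched; unknown units raise AttributeError as usual.
        if target_unit not in PER_METRE:
            raise AttributeError(target_unit)
        return self.value * PER_METRE[target_unit] / PER_METRE[self.unit]

    def __str__(self):
        return f"{self.value} {self.unit}"


size = SimpleSize(1.78, "m")
print(str(size), size.cm)
```

Storing a single value and unit, and converting only when a unit attribute is requested, keeps the original reading intact while still allowing any compatible unit to be requested.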

The current matcher annotates the following measurements out of the box:

Measurement name	Example
`size`	`1m50`, `1.50m`
`weight`	`12kg`, `1kg300`
`bmi`	`BMI: 24`, `24 kg.m-2`
`volume`	`2 cac`, `8ml`

Examples

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    "eds.measurements",
    config=dict(
        measurements=["size", "weight", "bmi"],
        extract_ranges=True,
    ),
)

text = """
Le patient est admis hier, fait 1m78 pour 76kg.
Les deux nodules bénins sont larges de 1,2 et 2.4mm.
BMI: 24.

Le nodule fait entre 1 et 1.5 cm
"""

doc = nlp(text)

measurements = doc.spans["measurements"]

measurements
# Out: [1m78, 76kg, 1,2, 2.4mm, 24, entre 1 et 1.5 cm]

measurements[0]
# Out: 1m78

str(measurements[0]._.size), str(measurements[0]._.value)
# Out: ('1.78 m', '1.78 m')

measurements[0]._.value.cm
# Out: 178.0

measurements[2]
# Out: 1,2

str(measurements[2]._.value)
# Out: '1.2 mm'

str(measurements[2]._.value.mm)
# Out: '1.2'

measurements[4]
# Out: 24

str(measurements[4]._.value)
# Out: '24 kg_per_m2'

str(measurements[4]._.value.kg_per_m2)
# Out: '24'

str(measurements[5]._.value)
# Out: '1-1.5 cm'

To extract all sizes in centimeters, and average range measurements, you can use the following snippet:

sizes = [
    sum(item.cm for item in m._.value) / len(m._.value)
    for m in doc.spans["measurements"]
    if m.label_ == "size"
]
sizes
# Out: [178.0, 0.12, 0.24, 1.25]

Customization

You can declare custom measurements by altering the patterns:

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    "eds.measurements",
    config=dict(
        measurements={
            "my_custom_surface_measurement": {
                # This measurement's unit has the dimension of square meters
                "unit": "m2",
                # Handle cases like "surface: 1.8" (implied m2),
                # vs "surface: 50" (implied cm2)
                "unitless_patterns": [
                    {
                        "terms": ["surface", "aire"],
                        "ranges": [
                            {"unit": "m2", "min": 0, "max": 9},
                            {"unit": "cm2", "min": 10, "max": 100},
                        ],
                    }
                ],
            },
        }
    ),
)
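The role of the "ranges" entries can be illustrated with a short standalone sketch: a number that follows a trigger term without any unit is given the unit of the range it falls into. The `resolve_unit` helper is hypothetical, and treating both bounds as inclusive is an assumption; the matcher may handle bounds differently.

```python
def resolve_unit(value, ranges):
    """Return the unit of the first range containing `value`, or None.

    Assumptions (not confirmed by the library): both bounds are inclusive,
    and a missing "min" or "max" leaves that side open-ended.
    """
    for candidate in ranges:
        lower = candidate.get("min", float("-inf"))
        upper = candidate.get("max", float("inf"))
        if lower <= value <= upper:
            return candidate["unit"]
    return None


ranges = [
    {"unit": "m2", "min": 0, "max": 9},
    {"unit": "cm2", "min": 10, "max": 100},
]

print(resolve_unit(1.8, ranges))  # "surface: 1.8" is read as square meters
print(resolve_unit(50, ranges))   # "surface: 50" is read as square centimeters
```

With this configuration, a value such as 9.5 or 500 falls outside every range, so under these assumptions no unit can be inferred for it.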

Extensions

The eds.measurements pipeline declares its extensions dynamically, depending on the measurements parameter: each measurement gets its own extension, and is assigned to a different span group.

Parameters

PARAMETER	DESCRIPTION
`nlp`	The pipeline object TYPE: `PipelineProtocol`
`name`	The name of the component. TYPE: `str`
`measurements`	A mapping from measure names to MsrConfig. Each measure's configuration has the following shape: `{ # the unit (e.g. "kg"), "unit": str, "unitless_patterns": { # preceding trigger terms "terms": List[str], # unitless ranges -> unit patterns "ranges": List[ {"min": int, "max": int, "unit": str}, {"min": int, "unit": str}, ..., ], ... } }` TYPE: `Union[str, List[Union[str, MsrConfig]], Dict[str, MsrConfig]]` DEFAULT: `['weight', 'size', 'bmi', 'volume']`
`number_terms`	A mapping of numbers to their lexical variants DEFAULT: `{'0.125': ['⅛'], '0.16666666': ['⅙'], '0.2': ['...`
`stopwords`	A list of stopwords that do not matter when placed between a unitless trigger and a number DEFAULT: `['par', 'sur', 'de', 'a', ',', 'et']`
`unit_divisors`	A list of terms used to divide two units (like: m / s) DEFAULT: `['/', 'par']`
`attr`	Whether to match on the text ('TEXT') or on the normalized text ('NORM') TYPE: `str` DEFAULT: `NORM`
`ignore_excluded`	Whether to exclude pollution patterns when matching in the text TYPE: `bool` DEFAULT: `True`
`compose_units`	Whether to compose units (like "m/s" or "m.s-1") DEFAULT: `True`
`extract_ranges`	Whether to extract ranges (like "entre 1 et 2 cm") DEFAULT: `False`
`range_patterns`	A list of "{FROM} xx {TO} yy" patterns to match range measurements DEFAULT: `[('De', 'à'), ('De', 'a'), ('de', 'à'), ('de', ...`
`after_snippet_limit`	Maximum word distance after the number within which a part of the measurement can be linked to it DEFAULT: `6`
`before_snippet_limit`	Maximum word distance before the number within which a part of the measurement can be linked to it DEFAULT: `10`
`span_setter`	How to set the spans in the document. By default, each measurement will be assigned to its own span group (using either the "name" field of the config, or the key if you passed a dict), and to the "measurements" group. DEFAULT: `None`
`span_getter`	Where to look for measurements in the doc. By default, look in the whole doc. You can combine this with the `merge_mode` argument for interesting results. TYPE: `SpanGetterArg` DEFAULT: `None`
`merge_mode`	How to merge matches with the spans from `span_getter`, if given: `intersect`: return only the matches that fall in the `span_getter` spans; `align`: if a match overlaps a span from `span_getter` (e.g. a match extracted by a machine learning model), return the `span_getter` span instead, and assign all the parsed information (e.g. `._.value`) to it. Otherwise, don't return the match. TYPE: `Literal['intersect', 'align']` DEFAULT: `intersect`
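The difference between the two `merge_mode` values can be sketched on plain (start, end) offsets. This is an illustration only: the `merge` function is hypothetical, and reading "fall in" as "fully contained" is an assumption rather than a statement about the component's exact behaviour.

```python
def merge(matches, spans, mode="intersect"):
    """Toy merge of rule-based matches with pre-existing spans.

    Both inputs are lists of (start, end) token offsets, end exclusive.
    """
    if mode not in ("intersect", "align"):
        raise ValueError(f"unknown mode: {mode}")
    results = []
    for m_start, m_end in matches:
        for s_start, s_end in spans:
            contained = s_start <= m_start and m_end <= s_end
            overlaps = m_start < s_end and s_start < m_end
            if mode == "intersect" and contained:
                results.append((m_start, m_end))  # keep the match itself
                break
            if mode == "align" and overlaps:
                results.append((s_start, s_end))  # keep the existing span
                break
    return results


matches = [(2, 4), (10, 12)]
spans = [(0, 6), (11, 15)]
print(merge(matches, spans, "intersect"))
print(merge(matches, spans, "align"))
```

In this sketch, `intersect` filters the matcher's own spans, whereas `align` returns the boundaries of the existing spans, which is useful when those boundaries come from another component such as a trained model.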

Authors and citation

The eds.measurements pipeline was developed by AP-HP's Data Science team.