Quantities

The eds.quantities matcher detects and normalizes numerical quantities within a medical document.

Warning

The quantities pipeline is still in active development and has not been rigorously validated. If you come across a quantity expression that goes undetected, please file an issue.

Pipe definition

text = """Poids : 65. Taille : 1.75
          On mesure ... à 3mmol/l ; pression : 100mPa-110mPa.
          Acte réalisé par ... à 12h13"""

The three examples below cover, in order: all quantities, custom quantities, and predefined quantities.

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.sentences")
nlp.add_pipe("eds.tables")
nlp.add_pipe(
    "eds.quantities",
    config=dict(
        quantities="all",  # (3)
        extract_ranges=True,  # (1)
        use_tables=True,  # (2)
    ),
)
nlp(text).spans["quantities"]
# Out: [65, 1.75, 3mmol/l, 100mPa-110mPa, 12h13]

(1) Ranges are extracted too, e.g. 100-110mg, 2 à 4 jours ...
(2) If True, eds.tables must be part of the pipeline.
(3) All units listed under Availability will be detected.

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.sentences")
nlp.add_pipe("eds.tables")
nlp.add_pipe(
    "eds.quantities",
    config=dict(
        quantities={
            "concentration": {"unit": "mol_per_l"},
            "pressure": {"unit": "Pa"},
        },  # (3)
        extract_ranges=True,  # (1)
        use_tables=True,  # (2)
    ),
)
nlp(text).spans["quantities"]
# Out: [3mmol/l, 100mPa-110mPa]

(1) Ranges are extracted too, e.g. 100-110mg, 2 à 4 jours ...
(2) If True, eds.tables must be part of the pipeline.
(3) Which units are available? See Availability. Want to customize further? See Customization.

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.sentences")
nlp.add_pipe("eds.tables")
nlp.add_pipe(
    "eds.quantities",
    config=dict(
        quantities=["weight", "size"],  # (3)
        extract_ranges=True,  # (1)
        use_tables=True,  # (2)
    ),
)
nlp(text).spans["quantities"]
# Out: [65, 1.75]

(1) Ranges are extracted too, e.g. 100-110mg, 2 à 4 jours ...
(2) If True, eds.tables must be part of the pipeline.
(3) Which quantities are available? See Availability.

Scope

The eds.quantities matcher extracts simple quantities (e.g. 3cm). It also detects elliptic enumerations of quantities of the same type (e.g. 32, 33 et 34kg) and splits them into separate quantities.
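To make the enumeration behaviour concrete, here is a minimal standalone sketch of the idea using only the standard library. It is not the matcher's implementation: the helper name `split_enumeration`, the regular expressions, and the assumption that a single unit trails the last number are all illustrative.

```python
import re

# Hypothetical helper, for illustration only: eds.quantities relies on its
# own patterns, not on these regular expressions.
NUMBER = r"\d+(?:[.,]\d+)?"
ENUMERATION = re.compile(
    rf"(?P<numbers>{NUMBER}(?:\s*(?:,|et)\s*{NUMBER})*)\s*(?P<unit>[a-zA-Z]+)"
)


def split_enumeration(text):
    """Return one (value, unit) pair per number sharing a trailing unit."""
    match = ENUMERATION.search(text)
    if match is None:
        return []
    unit = match.group("unit")
    values = re.findall(NUMBER, match.group("numbers"))
    # French decimal commas are normalized to dots before conversion.
    return [(float(v.replace(",", ".")), unit) for v in values]


print(split_enumeration("32, 33 et 34kg"))
# [(32.0, 'kg'), (33.0, 'kg'), (34.0, 'kg')]
```

Each number in the enumeration inherits the unit that was written only once, which is the "split accordingly" step described above.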

The normalized value can then be accessed via the span._.{measure_name} attribute, for instance span._.size or span._.weight, and converted on the fly to the desired unit. As with other components, the span._.value extension also gives access to the normalized value of any quantity span.
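One way to picture "converted on the fly" is an object that resolves unit names as attributes. The sketch below is a hypothetical stand-in written for this page, not the class that eds.quantities actually stores in span._.value; the name `SketchQuantity` and the small conversion table are assumptions.

```python
# Conversion factors to a base unit (metres); an illustrative subset only.
TO_METERS = {"m": 1.0, "cm": 0.01, "mm": 0.001}


class SketchQuantity:
    """Hypothetical stand-in for the object stored in span._.value."""

    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

    def __getattr__(self, target_unit):
        # Only called for attributes not found normally, e.g. `.cm`.
        try:
            factor = TO_METERS[self.unit] / TO_METERS[target_unit]
        except KeyError:
            raise AttributeError(target_unit) from None
        return self.value * factor

    def __str__(self):
        return f"{self.value} {self.unit}"


height = SketchQuantity(1.78, "m")
print(str(height))          # 1.78 m
print(round(height.cm, 6))  # 178.0
```

The value keeps the unit it was written in, and a conversion is only computed when a target unit is requested.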

See the Availability section for details on which units are handled.

Examples

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.quantities(
        quantities=["size", "weight", "bmi"],
        extract_ranges=True,
    ),
)

text = """
Le patient est admis hier, fait 1m78 pour 76kg.
Les deux nodules bénins sont larges de 1,2 et 2.4mm.
BMI: 24.

Le nodule fait entre 1 et 1.5 cm
"""

doc = nlp(text)

quantities = doc.spans["quantities"]

quantities
# Out: [1m78, 76kg, 1,2, 2.4mm, 24, entre 1 et 1.5 cm]

quantities[0]
# Out: 1m78

str(quantities[0]._.size), str(quantities[0]._.value)
# Out: ('1.78 m', '1.78 m')

quantities[0]._.value.cm
# Out: 178.0

quantities[2]
# Out: 1,2

str(quantities[2]._.value)
# Out: '1.2 mm'

str(quantities[2]._.value.mm)
# Out: '1.2'

quantities[4]
# Out: 24

str(quantities[4]._.value)
# Out: '24 kg_per_m2'

str(quantities[4]._.value.kg_per_m2)
# Out: 24

str(quantities[5]._.value)
# Out: '1-1.5 cm'

To extract all sizes in centimeters, and average range quantities, you can use the following snippet:

sizes = [
    sum(item.cm for item in m._.value) / len(m._.value)
    for m in doc.spans["quantities"]
    if m.label_ == "size"
]
sizes
# Out: [178.0, 0.12, 0.24, 1.25]

To extract the quantities from many texts, you can use the following snippet:

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.quantities(quantities="weight", extract_ranges=True, as_ents=True),
)
texts = ["Le patient mesure 40000,0 g (aussi noté 40 kg)"]
docs = edsnlp.data.from_iterable(texts)
docs = docs.map_pipeline(nlp)
docs.to_pandas(
    converter="ents",
    # Map each span attribute to the column name shown in the output below
    span_attributes={"value.unit": "original_unit", "value.kg": "kg"},
)
#   note_id  start  end   label lexical_variant span_type original_unit    kg
# 0    None     18   27  weight       40000,0 g      ents             g  40.0
# 1    None     40   45  weight           40 kg      ents            kg  40.0

Available units and quantities

Feel free to propose any missing raw unit or predefined quantity.

Raw units and their derivations (g, mg, mgr ...) and their compositions (g/ml, cac/j ...) can be detected.

Available raw units:

g, m, m2, m3, mol, ui, Pa, %, log, mmHg, s/min/h/d/w/m/y, arc-second, °, °C, cac, goutte, l, x10*4, x10*5
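As a rough illustration of unit composition, the sketch below splits a composed unit into numerator and denominator using the default `unit_divisors` values listed in the Parameters table. The helper `split_composed_unit` is hypothetical and only handles a single divisor; the matcher's own handling of derivations and compositions is more complete.

```python
import re

# Default value of the `unit_divisors` parameter documented below.
UNIT_DIVISORS = ["/", "par"]


def split_composed_unit(unit_text, divisors=UNIT_DIVISORS):
    """Split 'g/ml' or 'cac par j' into (numerator, denominator)."""
    # Assumption: word-like divisors ("par") must be surrounded by spaces,
    # while symbols ("/") may be glued to the units.
    alternatives = [
        rf"\s+{re.escape(d)}\s+" if d.isalpha() else rf"\s*{re.escape(d)}\s*"
        for d in divisors
    ]
    parts = re.split("|".join(alternatives), unit_text.strip(), maxsplit=1)
    numerator = parts[0]
    denominator = parts[1] if len(parts) > 1 else None
    return numerator, denominator


print(split_composed_unit("g/ml"))       # ('g', 'ml')
print(split_composed_unit("cac par j"))  # ('cac', 'j')
print(split_composed_unit("mg"))         # ('mg', None)
```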

Available predefined quantities:

quantity_name	Example
`size`	`1m50`, `1.50m`...
`weight`	`1kg`, `Poids : 65`...
`bmi`	`BMI: 24`, `24 kg.m-2`
`volume`	`2 cac`, `8ml`...

See the patterns for the exhaustive definition.

Customization

You can declare custom quantities by altering the patterns:

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.quantities(
        quantities={
            "my_custom_surface_quantity": {
                # This quantity's unit is homogeneous to square meters
                "unit": "m2",
                # Handle cases like "surface: 1.8" (implied m2),
                # vs "surface: 50" (implied cm2)
                "unitless_patterns": [
                    {
                        "terms": ["surface", "aire"],
                        "ranges": [
                            {"unit": "m2", "min": 0, "max": 9},
                            {"unit": "cm2", "min": 10, "max": 100},
                        ],
                    }
                ],
            },
        }
    ),
)
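The `ranges` entries above decide which unit a bare number receives after a trigger term. The following sketch shows one plausible reading of that lookup; the function `infer_unit` is hypothetical, and the bound semantics (inclusive `min`, exclusive `max`, open-ended when a bound is missing) are assumptions rather than documented behaviour.

```python
# Same shape as the "ranges" entry of the configuration above.
RANGES = [
    {"unit": "m2", "min": 0, "max": 9},
    {"unit": "cm2", "min": 10, "max": 100},
]


def infer_unit(value, ranges):
    """Return the unit of the first range containing `value`, else None.

    Assumption: `min` is inclusive and `max` exclusive; a missing bound
    is treated as open-ended.
    """
    for r in ranges:
        lower = r.get("min", float("-inf"))
        upper = r.get("max", float("inf"))
        if lower <= value < upper:
            return r["unit"]
    return None


print(infer_unit(1.8, RANGES))  # m2   ("surface: 1.8")
print(infer_unit(50, RANGES))   # cm2  ("surface: 50")
print(infer_unit(9.5, RANGES))  # None (falls between the two ranges)
```

Under this reading, a value that falls outside every range gets no implied unit, so gaps between ranges are worth checking when writing a configuration.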

Extensions

The eds.quantities pipeline declares its extensions dynamically, depending on the quantities parameter: each quantity gets its own extension, and is assigned to a different span group.

Parameters

PARAMETER	DESCRIPTION
`nlp`	The pipeline object TYPE: `PipelineProtocol`
`name`	The name of the component. TYPE: `str` DEFAULT: `'quantities'`
`quantities`	A mapping from quantity names to MsrConfig. Each quantity's configuration has the following shape: `{"unit": str, "unitless_patterns": [{"terms": List[str], "ranges": [{"min": int, "max": int, "unit": str}, {"min": int, "unit": str}, ...]}, ...]}`, where `unit` is the unit of the quantity (e.g. "kg"), `terms` lists the preceding trigger terms, and `ranges` maps unitless value ranges to units. Set `quantities="all"` to extract all raw quantities from the units_config file. TYPE: `Union[str, List[Union[str, MsrConfig]], Dict[str, MsrConfig]]` DEFAULT: `['weight', 'size', 'bmi', 'volume']`
`number_terms`	A mapping of numbers to their lexical variants TYPE: `Dict[str, List[str]]` DEFAULT: `{'0.125': ['⅛'], '0.16666666': ['⅙'], '0.2': ['...`
`stopwords`	A list of stopwords that do not matter when placed between a unitless trigger and a number TYPE: `List[str]` DEFAULT: `['par', 'sur', 'de', 'a', ',', 'et', '-', 'à']`
`unit_divisors`	A list of terms used to divide two units (like: m / s) TYPE: `List[str]` DEFAULT: `['/', 'par']`
`attr`	Whether to match on the text ('TEXT') or on the normalized text ('NORM') TYPE: `str` DEFAULT: `NORM`
`ignore_excluded`	Whether to exclude pollution patterns when matching in the text TYPE: `bool` DEFAULT: `True`
`compose_units`	Whether to compose units (like "m/s" or "m.s-1") TYPE: `bool` DEFAULT: `True`
`extract_ranges`	Whether to extract ranges (like "entre 1 et 2 cm") TYPE: `bool` DEFAULT: `False`
`range_patterns`	A list of "{FROM} xx {TO} yy" patterns to match range quantities TYPE: `List[Tuple[Optional[str], Optional[str]]]` DEFAULT: `[('De', 'à'), ('De', 'a'), ('de', 'à'), ('de', ...`
`after_snippet_limit`	Maximum word distance to link a part of a quantity located after its number TYPE: `int` DEFAULT: `6`
`before_snippet_limit`	Maximum word distance to link a part of a quantity located before its number TYPE: `int` DEFAULT: `10`
`span_setter`	How to set the spans in the document. By default, each quantity will be assigned to its own span group (using either the "name" field of the config, or the key if you passed a dict), and to the "quantities" group. TYPE: `Optional[SpanSetterArg]` DEFAULT: `None`
`span_getter`	Where to look for quantities in the doc. By default, look in the whole doc. You can combine this with the `merge_mode` argument for interesting results. TYPE: `SpanGetterArg` DEFAULT: `None`
`merge_mode`	How to merge matches with the spans from `span_getter`, if given: `intersect`: return only the matches that fall in the `span_getter` spans; `align`: if a match overlaps a span from `span_getter` (e.g. a match extracted by a machine learning model), return the `span_getter` span instead, and assign all the parsed information (e.g. `._.value`) to it. Otherwise, don't return the match. TYPE: `Literal['intersect', 'align']` DEFAULT: `intersect`
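The difference between the two `merge_mode` values can be sketched on plain character offsets. This is an illustrative model only: the function `merge` is hypothetical, spans are reduced to `(start, end)` tuples, and reading "fall in" as "fully contained" is an assumption.

```python
def overlaps(a, b):
    """True when the half-open spans a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def merge(matches, getter_spans, mode="intersect"):
    """Combine matcher output with spans coming from a span getter."""
    if mode == "intersect":
        # Keep only the matches fully contained in some getter span.
        return [
            m
            for m in matches
            if any(s[0] <= m[0] and m[1] <= s[1] for s in getter_spans)
        ]
    if mode == "align":
        # Return the getter span each match overlaps; drop unmatched ones.
        aligned = []
        for m in matches:
            for s in getter_spans:
                if overlaps(m, s):
                    aligned.append(s)
                    break
        return aligned
    raise ValueError(f"unknown merge mode: {mode}")


matches = [(0, 5), (10, 14), (20, 25)]
getter_spans = [(0, 8), (12, 18)]
print(merge(matches, getter_spans, "intersect"))  # [(0, 5)]
print(merge(matches, getter_spans, "align"))      # [(0, 8), (12, 18)]
```

In this model, `intersect` keeps the matcher's own boundaries, while `align` trusts the boundaries of the spans supplied through `span_getter`.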

Authors and citation

The eds.quantities pipeline was developed by AP-HP's Data Science team.