Quantities

The eds.quantities matcher detects and normalizes numerical quantities within a medical document.

Warning

The quantities pipeline is still in active development and has not been rigorously validated. If you come across a quantity expression that goes undetected, please file an issue.

Pipe definition

text = """Poids : 65. Taille : 1.75
          On mesure ... à 3mmol/l ; pression : 100mPa-110mPa.
          Acte réalisé par ... à 12h13"""
import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.sentences")
nlp.add_pipe("eds.tables")
nlp.add_pipe(
    "eds.quantities",
    config=dict(
        quantities="all",  # (3)
        extract_ranges=True,  # (1)
        use_tables=True,  # (2)
    ),
)
nlp(text).spans["quantities"]
# Out: [65, 1.75, 3mmol/l, 100mPa-110mPa, 12h13]
  1. Extracts ranges such as 100-110mg, 2 à 4 jours ...
  2. If True, eds.tables must be called
  3. All units listed under Availability will be detected
import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.sentences")
nlp.add_pipe("eds.tables")
nlp.add_pipe(
    "eds.quantities",
    config=dict(
        quantities={
            "concentration": {"unit": "mol_per_l"},
            "pressure": {"unit": "Pa"},
        },  # (3)
        extract_ranges=True,  # (1)
        use_tables=True,  # (2)
    ),
)
nlp(text).spans["quantities"]
# Out: [3mmol/l, 100mPa-110mPa]
  1. Extracts ranges such as 100-110mg, 2 à 4 jours ...
  2. If True, eds.tables must be called
  3. Which units are available? See Availability. More on customization? See Customization.
import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.sentences")
nlp.add_pipe("eds.tables")
nlp.add_pipe(
    "eds.quantities",
    config=dict(
        quantities=["weight", "size"],  # (3)
        extract_ranges=True,  # (1)
        use_tables=True,  # (2)
    ),
)
nlp(text).spans["quantities"]
# Out: [65, 1.75]
  1. Extracts ranges such as 100-110mg, 2 à 4 jours ...
  2. If True, eds.tables must be called
  3. Which quantities are available? See Availability.

Scope

The eds.quantities matcher can extract simple quantities (e.g. 3cm). It can also detect elliptic enumerations of quantities of the same type (e.g. 32, 33 et 34kg) and split them into separate quantities.
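The splitting of an elliptic enumeration can be illustrated with a minimal, standard-library sketch. This is not the eds.quantities implementation: the NUMBER, SEPARATOR and ENUMERATION patterns and the split_enumeration helper are hypothetical, and assume numbers separated by "," or "et" and followed by a single shared unit.

```python
import re

# Hypothetical, simplified patterns (not the ones used by eds.quantities).
NUMBER = r"\d+(?:[.,]\d+)?"
SEPARATOR = r"\s*(?:,|et)\s+"
ENUMERATION = re.compile(rf"({NUMBER}(?:{SEPARATOR}{NUMBER})*)\s*([a-zA-Z]+)")


def split_enumeration(text):
    """Return one (value, unit) pair per number of the first enumeration found."""
    match = ENUMERATION.search(text)
    if match is None:
        return []
    numbers, unit = match.groups()
    # Every number of the enumeration inherits the single trailing unit.
    return [(float(n.replace(",", ".")), unit) for n in re.findall(NUMBER, numbers)]


print(split_enumeration("Le patient pèse 32, 33 et 34kg"))
# [(32.0, 'kg'), (33.0, 'kg'), (34.0, 'kg')]
```

The separator requires whitespace after the comma, so a decimal comma such as "1,2" is read as one number rather than as two list items.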

The normalized value can then be accessed via the span._.{measure_name} attribute, for instance span._.size or span._.weight, and converted on the fly to the desired unit. As with other components, the span._.value extension also gives access to the normalized value of any quantity span.
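To give an intuition of what "converted on the fly" means, here is a small sketch of attribute-based unit conversion. The SimpleQuantity class and its conversion table are hypothetical stand-ins written for this illustration; they are not the objects returned by eds.quantities.

```python
class SimpleQuantity:
    """Hypothetical stand-in for a normalized size; not the EDS-NLP class."""

    # Assumed conversion table: number of millimeters in each unit.
    MILLIMETERS_PER_UNIT = {"m": 1000, "cm": 10, "mm": 1}

    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

    def __getattr__(self, target_unit):
        # Only called when normal attribute lookup fails, e.g. for `.cm`.
        table = type(self).MILLIMETERS_PER_UNIT
        if target_unit not in table:
            raise AttributeError(target_unit)
        return self.value * table[self.unit] / table[target_unit]

    def __str__(self):
        return f"{self.value} {self.unit}"


size = SimpleQuantity(1.78, "m")
print(str(size))  # 1.78 m
print(size.cm)  # the same size expressed in centimeters
```

Resolving the target unit through attribute access keeps the stored value in its original unit and only converts when a unit is requested.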

See the Availability section for details on which units are handled.

Examples

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.quantities(
        quantities=["size", "weight", "bmi"],
        extract_ranges=True,
    ),
)

text = """
Le patient est admis hier, fait 1m78 pour 76kg.
Les deux nodules bénins sont larges de 1,2 et 2.4mm.
BMI: 24.

Le nodule fait entre 1 et 1.5 cm
"""

doc = nlp(text)

quantities = doc.spans["quantities"]

quantities
# Out: [1m78, 76kg, 1,2, 2.4mm, 24, entre 1 et 1.5 cm]

quantities[0]
# Out: 1m78

str(quantities[0]._.size), str(quantities[0]._.value)
# Out: ('1.78 m', '1.78 m')

quantities[0]._.value.cm
# Out: 178.0

quantities[2]
# Out: 1,2

str(quantities[2]._.value)
# Out: '1.2 mm'

str(quantities[2]._.value.mm)
# Out: '1.2'

quantities[4]
# Out: 24

str(quantities[4]._.value)
# Out: '24 kg_per_m2'

str(quantities[4]._.value.kg_per_m2)
# Out: '24'

str(quantities[5]._.value)
# Out: '1-1.5 cm'

To extract all sizes in centimeters, and average range quantities, you can use the following snippet:

sizes = [
    sum(item.cm for item in m._.value) / len(m._.value)
    for m in doc.spans["quantities"]
    if m.label_ == "size"
]
sizes
# Out: [178.0, 0.12, 0.24, 1.25]
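As a rough illustration of how a range expression such as "entre 1 et 1.5 cm" yields two bounds that can then be averaged, the sketch below parses ranges with the standard library only. The parse_range helper and the three (FROM, TO) pairs are assumptions made for this example; they are not the defaults of the range_patterns parameter.

```python
import re

NUMBER = r"\d+(?:[.,]\d+)?"
# Illustrative (FROM, TO) pairs, assumed for this sketch.
RANGE_PATTERNS = [("de", "à"), ("entre", "et"), (None, "-")]


def parse_range(text):
    """Return (low, high) for the first range pattern found, else None."""
    for start_term, end_term in RANGE_PATTERNS:
        # The FROM term is optional, e.g. for "100-110mg".
        prefix = rf"\b{re.escape(start_term)}\s+" if start_term else ""
        pattern = rf"{prefix}({NUMBER})\s*{re.escape(end_term)}\s*({NUMBER})"
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            low, high = (float(g.replace(",", ".")) for g in match.groups())
            return low, high
    return None


bounds = parse_range("Le nodule fait entre 1 et 1.5 cm")
print(bounds, sum(bounds) / len(bounds))
# (1.0, 1.5) 1.25
```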

To extract the quantities from many texts, you can use the following snippet:

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.quantities(quantities="weight", extract_ranges=True, as_ents=True),
)
texts = ["Le patient mesure 40000,0 g (aussi noté 40 kg)"]
docs = edsnlp.data.from_iterable(texts)
docs = docs.map_pipeline(nlp)
docs.to_pandas(
    converter="ents",
    span_attributes={"value.unit": "original_unit", "value.kg": "kg"},
)
#   note_id  start  end   label lexical_variant span_type original_unit    kg
# 0    None     18   27  weight       40000,0 g      ents             g  40.0
# 1    None     40   45  weight           40 kg      ents            kg  40.0

Available units and quantities

Feel free to propose any missing raw unit or predefined quantity.

Raw units, their derivations (g, mg, mgr ...) and their compositions (g/ml, cac/j ...) can all be detected.

Available raw units:

g, m, m2, m3, mol, ui, Pa, %, log, mmHg, s/min/h/d/w/m/y, arc-second, °, °C, cac, goutte, l, x10*4, x10*5

Available predefined quantities:
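One way to picture unit composition is to split a composed unit on its divisor and combine the scales of its parts. The sketch below does this with the standard library; the SCALES table and the composed_scale helper are hypothetical, while the two divisors mirror the documented default of the unit_divisors parameter.

```python
import re

# Assumed scale factors relative to grams and liters (illustrative only).
SCALES = {"g": 1.0, "mg": 1e-3, "l": 1.0, "ml": 1e-3}
# Terms separating a numerator unit from a denominator unit.
UNIT_DIVISORS = ["/", "par"]


def composed_scale(unit_text):
    """Scale of a composed unit such as "mg/l", relative to "g/l"."""
    # Word divisors need surrounding spaces, symbols do not.
    alternatives = [
        rf"\s+{re.escape(d)}\s+" if d.isalpha() else rf"\s*{re.escape(d)}\s*"
        for d in UNIT_DIVISORS
    ]
    numerator, *denominators = re.split("|".join(alternatives), unit_text.strip())
    scale = SCALES[numerator]
    for unit in denominators:
        scale /= SCALES[unit]
    return scale


print(composed_scale("mg/l"))  # 0.001
print(composed_scale("mg par ml"))  # 1.0
```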

quantity_name    Example
size             1m50, 1.50m...
weight           1kg, Poids : 65...
bmi              BMI: 24, 24 kg.m-2
volume           2 cac, 8ml...

See the patterns for the exhaustive definitions.

Customization

You can declare custom quantities by altering the patterns:

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.quantities(
        quantities={
            "my_custom_surface_quantity": {
                # This quantity unit is homogeneous to square meters
                "unit": "m2",
                # Handle cases like "surface: 1.8" (implied m2),
                # vs "surface: 50" (implied cm2)
                "unitless_patterns": [
                    {
                        "terms": ["surface", "aire"],
                        "ranges": [
                            {"unit": "m2", "min": 0, "max": 9},
                            {"unit": "cm2", "min": 10, "max": 100},
                        ],
                    }
                ],
            },
        }
    ),
)
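The way "ranges" disambiguate a unitless number can be sketched as a simple lookup. The resolve_unit helper below is hypothetical, and its boundary handling ("min" inclusive, "max" exclusive) is an assumption rather than documented behavior of eds.quantities.

```python
def resolve_unit(value, ranges):
    """Return the unit of the first range containing value, or None."""
    for candidate in ranges:
        lower = candidate.get("min", float("-inf"))
        upper = candidate.get("max", float("inf"))
        # Assumption: "min" is inclusive and "max" is exclusive.
        if lower <= value < upper:
            return candidate["unit"]
    return None


ranges = [
    {"unit": "m2", "min": 0, "max": 9},
    {"unit": "cm2", "min": 10, "max": 100},
]
print(resolve_unit(1.8, ranges), resolve_unit(50, ranges), resolve_unit(9.5, ranges))
# m2 cm2 None
```

With the configuration above, "surface: 1.8" would be read as square meters and "surface: 50" as square centimeters, while a value that falls in no range gets no implied unit.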

Extensions

The eds.quantities pipeline declares its extensions dynamically, depending on the quantities parameter: each quantity gets its own extension, and is assigned to a different span group.

Parameters

PARAMETER DESCRIPTION
nlp

The pipeline object

TYPE: PipelineProtocol

name

The name of the component.

TYPE: str DEFAULT: 'quantities'

quantities

A mapping from measure names to MsrConfig. Each measure's configuration has the following shape:

{
  # the unit (e.g. "kg"),
  "unit": str,
  "unitless_patterns": {
    # preceding trigger terms
    "terms": List[str],
    # unitless ranges -> unit patterns
    "ranges": List[
      {"min": int, "max": int, "unit": str},
      {"min": int, "unit": str},
      ...,
    ],
    ...
  }
}
Set quantities="all" to extract all raw quantities from the units_config file.

TYPE: Union[str, List[Union[str, MsrConfig]], Dict[str, MsrConfig]] DEFAULT: ['weight', 'size', 'bmi', 'volume']

number_terms

A mapping of numbers to their lexical variants

TYPE: Dict[str, List[str]] DEFAULT: {'0.125': ['⅛'], '0.16666666': ['⅙'], '0.2': ['...

stopwords

A list of stopwords that do not matter when placed between a unitless trigger and a number

TYPE: List[str] DEFAULT: ['par', 'sur', 'de', 'a', ',', 'et', '-', 'à']

unit_divisors

A list of terms used to divide two units (like: m / s)

TYPE: List[str] DEFAULT: ['/', 'par']

attr

Whether to match on the text ('TEXT') or on the normalized text ('NORM')

TYPE: str DEFAULT: NORM

ignore_excluded

Whether to exclude pollution patterns when matching in the text

TYPE: bool DEFAULT: True

compose_units

Whether to compose units (like "m/s" or "m.s-1")

TYPE: bool DEFAULT: True

extract_ranges

Whether to extract ranges (like "entre 1 et 2 cm")

TYPE: bool DEFAULT: False

range_patterns

A list of "{FROM} xx {TO} yy" patterns to match range quantities

TYPE: List[Tuple[Optional[str], Optional[str]]] DEFAULT: [('De', 'à'), ('De', 'a'), ('de', 'à'), ('de', ...

after_snippet_limit

Maximum distance, in words, within which a part of a quantity found after its number is linked to it

TYPE: int DEFAULT: 6

before_snippet_limit

Maximum distance, in words, within which a part of a quantity found before its number is linked to it

TYPE: int DEFAULT: 10

span_setter

How to set the spans in the document. By default, each quantity will be assigned to its own span group (using either the "name" field of the config, or the key if you passed a dict), and to the "quantities" group.

TYPE: Optional[SpanSetterArg] DEFAULT: None

span_getter

Where to look for quantities in the doc. By default, look in the whole doc. You can combine this with the merge_mode argument for interesting results.

TYPE: SpanGetterArg DEFAULT: None

merge_mode

How to merge matches with the spans from span_getter, if given:

  • intersect: return only the matches that fall in the span_getter spans
  • align: if a match overlaps a span from span_getter (e.g. a match extracted by a machine learning model), return the span_getter span instead, and assign all the parsed information (e.g. ._.value) to it. Otherwise, don't return the match.

TYPE: Literal['intersect', 'align'] DEFAULT: intersect
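The difference between the two merge modes can be sketched on plain (start, end) character offsets. The overlaps and merge helpers below are hypothetical and only mimic the behavior described above; in particular, reading "fall in" as "fully contained" is an assumption.

```python
def overlaps(a, b):
    """True when two (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge(matches, getter_spans, mode="intersect"):
    if mode == "intersect":
        # Assumption: "fall in" means fully contained in a span_getter span.
        return [
            m
            for m in matches
            if any(s[0] <= m[0] and m[1] <= s[1] for s in getter_spans)
        ]
    if mode == "align":
        # Replace each match by the first span_getter span it overlaps;
        # matches without any overlapping span are dropped.
        aligned = []
        for m in matches:
            overlapping = [s for s in getter_spans if overlaps(m, s)]
            if overlapping:
                aligned.append(overlapping[0])
        return aligned
    raise ValueError(f"Unknown merge mode: {mode}")


matches = [(18, 27), (40, 45)]
getter_spans = [(15, 30), (38, 43)]
print(merge(matches, getter_spans, "intersect"))  # [(18, 27)]
print(merge(matches, getter_spans, "align"))  # [(15, 30), (38, 43)]
```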

Authors and citation

The eds.quantities pipeline was developed by AP-HP's Data Science team.