Quantities
The eds.quantities matcher detects and normalizes numerical quantities within a medical document.
Warning
The quantities pipeline is still in active development and has not been rigorously validated. If you come across a quantity expression that goes undetected, please file an issue!
Pipe definition
text = """Poids : 65. Taille : 1.75
On mesure ... à 3mmol/l ; pression : 100mPa-110mPa.
Acte réalisé par ... à 12h13"""
import edsnlp
nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.sentences")
nlp.add_pipe("eds.tables")
nlp.add_pipe(
    "eds.quantities",
    config=dict(
        quantities="all",  # (3)
        extract_ranges=True,  # (1)
        use_tables=True,
    ),  # (2)
)
nlp(text).spans["quantities"]
# Out: [65, 1.75, 3mmol/l, 100mPa-110mPa, 12h13]
- (1) Ranges such as 100-110mg, 2 à 4 jours ...
- (2) If True, eds.tables must be called
- (3) All units from Availability will be detected
import edsnlp
nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.sentences")
nlp.add_pipe("eds.tables")
nlp.add_pipe(
    "eds.quantities",
    config=dict(
        quantities={
            "concentration": {"unit": "mol_per_l"},
            "pressure": {"unit": "Pa"},
        },  # (3)
        extract_ranges=True,  # (1)
        use_tables=True,
    ),  # (2)
)
nlp(text).spans["quantities"]
# Out: [3mmol/l, 100mPa-110mPa]
- (1) Ranges such as 100-110mg, 2 à 4 jours ...
- (2) If True, eds.tables must be called
- (3) Which units are available? See Availability. More on customization? See Customization
import edsnlp
nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.sentences")
nlp.add_pipe("eds.tables")
nlp.add_pipe(
    "eds.quantities",
    config=dict(
        quantities=["weight", "size"],  # (3)
        extract_ranges=True,  # (1)
        use_tables=True,
    ),  # (2)
)
nlp(text).spans["quantities"]
# Out: [65, 1.75]
- (1) Ranges such as 100-110mg, 2 à 4 jours ...
- (2) If True, eds.tables must be called
- (3) Which quantities are available? See Availability
Scope
The eds.quantities matcher can extract simple quantities (e.g. 3cm). It can also detect elliptic enumerations (e.g. 32, 33 et 34kg) of quantities of the same type and split the quantities accordingly.
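To make the idea of splitting an elliptic enumeration concrete, here is a minimal, standalone sketch. It does not reproduce how eds.quantities works internally: the regular expression, the `split_enumeration` helper and the rule that a decimal comma is never followed by a space are all assumptions made for illustration.

```python
import re

# A number with an optional decimal part written with "." or "," (French style).
NUMBER = r"\d+(?:[.,]\d+)?"
# Assumed shape: numbers separated by ", " or " et ", then one shared unit.
ENUMERATION = re.compile(
    rf"((?:{NUMBER}(?:\s*,\s+|\s+et\s+))+{NUMBER})\s*([a-zA-Z]+)"
)


def split_enumeration(text):
    """Return one (value, unit) pair per enumerated number, or [] if none."""
    match = ENUMERATION.search(text)
    if match is None:
        return []
    numbers, unit = match.groups()
    # Every number in the enumeration inherits the trailing unit.
    return [
        (float(value.replace(",", ".")), unit)
        for value in re.findall(NUMBER, numbers)
    ]


print(split_enumeration("32, 33 et 34kg"))
# [(32.0, 'kg'), (33.0, 'kg'), (34.0, 'kg')]
```

The same helper reads "1,2 et 2.4mm" as two values (1.2 and 2.4), because the comma in "1,2" is directly followed by a digit and is therefore treated as a decimal separator.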
The normalized value can then be accessed via the span._.{measure_name} attribute, for instance span._.size or span._.weight, and be converted on the fly to a desired unit. As with other components, the span._.value extension can also be used to access the normalized value for any quantity span.
See the Availability section for details on which units are handled.
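The on-the-fly conversion described above can be pictured with a small value object that resolves unit names as attributes. This is only a sketch under stated assumptions: `SimpleLength` and `LENGTH_SCALES` are hypothetical names, not part of EDS-NLP, and the real value objects may be implemented differently.

```python
# Assumed conversion table: size of each unit expressed in meters.
LENGTH_SCALES = {"m": 1.0, "dm": 0.1, "cm": 0.01, "mm": 0.001}


class SimpleLength:
    """Hypothetical stand-in for a normalized size value."""

    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

    def __getattr__(self, target_unit):
        # Only reached for attributes not found normally, e.g. `.cm`.
        try:
            target_scale = LENGTH_SCALES[target_unit]
        except KeyError:
            raise AttributeError(target_unit) from None
        return self.value * LENGTH_SCALES[self.unit] / target_scale

    def __str__(self):
        return f"{self.value} {self.unit}"


height = SimpleLength(1.78, "m")
print(str(height))  # 1.78 m
print(height.cm)    # the same height expressed in centimeters
```

Resolving units through attribute access keeps the stored value and unit untouched, so the same object can be read in several units without any explicit conversion call.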
Examples
import edsnlp, edsnlp.pipes as eds
nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.quantities(
        quantities=["size", "weight", "bmi"],
        extract_ranges=True,
    ),
)
text = """
Le patient est admis hier, fait 1m78 pour 76kg.
Les deux nodules bénins sont larges de 1,2 et 2.4mm.
BMI: 24.
Le nodule fait entre 1 et 1.5 cm
"""
doc = nlp(text)
quantities = doc.spans["quantities"]
quantities
# Out: [1m78, 76kg, 1,2, 2.4mm, 24, entre 1 et 1.5 cm]
quantities[0]
# Out: 1m78
str(quantities[0]._.size), str(quantities[0]._.value)
# Out: ('1.78 m', '1.78 m')
quantities[0]._.value.cm
# Out: 178.0
quantities[2]
# Out: 1,2
str(quantities[2]._.value)
# Out: '1.2 mm'
str(quantities[2]._.value.mm)
# Out: 1.2
quantities[4]
# Out: 24
str(quantities[4]._.value)
# Out: '24 kg_per_m2'
str(quantities[4]._.value.kg_per_m2)
# Out: 24
str(quantities[5]._.value)
# Out: 1-1.5 cm
To extract all sizes in centimeters, and average range quantities, you can use the following snippet:
sizes = [
    sum(item.cm for item in m._.value) / len(m._.value)
    for m in doc.spans["quantities"]
    if m.label_ == "size"
]
sizes
# Out: [178.0, 0.12, 0.24, 1.25]
To extract the quantities from many texts, you can use the following snippet:
import edsnlp, edsnlp.pipes as eds
nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.quantities(quantities="weight", extract_ranges=True, as_ents=True),
)
texts = ["Le patient mesure 40000,0 g (aussi noté 40 kg)"]
docs = edsnlp.data.from_iterable(texts)
docs = docs.map_pipeline(nlp)
docs.to_pandas(
    converter="ents",
    span_attributes={"value.unit": "original_unit", "value.kg": "kg"},
)
# note_id start end label lexical_variant span_type original_unit kg
# 0 None 18 27 weight 40000,0 g ents g 40.0
# 1 None 40 45 weight 40 kg ents kg 40.0
Available units and quantities
Feel free to propose any missing raw unit or predefined quantity.
Raw units and their derivations (g, mg, mgr ...) and their compositions (g/ml, cac/j ...) can be detected.
Available raw units:
g, m, m2, m3, mol, ui, Pa, %, log, mmHg, s/min/h/d/w/m/y, arc-second, °, °C, cac, goutte, l, x10*4, x10*5
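As a rough illustration of what "derivations" and "compositions" of raw units mean, the sketch below resolves a prefixed unit (mg, mmol) to its base unit and scale, and combines two units separated by "/". The prefix table, the small set of base units and the `parse_unit` helper are assumptions chosen for the example; they cover only a fraction of the units listed above and are not the library's actual configuration.

```python
# Assumed metric prefixes and a deliberately small set of base units.
PREFIXES = {"k": 1e3, "c": 1e-2, "m": 1e-3}
BASE_UNITS = {"g", "m", "mol", "l"}


def parse_simple_unit(token):
    """Return (base_unit, scale) for a token such as "mg" or "mol"."""
    if token in BASE_UNITS:
        return token, 1.0
    prefix, base = token[:1], token[1:]
    if prefix in PREFIXES and base in BASE_UNITS:
        return base, PREFIXES[prefix]
    raise ValueError(f"Unknown unit: {token!r}")


def parse_unit(text):
    """Return (normalized_name, scale) for a simple or composed unit."""
    parts = [parse_simple_unit(token) for token in text.split("/")]
    name = "_per_".join(base for base, _ in parts)
    scale = parts[0][1]
    # Each unit after a "/" divides the scale of the numerator.
    for _, divisor_scale in parts[1:]:
        scale /= divisor_scale
    return name, scale


print(parse_unit("mg"))      # ('g', 0.001)
print(parse_unit("mmol/l"))  # ('mol_per_l', 0.001)
```

Checking the base units before the prefixes is what lets "mol" and "m" be read as plain units while "mmol" and "mm" are read as prefixed ones.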
Available predefined quantities:
quantity_name | Example |
---|---|
size | 1m50, 1.50m ... |
weight | 1kg, Poids : 65 ... |
bmi | BMI: 24, 24 kg.m-2 |
volume | 2 cac, 8ml ... |
See the patterns for an exhaustive definition.
Customization
You can declare custom quantities by altering the patterns:
import edsnlp, edsnlp.pipes as eds
nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.quantities(
        quantities={
            "my_custom_surface_quantity": {
                # This quantity unit is homogeneous to square meters
                "unit": "m2",
                # Handle cases like "surface: 1.8" (implied m2),
                # vs "surface: 50" (implied cm2)
                "unitless_patterns": [
                    {
                        "terms": ["surface", "aire"],
                        "ranges": [
                            {"unit": "m2", "min": 0, "max": 9},
                            {"unit": "cm2", "min": 10, "max": 100},
                        ],
                    }
                ],
            },
        }
    ),
)
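The intent of the unitless_patterns entry above can be sketched in plain Python: when a trigger term is followed by a bare number, the unit is chosen from the range the number falls into. The `infer_unit` helper is hypothetical, and treating both bounds as inclusive is an assumption; the component's actual boundary handling may differ.

```python
# Same structure as the "unitless_patterns" entry in the configuration above.
UNITLESS_PATTERN = {
    "terms": ["surface", "aire"],
    "ranges": [
        {"unit": "m2", "min": 0, "max": 9},
        {"unit": "cm2", "min": 10, "max": 100},
    ],
}


def infer_unit(term, value, pattern=UNITLESS_PATTERN):
    """Return the implied unit for a bare number following a trigger term."""
    if term.lower() not in pattern["terms"]:
        return None
    for candidate in pattern["ranges"]:
        # Assumption: both bounds are inclusive.
        if candidate["min"] <= value <= candidate["max"]:
            return candidate["unit"]
    return None


print(infer_unit("surface", 1.8))  # m2
print(infer_unit("surface", 50))   # cm2
```

Values outside every declared range (for instance 500) yield no unit in this sketch, which is one way to avoid guessing when a number does not look plausible for the quantity.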
Extensions
The eds.quantities pipeline declares its extensions dynamically, depending on the quantities parameter: each quantity gets its own extension, and is assigned to a different span group.
Parameters
PARAMETER | DESCRIPTION |
---|---|
nlp | The pipeline object TYPE: |
name | The name of the component. TYPE: |
quantities | A mapping from measure names to MsrConfig. The Customization example shows the shape of a measure's configuration. Use quantities="all" to extract all raw quantities from the units_config file. TYPE: |
number_terms | A mapping of numbers to their lexical variants TYPE: |
stopwords | A list of stopwords that do not matter when placed between a unitless trigger and a number TYPE: |
unit_divisors | A list of terms used to divide two units (like: m / s) TYPE: |
attr | Whether to match on the text ('TEXT') or on the normalized text ('NORM') TYPE: |
ignore_excluded | Whether to exclude pollution patterns when matching in the text TYPE: |
compose_units | Whether to compose units (like "m/s" or "m.s-1") TYPE: |
extract_ranges | Whether to extract ranges (like "entre 1 et 2 cm") TYPE: |
range_patterns | A list of "{FROM} xx {TO} yy" patterns to match range quantities TYPE: |
after_snippet_limit | Maximum word distance after the number within which a part of a quantity can be linked to it TYPE: |
before_snippet_limit | Maximum word distance before the number within which a part of a quantity can be linked to it TYPE: |
span_setter | How to set the spans in the document. By default, each quantity will be assigned to its own span group (using either the "name" field of the config, or the key if you passed a dict), and to the "quantities" group. TYPE: |
span_getter | Where to look for quantities in the doc. By default, look in the whole doc. You can combine this with the TYPE: |
merge_mode | How to merge matches with the spans from
TYPE: |
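To illustrate the "{FROM} xx {TO} yy" idea behind the range_patterns parameter, here is a standalone sketch based on regular expressions. The word pairs in `RANGE_PATTERNS` and the `find_range` helper are hypothetical and chosen to cover the ranges quoted on this page; they are not the component's default values or its matching logic.

```python
import re

# Hypothetical (FROM, TO) word pairs; None means "no leading word".
RANGE_PATTERNS = [("entre", "et"), ("de", "à"), (None, "à"), (None, "-")]
NUMBER = r"\d+(?:[.,]\d+)?"


def to_float(number):
    """Parse a number written with a decimal point or a decimal comma."""
    return float(number.replace(",", "."))


def find_range(text):
    """Return (low, high, unit) for the first range found, else None."""
    for from_word, to_word in RANGE_PATTERNS:
        prefix = rf"\b{re.escape(from_word)}\s+" if from_word else ""
        regex = (
            prefix
            + rf"({NUMBER})\s*{re.escape(to_word)}\s*({NUMBER})"
            + r"\s*([^\W\d]+)"  # the unit: a run of letters
        )
        match = re.search(regex, text)
        if match:
            low, high, unit = match.groups()
            return to_float(low), to_float(high), unit
    return None


print(find_range("Le nodule fait entre 1 et 1.5 cm"))  # (1.0, 1.5, 'cm')
print(find_range("100-110mg"))                         # (100.0, 110.0, 'mg')
```

Patterns with an explicit leading word are tried first, so "entre 1 et 1.5 cm" is matched as a whole rather than through a looser pattern.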
Authors and citation
The eds.quantities pipeline was developed by AP-HP's Data Science team.