Tobacco consumption

The eds.tobacco pipeline component extracts mentions of tobacco consumption.

Details of the patterns used
# fmt: off
PA = r"(?:\bp/?a\b|paquets?.?annee)"
QUANTITY = r"(?P<quantity>[\d]{1,3})"
PUNCT = r"\.,-;\(\)"

default_patterns = [
    dict(
        source="tobacco",
        regex=[
            r"tabagi",
            r"tabac",
            r"\bfume\b",
            r"\bfumeu",
            r"\bpipes?\b",
        ],
        exclude=dict(
            regex=[
                "occasion",
                "moder",
                "quelqu",
                "festi",
                "rare",
                "sujet",  # Example : Chez le sujet fumeur ... generic sentences
            ],
            window=(-3, 5),
        ),
        regex_attr="NORM",
        assign=[
            dict(
                name="stopped",
                regex=r"(\bex\b|sevr|arret|stop|ancien)",
                window=(-3, 15),
                reduce_mode="keep_first",
            ),
            dict(
                name="zero_after",
                regex=r"(?=^[a-z]*\s*:?[\s-]*(0|non|aucun|jamais))",
                window=3,
                reduce_mode="keep_first",
            ),
            dict(
                name="PA",
                regex=rf"{QUANTITY}[^{PUNCT}]{{0,10}}{PA}|{PA}[^{PUNCT}]{{0,10}}{QUANTITY}",
                window=(-10, 10),
                reduce_mode="keep_first",
            ),
            dict(
                name="secondhand",
                regex="(passif)",
                window=5,
                reduce_mode="keep_first",
            ),
        ],
    )
]
# fmt: on
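
To make the PA pattern more concrete, here is a minimal sketch that applies it to already-normalized strings (lowercase, no accents, as implied by regex_attr="NORM") using only the standard-library re module. It is an illustration under stated assumptions, not the component's implementation: the names BRANCHES and extract_pack_years are hypothetical, the token-based window and reduce_mode handling are not reproduced, and the two alternatives are compiled separately because the standard re module rejects a group name used twice in one pattern.

```python
import re
from typing import Optional

# Constants copied from the pattern definition above.
PA = r"(?:\bp/?a\b|paquets?.?annee)"
QUANTITY = r"(?P<quantity>[\d]{1,3})"
PUNCT = r"\.,-;\(\)"

# Hypothetical helper structure: one compiled pattern per alternative,
# since stdlib `re` does not allow the `quantity` group name twice.
BRANCHES = [
    re.compile(rf"{QUANTITY}[^{PUNCT}]{{0,10}}{PA}"),  # e.g. "15 pa"
    re.compile(rf"{PA}[^{PUNCT}]{{0,10}}{QUANTITY}"),  # e.g. "pa = 20"
]


def extract_pack_years(normalized_text: str) -> Optional[int]:
    """Return the first pack-year quantity found, or None (illustrative only)."""
    for pattern in BRANCHES:
        match = pattern.search(normalized_text)
        if match:
            return int(match.group("quantity"))
    return None


print(extract_pack_years("tabagisme evalue a 15 pa"))  # 15
print(extract_pack_years("tabac 40 paquets-annee"))    # 40
print(extract_pack_years("pa = 20"))                   # 20
print(extract_pack_years("patient tabagique"))         # None
```

The quantity must sit within ten characters of the pack-year marker, and none of the characters in between may belong to the PUNCT character class.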

Extensions

Each matched span (called span below) exposes the following attributes:

  • span._.detailed_status: "ABSTINENCE" if the patient has stopped smoking, None otherwise
  • span._.assigned: dictionary with the following keys, if relevant:
    • PA: the number of pack-years mentioned (French: paquet-année)
    • secondhand: present if the mention refers to secondhand smoking
  • span._.negation: set to True when either
    • A pack-year value of 0 is extracted
    • A mention such as "tabac: 0" is found
    • The patient experiences secondhand smoking
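
The rules above can be summarised in a few lines of plain Python. This is only a sketch of the documented behaviour, not the component's actual code: derive_flags is a hypothetical helper, and it assumes the assigned dictionary uses the keys shown in the pattern definition (stopped, zero_after, PA, secondhand) with PA already converted to an integer.

```python
from typing import Any, Dict, Optional, Tuple


def derive_flags(assigned: Dict[str, Any]) -> Tuple[Optional[str], bool]:
    """Return (detailed_status, negation) from a hypothetical `assigned` dict."""
    # "ABSTINENCE" when a cessation cue (ex, sevré, arrêt, ...) was assigned.
    detailed_status = "ABSTINENCE" if "stopped" in assigned else None
    # Negation follows the three documented cases.
    negation = (
        assigned.get("PA") == 0       # a pack-year value of 0
        or "zero_after" in assigned   # a mention such as "tabac: 0"
        or "secondhand" in assigned   # secondhand smoking
    )
    return detailed_status, negation


print(derive_flags({"PA": 15}))                # (None, False)
print(derive_flags({"stopped": "ancien"}))     # ('ABSTINENCE', False)
print(derive_flags({"zero_after": ["0"]}))     # (None, True)
print(derive_flags({"secondhand": "passif"}))  # (None, True)
```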

Use qualifiers!

Although the tobacco pipe sometimes sets a value for the negation attribute, generic qualifiers should still be applied after the pipe.

Examples

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(
    eds.normalizer(
        accents=True,
        lowercase=True,
        quotes=True,
        spaces=True,
        pollution=dict(
            information=True,
            bars=True,
            biology=True,
            doctors=True,
            web=True,
            coding=True,
            footer=True,
        ),
    ),
)
nlp.add_pipe(eds.tobacco())

Below are a few examples:

text = "Tabagisme évalué à 15 PA"
doc = nlp(text)
spans = doc.spans["tobacco"]

spans
# Out: [Tabagisme évalué à 15 PA]

span = spans[0]

span._.assigned
# Out: {'PA': 15}

text = "Patient tabagique"
doc = nlp(text)
spans = doc.spans["tobacco"]

spans
# Out: [tabagique]

text = "Tabagisme festif"
doc = nlp(text)
spans = doc.spans["tobacco"]

spans
# Out: []

text = "On a un tabagisme ancien"
doc = nlp(text)
spans = doc.spans["tobacco"]

spans
# Out: [tabagisme ancien]

span = spans[0]

span._.detailed_status
# Out: ABSTINENCE

span._.assigned
# Out: {'stopped': ancien}

text = "Tabac: 0"
doc = nlp(text)
spans = doc.spans["tobacco"]

spans
# Out: [Tabac: 0]

span = spans[0]

span._.detailed_status
# Out: None

span._.negation
# Out: True

span._.assigned
# Out: {'zero_after': [0]}

text = "Tabagisme passif"
doc = nlp(text)
spans = doc.spans["tobacco"]

spans
# Out: [Tabagisme passif]

span = spans[0]

span._.detailed_status
# Out: None

span._.negation
# Out: True

span._.assigned
# Out: {'secondhand': passif}

text = "Tabac: sevré depuis 5 ans"
doc = nlp(text)
spans = doc.spans["tobacco"]

spans
# Out: [Tabac: sevré]

span = spans[0]

span._.detailed_status
# Out: ABSTINENCE

span._.assigned
# Out: {'stopped': sevré}

Parameters

  • nlp: the pipeline object
    • TYPE: Optional[PipelineProtocol]
  • name: the name of the component
    • TYPE: Optional[str]
  • patterns: the patterns to use for matching
    • DEFAULT: [{'source': 'tobacco', 'regex': ['tabagi', 'tab...
  • label: the label to use for the Span object and the extension
    • TYPE: str
    • DEFAULT: tobacco
  • span_setter: how to set matches on the doc
    • TYPE: SpanSetterArg
    • DEFAULT: {'ents': True, 'tobacco': True}

Authors and citation

The eds.tobacco component was developed by AP-HP's Data Science team with a team of medical experts, following the insights of the algorithm proposed by Petit-Jean et al., 2024.


  1. Petit-Jean T., Gérardin C., Berthelot E., Chatellier G., Frank M., Tannier X., Kempf E. and Bey R., 2024. Collaborative and privacy-enhancing workflows on a clinical data warehouse: an example developing natural language processing pipelines to detect medical conditions. Journal of the American Medical Informatics Association. 31, pp.1280-1290. 10.1093/jamia/ocae069