Solid tumor

The eds.solid_tumor pipeline component extracts mentions of solid tumors, using the regular-expression patterns detailed below.

Details of the patterns used
# fmt: off
BENINE = r"benign|benin|(grade.?\b[i1]\b)"
STAGE = r"stade ([^\s]*)"

main_pattern = dict(
    source="main",
    regex=[
        r"carcinom(?!.{0,10}in.?situ)",
        r"seminome",
        r"(?<!lympho)(?<!lympho-)sarcome",
        r"blastome",
        r"cancer([^o]|\s|\b)",
        r"adamantinome",
        r"chordome",
        r"craniopharyngiome",
        r"melanome",
        r"neoplasie",
        r"neoplasme",
        r"linite",
        r"melanome",
        r"mesoteliome",
        r"mesotheliome",
        r"seminome",
        r"myxome",
        r"paragangliome",
        r"craniopharyngiome",
        r"k .{0,5}(prostate|sein)",
        r"pancoast.?tobias",
        r"syndrome.{1,10}lynch",
        r"li.?fraumeni",
        r"germinome",
        r"adeno[\s-]?k",
        r"thymome",
        r"\bnut\b",
        r"\bgist\b",
        r"\bchc\b",
        r"\badk\b",
        r"\btves\b",
        r"\btv.tves\b",
        r"lesion.{1,20}tumor",
        r"tumeur",
        r"carcinoid",
        r"histiocytome",
        r"ependymome",
        # r"primitif", too many false positives
    ],
    exclude=dict(
        regex=BENINE,
        window=(0, 5),
    ),
    regex_attr="NORM",
    assign=[
        dict(
            name="metastasis",
            regex=r"(metasta|multinodul)",
            window=(-3, 7),
            reduce_mode="keep_last",
        ),
        dict(
            name="stage",
            regex=STAGE,
            window=7,
            reduce_mode="keep_last",
        ),
    ],
)
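To see how the lookaround constructs in the main patterns behave, the following minimal sketch applies three of them with Python's standard `re` module. It assumes plain `re.search` semantics on strings that are already lowercase and accent-free; the component itself matches on the `NORM` attribute of the document, so this only illustrates the regexes, not the matcher. The sample strings are made up for illustration.

```python
import re

# Three patterns copied from main_pattern above.
CARCINOMA = re.compile(r"carcinom(?!.{0,10}in.?situ)")
SARCOMA = re.compile(r"(?<!lympho)(?<!lympho-)sarcome")
CANCER = re.compile(r"cancer([^o]|\s|\b)")

# Illustrative, already-normalized inputs (assumption: lowercase, no accents).
samples = {
    "carcinome hepatique": CARCINOMA,
    "carcinome in situ": CARCINOMA,   # rejected by the negative lookahead
    "osteosarcome": SARCOMA,
    "lymphosarcome": SARCOMA,         # rejected by the first lookbehind
    "lympho-sarcome": SARCOMA,        # rejected by the second lookbehind
    "cancer du sein": CANCER,
    "cancerologue": CANCER,           # "cancer" followed by "o" is rejected
}

results = {text: bool(pattern.search(text)) for text, pattern in samples.items()}
for text, matched in results.items():
    print(f"{text!r}: {'match' if matched else 'no match'}")
```

The two separate lookbehinds are needed because Python's `re` only accepts fixed-width lookbehind assertions, so `lympho` and `lympho-` cannot be merged into a single optional-hyphen assertion.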

metastasis_pattern = dict(
    source="metastasis",
    regex=[
        r"cellule.{1,5}tumorale.{1,5}circulantes",
        r"metasta",
        r"multinodul",
        r"carcinose",
        r"ruptures.{1,5}corticale",
        r"envahissement.{0,15}parties\smolle",
        r"(localisation|lesion)s?.{0,20}second",
        r"(lymphangite|meningite).{1,5}carcinomateuse",
    ],
    regex_attr="NORM",
    exclude=dict(
        regex=r"goitre",
        window=-3,
    ),
)
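Because these patterns set `regex_attr="NORM"`, they are written without accents or capitals. The sketch below shows why that matters, using a rough stand-in for normalization built on the standard library. The `rough_norm` helper is hypothetical and far simpler than what `eds.normalizer` does; it only lowercases and strips accents.

```python
import re
import unicodedata


def rough_norm(text: str) -> str:
    """Hypothetical stand-in for the NORM attribute: lowercase, strip accents."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Pattern copied from metastasis_pattern above.
SECONDARY_LESION = re.compile(r"(localisation|lesion)s?.{0,20}second")

raw = "Présence de nombreuses lésions secondaires"
normalized = rough_norm(raw)

# The accent-free pattern misses the raw text but hits the normalized one.
match_on_raw = SECONDARY_LESION.search(raw.lower())
match_on_norm = SECONDARY_LESION.search(normalized)

print(normalized)
print(match_on_raw)
print(match_on_norm.group(0) if match_on_norm else None)
```

The raw regex hit covers only part of the last word; the span the component reports for this sentence is the whole `lésions secondaires`.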

# Patterns developed for CT-Scan reports
metastasis_ct_scan = dict(
    source="metastasis_ct_scan",
    regex=[
        r"(?i)(m[ée]tasta(se|tique)s?)",
        r"(diss[ée]min[ée]e?s?)",
        r"(carcinose)",
        r"(((allure|l[ée]sion|localisation|progression)s?\s)(suspecte?s?)?.{0,50}(secondaire)s?)",
        r"(l(a|â)ch(é|e|er)\sde\sballons?)",
        r"(l[ée]sions?\s(non\s)?cibles?)",
        r"(rupture.{1,20}corticale)",
        r"(envahissement.{0,15}parties\smolles)",
        r"((l[i,y]se).{1,20}os)|ost[eé]ol[i,y]|rupture.{1,20}corticale|envahissement.{1,20}parties\smolles|ost[eé]ocondensa.{1,20}(suspect|secondaire|[ée]volutive)",
        r"(l[ée]sion|anomalie|image).{1,20}os.{1,30}(suspect|secondaire|[ée]volutive)",
        r"os.{1,30}(l[ée]sion|anomalie|image).{1,20}(suspect|secondaire|[ée]volutive)",
        r"(l[ée]sion|anomalie|image).{1,20}l[i,y]tique",
        r"(l[ée]sion|anomalie|image).{1,20}condensant.{1,20}(suspect|secondaire|[ée]volutive)",
        r"fracture.{1,30}(suspect|secondaire|[ée]volutive)",
        r"((l[ée]sion|anomalie|image|nodule).{1,80}(secondaire))",
        r"((l[ée]sion|anomalie|image|nodule)s.{1,40}suspec?ts?)",
    ],
    regex_attr="NORM",
)

default_patterns = [
    main_pattern,
    metastasis_pattern,
]
# fmt: on
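The two helper regexes `BENINE` and `STAGE` defined at the top of the listing can be explored in isolation. This is a minimal sketch with the standard `re` module on hand-written, already-normalized strings; the component applies them within token windows around each match, which is not reproduced here.

```python
import re

# Helper regexes copied from the top of the pattern listing.
BENINE = re.compile(r"benign|benin|(grade.?\b[i1]\b)")
STAGE = re.compile(r"stade ([^\s]*)")

# BENINE flags benign wording, and grade I / grade 1 only.
benign_hits = {
    text: bool(BENINE.search(text))
    for text in ["tumeur benigne", "grade 1", "grade i", "grade ii", "grade 3"]
}

# STAGE captures whatever non-space characters follow "stade ".
stage_plain = STAGE.search("cancer du poumon au stade 4")
stage_with_punct = STAGE.search("stade iv, suivi en oncologie")

print(benign_hits)
print(stage_plain.group(1))
print(stage_with_punct.group(1))
```

Note that `[^\s]*` stops only at whitespace, so on a raw string any punctuation glued to the stage value is captured along with it.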

Extensions

On each matched span, the following attributes are available:

  • span._.detailed_status: set to either
    • "METASTASIS" for tumors at the metastatic stage
    • "LOCALIZED" otherwise
  • span._.assigned: dictionary with the following keys, if relevant:
    • stage: stage of the tumor

Examples

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(
    eds.normalizer(
        accents=True,
        lowercase=True,
        quotes=True,
        spaces=True,
        pollution=dict(
            information=True,
            bars=True,
            biology=True,
            doctors=True,
            web=True,
            coding=True,
            footer=True,
        ),
    ),
)
nlp.add_pipe(eds.solid_tumor())

Below are a few examples:

text = "Présence d'un carcinome intra-hépatique."
doc = nlp(text)
spans = doc.spans["solid_tumor"]

spans
# Out: [carcinome]

text = "Patient avec un K sein."
doc = nlp(text)
spans = doc.spans["solid_tumor"]

spans
# Out: [K sein]

text = "Il y a une tumeur bénigne"
doc = nlp(text)
spans = doc.spans["solid_tumor"]

spans
# Out: []

text = "Tumeur métastasée"
doc = nlp(text)
spans = doc.spans["solid_tumor"]

spans
# Out: [Tumeur métastasée]

span = spans[0]

span._.detailed_status
# Out: METASTASIS

span._.assigned
# Out: {'metastasis': métastasée}

text = "Cancer du poumon au stade 4"
doc = nlp(text)
spans = doc.spans["solid_tumor"]

spans
# Out: [Cancer du poumon au stade 4]

span = spans[0]

span._.detailed_status
# Out: METASTASIS

span._.assigned
# Out: {'stage': 4}

text = "Cancer du poumon au stade 2"
doc = nlp(text)
spans = doc.spans["solid_tumor"]

spans
# Out: [Cancer du poumon au stade 2]

span = spans[0]

span._.assigned
# Out: {'stage': 2}

text = "Présence de nombreuses lésions secondaires"
doc = nlp(text)
spans = doc.spans["solid_tumor"]

spans
# Out: [lésions secondaires]

span = spans[0]

span._.detailed_status
# Out: METASTASIS
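
When processing many documents, it can be convenient to flatten the matched spans into plain dictionaries. The helper below is a hypothetical sketch, not part of the library: it relies only on the `text`, `_.detailed_status` and `_.assigned` attributes shown in the examples above, and it assumes `_.assigned` may be empty or missing a value. A stand-in object is used so the helper can be tried without loading a pipeline.

```python
from types import SimpleNamespace


def summarize(spans):
    """Turn matched spans into plain dictionaries (hypothetical helper)."""
    rows = []
    for span in spans:
        assigned = span._.assigned or {}
        rows.append(
            {
                "text": span.text,
                "detailed_status": span._.detailed_status,
                # Assigned values may be spans, so store their string form.
                "assigned": {key: str(value) for key, value in assigned.items()},
            }
        )
    return rows


# Stand-in mimicking the attributes used above. With a real document,
# pass doc.spans["solid_tumor"] instead.
fake_span = SimpleNamespace(
    text="Cancer du poumon au stade 4",
    _=SimpleNamespace(detailed_status="METASTASIS", assigned={"stage": "4"}),
)
rows = summarize([fake_span])
print(rows)
```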

Parameters

PARAMETER DESCRIPTION
nlp

The pipeline

TYPE: Optional[PipelineProtocol]

name

The name of the component

TYPE: Optional[str]

patterns

The patterns to use for matching

DEFAULT: [{'source': 'main', 'regex': ['carcinom(?!.{0,1...

label

The label to use for the Span object and the extension

TYPE: str DEFAULT: solid_tumor

span_setter

How to set matches on the doc

TYPE: SpanSetterArg DEFAULT: {'ents': True, 'solid_tumor': True}

use_tnm

Whether to use TNM scores matching as well

TYPE: bool DEFAULT: False

use_patterns_metastasis_ct_scan

Whether to use the metastasis patterns developed for the CT-Scans

TYPE: bool DEFAULT: False
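
The two boolean options above are passed as keyword arguments when the component is created. The sketch below wraps this in a small function; the function name and its argument names are illustrative, and the normalizer is added with its default settings as an assumption (the Examples section configures it explicitly). `edsnlp` is imported inside the function so that the definition can be read and loaded without the library installed.

```python
def build_solid_tumor_pipeline(use_tnm=False, use_ct_scan_patterns=False):
    """Build a pipeline with eds.solid_tumor (illustrative helper)."""
    # Imported lazily: requires the edsnlp package to be installed.
    import edsnlp
    import edsnlp.pipes as eds

    nlp = edsnlp.blank("eds")
    nlp.add_pipe(eds.sentences())
    nlp.add_pipe(eds.normalizer())  # assumption: default normalizer settings
    nlp.add_pipe(
        eds.solid_tumor(
            use_tnm=use_tnm,
            use_patterns_metastasis_ct_scan=use_ct_scan_patterns,
        )
    )
    return nlp


# Example call (needs edsnlp):
# nlp = build_solid_tumor_pipeline(use_tnm=True, use_ct_scan_patterns=True)
# doc = nlp("Lésions osseuses d'allure secondaire")
# doc.spans["solid_tumor"]
```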

Authors and citation

The eds.solid_tumor component was developed by AP-HP's Data Science team with a team of medical experts, following the insights of the algorithm proposed by Petit-Jean et al., 2024 and Kempf et al., 2022.


  1. Petit-Jean T., Gérardin C., Berthelot E., Chatellier G., Frank M., Tannier X., Kempf E. and Bey R., 2024. Collaborative and privacy-enhancing workflows on a clinical data warehouse: an example developing natural language processing pipelines to detect medical conditions. Journal of the American Medical Informatics Association. 31, pp.1280-1290. 10.1093/jamia/ocae069

  2. Kempf E., Priou S., Lamé G., Daniel C., Bellamine A., Sommacale D., Belkacemi Y., Bey R., Galula G., Taright N., Tannier X., Rance B., Flicoteaux R., Hemery F., Audureau E., Chatellier G. and Tournigand C., 2022. Impact of two waves of Sars-Cov2 outbreak on the number, clinical presentation, care trajectories and survival of patients newly referred for a colorectal cancer: A French multicentric cohort study from a large group of University hospitals. International Journal of Cancer. 150, pp.1609-1618. 10.1002/ijc.33928