LLM Markup Extractor

The eds.llm_markup_extractor component extracts entities using a Large Language Model (LLM) by prompting it to annotate the text with a markup format (XML or Markdown). The component can be configured with a set of labels to extract, and can be provided with few-shot examples to improve performance.

In practice, the component sends the LLM a system prompt that describes the allowed labels and the annotation format, few-shot examples (if any), and then the document to annotate, such as:

Le patient a une néphropathie diabétique.

and expects in return the same text annotated with the entities, for instance:

Le patient a une <diag>néphropathie diabétique</diag>.

which is then parsed to extract the entities. This approach is close to that of Naguib et al., 2024, but supports various markup formats and multi-label prompts. See their paper for more details on the prompting strategies and their performance.
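
To make the parsing step concrete, here is a minimal sketch of how tagged output can be turned back into character spans. This is an illustration only, not the component's actual implementation, which additionally aligns the model's output with the original text to tolerate small deviations (see alignment_threshold below):

import re

TAG = re.compile(r"<(?P<label>\w+)>(?P<text>.*?)</(?P=label)>")

def parse_markup(annotated):
    # Strip the tags and record (start, end, label) offsets relative
    # to the untagged text. Nested tags are not handled in this sketch.
    clean, spans, last, pos = [], [], 0, 0
    for m in TAG.finditer(annotated):
        clean.append(annotated[last:m.start()])
        pos += m.start() - last
        spans.append((pos, pos + len(m["text"]), m["label"]))
        clean.append(m["text"])
        pos += len(m["text"])
        last = m.end()
    clean.append(annotated[last:])
    return "".join(clean), spans

text, spans = parse_markup("Le patient a une <diag>néphropathie diabétique</diag>.")
print(spans)
# Out: [(17, 40, 'diag')]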

Experimental

This component is experimental. The API and behavior may change in future versions. Make sure to pin your edsnlp version if you use it in a project.

Dependencies

This component requires several dependencies. Run the following command to install them:

pip install openai bm25s PyStemmer
We also recommend adding them to your pyproject.toml or requirements.txt.

Examples

If your data is sensitive, we recommend using a self-hosted model behind an OpenAI-compatible API, such as vLLM.

Start a server with the model of your choice:

python -m vllm.entrypoints.openai.api_server \
   --model mistral-small-24b-instruct-2501 \
   --port 8080 \
   --enable-prefix-caching
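
Before wiring the server into a pipeline, you can check that it is reachable by listing the models it serves. A small sketch using the openai client (vLLM does not check the API key unless configured to, so any placeholder value works):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
print([m.id for m in client.models.list()])
# Should list "mistral-small-24b-instruct-2501"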

You can then use the llm_markup_extractor component as follows:

import edsnlp, edsnlp.pipes as eds

prompt = """
You are an XML-based extraction assistant.
For every piece of text the user provides, you will rewrite the full text
word for word, adding XML tags around the relevant pieces of information.

You must follow these rules strictly:
- You must only use the provided tags. Do not invent new tags.
- You must follow the original text exactly: do not alter it, only add tags.
- You must always close every tag you open.
- If a piece of text does not contain any of the information to extract, you must return the text unchanged, without any tags.
- Be consistent in your answers, similar queries must lead to similar answers, do not try to fix your prior answers.
- Do not add any comment or explanation, just write the text with tags.

Example with a <noun_group> tag:
User query: "This is a sample document."
Assistant answer: "This is <noun_group>a sample document</noun_group>."

The tags to use are the following:
- <diag>: A medical diagnosis
- <treat>: A medical treatment
""".strip()

# EDS-NLP util to create documents from Markdown or XML markup.
# This has nothing to do with the LLM component itself.
conv = edsnlp.data.converters.MarkupToDocConverter(preset="xml")
train_docs = [  # (1)!
    conv("Le patient a une <diag>pneumonie</diag>."),
    conv("On prescrit l'<treat>antibiothérapie</treat>."),
    # ... add more examples if you can
]

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(
    eds.llm_markup_extractor(
        # OpenAI-compatible API like the local vLLM server above
        api_url="http://localhost:8080/v1",
        model="my-custom-model",
        examples=train_docs,
        # Apply the model to each sentence separately
        context_getter="sents",
        # String or function that returns a list of messages (see below)
        prompt=prompt,
        use_retriever=True,
        # For each request, show the model the closest example
        max_few_shot_examples=1,
        # Up to 5 requests in parallel
        max_concurrent_requests=5,
    )
)
doc = nlp("Le patient souffre de tuberculose. On débute une antibiothérapie.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Out: [('tuberculose', 'diag'), ('antibiothérapie', 'treat')]
  1. You could also use EDS-NLP's data API:
    import edsnlp
    
    train_docs = edsnlp.data.from_iterable(
        [
            "Le patient a une <diag>pneumonie</diag>.",
            "On prescrit l'<treat>antibiothérapie</treat>.",
        ],
        converter="markup",
        preset="xml",
    )
    
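The resulting pipeline behaves like any other EDS-NLP pipeline, so you can stream several documents through it at once; a minimal sketch, reusing the nlp object defined above:

texts = [
    "Le patient souffre de tuberculose.",
    "On débute une antibiothérapie.",
]
for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])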

You can also control the prompt more finely by providing a callable instead of a string. For instance, let's put all few-shot examples in the system message, and the actual user query in a single user message:

def prompt(doc_text, examples):
    system_content = (
        "You are a XML-based extraction assistant.\n"
        "Here are some examples of what's expected:\n"
    )
    for ex_text, ex_markup in examples:
        system_content += f"- User: {ex_text}\n"
        system_content += f"  Bot answer: {ex_markup}\n"
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": doc_text},
    ]
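
Since the callable is plain Python, you can inspect the messages it builds before wiring it into the component. The few-shot tuple below is made up for illustration:

messages = prompt(
    "Le patient a une pneumonie.",
    [("Exemple de texte.", "Exemple de <diag>texte</diag>.")],
)
for message in messages:
    print(message["role"], "->", message["content"])

The callable is then passed to the component exactly like the string prompt, via eds.llm_markup_extractor(prompt=prompt, ...).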

Parameters

nlp

Pipeline object.

TYPE: PipelineProtocol DEFAULT: None

name

Component name.

TYPE: str DEFAULT: None

api_url

The base URL of the OpenAI-compatible API. You must explicitly provide this to avoid leaking requests to the public OpenAI API. Should you work with sensitive data, consider using a self-hosted model.

TYPE: str

model

The model name to use. Must be available on the API server.

TYPE: str

markup_mode

The markup format to use when formatting the few-shot examples and parsing the model's output. Either "xml" (default) or "md" (Markdown). Make sure the prompt template matches the chosen format.

TYPE: Literal['xml', 'md'] DEFAULT: xml

alignment_threshold

The threshold used to align the model's output with the original text.

TYPE: float DEFAULT: 0.0

prompt

The prompt is the main way to control the model's behavior. It can be either:

  • A string, which will be used as the system prompt. Few-shot examples (if any) will be provided as user/assistant message pairs before the actual user query.
  • A callable that takes two arguments and returns a list of messages in the format expected by the OpenAI chat completions API:

    • doc_text: the text of the document to process
    • examples: a list of few-shot examples, each being a tuple of (text, markup-annotated text)

TYPE: Union[str, Callable[[str, List[Tuple[str, str]]], List[Dict[str, str]]]]

examples

Few-shot examples to provide to the model. The more examples, the better, but the total number of prompt tokens must stay below the model's context size. If use_retriever is set to True, the most relevant examples will be selected automatically.

TYPE: Optional[Iterable[Doc]] DEFAULT: ()

max_few_shot_examples

The maximum number of few-shot examples to provide to the model. Defaults to -1 (all examples).

TYPE: int DEFAULT: -1

use_retriever

Whether to use a retriever to select the most relevant few-shot examples. If None (default), it will be set to True if max_few_shot_examples is greater than 0 and the number of examples is greater than max_few_shot_examples. If set to False, the first max_few_shot_examples will be used.

TYPE: Optional[bool] DEFAULT: None

context_getter

This parameter controls the contexts given to the model for each request. It can be used to split the document into smaller chunks, for instance into sentences by setting context_getter="sents", or to process only part of the document, for instance with context_getter={"sections": "conclusion"}. If None (default), the whole document is processed in a single request.

TYPE: Optional[SpanGetterArg] DEFAULT: None
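
For example, to run the model only on the conclusion of each document, a section detector must populate doc.spans["sections"] first. A sketch, assuming the eds.normalizer and eds.sections pipes and reusing the server and prompt from the example above:

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(eds.sections())  # populates doc.spans["sections"]
nlp.add_pipe(
    eds.llm_markup_extractor(
        api_url="http://localhost:8080/v1",
        model="my-custom-model",
        prompt=prompt,
        # Only the detected "conclusion" sections are sent to the LLM
        context_getter={"sections": "conclusion"},
    )
)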

span_setter

On which span group (doc.spans[...] or doc.ents) to set the extracted entities.

TYPE: SpanSetterArg DEFAULT: {'ents': True}
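
For instance, to keep doc.ents untouched and store the extractions in a dedicated span group instead (the group name "llm" below is arbitrary):

nlp.add_pipe(
    eds.llm_markup_extractor(
        api_url="http://localhost:8080/v1",
        model="my-custom-model",
        prompt=prompt,
        # Entities end up in doc.spans["llm"] instead of doc.ents
        span_setter={"llm": True},
    )
)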

span_getter

From which span group (doc.spans[...] or doc.ents) to get the annotated spans of the few-shot examples. Defaults to the same value as span_setter.

TYPE: Optional[SpanGetterArg] DEFAULT: None

seed

Optional seed forwarded to the API.

TYPE: Optional[int] DEFAULT: None

max_concurrent_requests

Maximum number of concurrent span requests per document.

TYPE: int DEFAULT: 1

api_kwargs

Extra keyword arguments forwarded to chat.completions.create.

TYPE: Dict[str, Any] DEFAULT: None
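
For instance, to make generations more deterministic and bound their length (temperature and max_tokens are standard chat-completions parameters):

eds.llm_markup_extractor(
    api_url="http://localhost:8080/v1",
    model="my-custom-model",
    prompt=prompt,
    # Forwarded verbatim to chat.completions.create
    api_kwargs={"temperature": 0.0, "max_tokens": 2048},
)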

on_error

Error handling strategy. If "raise", exceptions are raised. If "warn", exceptions are logged as warnings and processing continues.

TYPE: Literal['raise', 'warn'] DEFAULT: raise

Authors and citation

The eds.llm_markup_extractor component was developed by AP-HP's Data Science team.


  1. Naguib, M., Tannier, X. and Névéol, A., 2024. Few-shot clinical entity recognition in English, French and Spanish: masked language models outperform generative model prompting. Findings of the Association for Computational Linguistics: EMNLP 2024. doi:10.18653/v1/2024.findings-emnlp.400