LLM Markup Extractor
The eds.llm_markup_extractor component extracts entities using a Large Language Model (LLM) by prompting it to annotate the text with a markup format (XML or Markdown). The component can be configured with a set of labels to extract, and can be provided with few-shot examples to improve performance.
In practice, along with a system prompt that describes the allowed labels, the annotation format and few-shot examples (if any), the component sends the LLM text such as:
Le patient a une néphropathie diabétique.
and expects in return the same text annotated with the entities, for instance:
Le patient a une <diag>néphropathie diabétique</diag>.
which is then parsed to extract the entities. This approach is close to that of Naguib et al., 2024, but supports multiple markup formats and multi-label prompts. See their paper for more details on the prompting strategies and their performance.
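Conceptually, turning the annotated answer back into character spans amounts to stripping the tags while tracking offsets in the tag-free text. The following is an illustrative, self-contained sketch of that idea only; it is not the component's actual parser, which also aligns the model's output with the original text when the model rewrites it imperfectly:

```python
import re


def parse_markup(annotated: str):
    """Strip XML-like tags and return the plain text plus (start, end, label) spans."""
    spans, plain, stack = [], [], []
    i = 0  # position in the annotated text
    pos = 0  # position in the plain (tag-free) text
    for m in re.finditer(r"<(/?)(\w+)>", annotated):
        plain.append(annotated[i : m.start()])
        pos += m.start() - i
        if m.group(1):  # closing tag: pop the matching opening tag
            start, label = stack.pop()
            spans.append((start, pos, label))
        else:  # opening tag: remember where the span starts
            stack.append((pos, m.group(2)))
        i = m.end()
    plain.append(annotated[i:])
    return "".join(plain), spans


text, spans = parse_markup("Le patient a une <diag>néphropathie diabétique</diag>.")
# text  == "Le patient a une néphropathie diabétique."
# spans == [(17, 40, "diag")]
```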
Experimental
This component is experimental. The API and behavior may change in future versions. Make sure to pin your edsnlp version if you use it in a project.
Dependencies
This component requires several dependencies. Run the following command to install them:
pip install openai bm25s Stemmer
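Alternatively, the same dependencies can be declared in your project file, for instance in pyproject.toml (a sketch; version pins are up to you):

```toml
[project]
dependencies = [
    "edsnlp",
    "openai",
    "bm25s",
    "Stemmer",
]
```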
or add them to your pyproject.toml or requirements.txt.
Examples
If your data is sensitive, we recommend using a self-hosted model with an OpenAI-compatible API, such as one served by vLLM.
Start a server with the model of your choice:
python -m vllm.entrypoints.openai.api_server \
--model mistral-small-24b-instruct-2501 \
--port 8080 \
--enable-prefix-caching
You can then use the llm_markup_extractor component as follows:
import edsnlp, edsnlp.pipes as eds
prompt = """
You are an XML-based extraction assistant.
For every piece of text the user provides, you will rewrite the full text
word for word, adding XML tags around the relevant pieces of information.
You must follow these rules strictly:
- You must only use the provided tags. Do not invent new tags.
- You must follow the original text exactly: do not alter it, only add tags.
- You must always close every tag you open.
- If a piece of text does not contain any of the information to extract, you must return the text unchanged, without any tags.
- Be consistent in your answers, similar queries must lead to similar answers, do not try to fix your prior answers.
- Do not add any comment or explanation, just write the text with tags.
Example with a <noun_group> tag:
User query: "This is a sample document."
Assistant answer: "This is <noun_group>a sample document</noun_group>."
The tags to use are the following:
- <diag>: A medical diagnosis
- <treat>: A medical treatment
""".strip()
# EDS-NLP util to create documents from Markdown or XML markup.
# This has nothing to do with the LLM component itself.
conv = edsnlp.data.converters.MarkupToDocConverter(preset="xml")
train_docs = [  # (1)!
    conv("Le patient a une <diag>pneumonie</diag>."),
    conv("On prescrit l'<treat>antibiothérapie</treat>."),
    # ... add more examples if you can
]
nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(
    eds.llm_markup_extractor(
        # OpenAI-compatible API like the local vLLM server above
        api_url="http://localhost:8080/v1",
        model="my-custom-model",
        examples=train_docs,
        # Apply the model to each sentence separately
        context_getter="sents",
        # String or function that returns a list of messages (see below)
        prompt=prompt,
        use_retriever=True,
        # For each request, show the model the closest example
        max_few_shot_examples=1,
        # Up to 5 requests in parallel
        max_concurrent_requests=5,
    )
)
doc = nlp("Le patient souffre de tuberculose. On débute une antibiothérapie.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Out: [('tuberculose', 'diag'), ('antibiothérapie', 'treat')]
- You could also use EDS-NLP's data API:
import edsnlp

train_docs = edsnlp.data.from_iterable(
    [
        "Le patient a une <diag>pneumonie</diag>.",
        "On prescrit l'<treat>antibiothérapie</treat>.",
    ],
    converter="markup",
    preset="xml",
)
You can also control the prompt more finely by providing a callable instead of a string. For instance, let's put all few-shot examples in the system message, and the actual user query in a single user message:
def prompt(doc_text, examples):
    system_content = (
        "You are an XML-based extraction assistant.\n"
        "Here are some examples of what's expected:\n"
    )
    for ex_text, ex_markup in examples:
        system_content += f"- User: {ex_text}\n"
        system_content += f"  Bot answer: {ex_markup}\n"
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": doc_text},
    ]
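As a quick sanity check (the example pair below is invented for illustration, and the function is repeated so the snippet is self-contained), such a callable returns a two-message conversation, with all few-shot pairs folded into the system message:

```python
def prompt(doc_text, examples):
    system_content = (
        "You are an XML-based extraction assistant.\n"
        "Here are some examples of what's expected:\n"
    )
    for ex_text, ex_markup in examples:
        system_content += f"- User: {ex_text}\n"
        system_content += f"  Bot answer: {ex_markup}\n"
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": doc_text},
    ]


messages = prompt(
    "Le patient souffre de tuberculose.",
    [("Le patient a une pneumonie.", "Le patient a une <diag>pneumonie</diag>.")],
)
# messages[0] is the system message containing the few-shot pair,
# messages[1] is {"role": "user", "content": "Le patient souffre de tuberculose."}
```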
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| nlp | The pipeline object. |
| name | The component name. |
| api_url | The base URL of the OpenAI-compatible API. You must provide this explicitly to avoid leaking requests to the public OpenAI API. If you work with sensitive data, consider using a self-hosted model. |
| model | The model name to use. It must be available on the API server. |
| markup_mode | The markup format used to format the few-shot examples and parse the model's output. Either "xml" (default) or "md" (Markdown). Make sure the prompt matches the chosen format. |
| alignment_threshold | The threshold used to align the model's output with the original text. |
| prompt | The main way to control the model's behavior. Either a string used as the system message, or a callable that takes the document text and the few-shot examples and returns a list of messages (see above). |
| examples | Few-shot examples to provide to the model. The more the better, but the total number of tokens in the prompt must stay below the model's context size. |
| max_few_shot_examples | The maximum number of few-shot examples to provide to the model. Defaults to -1 (all examples). |
| use_retriever | Whether to use a retriever to select the most relevant few-shot examples for each request. |
| context_getter | Controls the contexts sent to the model for each request. It can be used to split the document into smaller chunks, for instance sentences with context_getter="sents". |
| span_setter | On which span group(s) to write the extracted entities. |
| span_getter | From which span group(s) to read the spans. |
| seed | Optional seed forwarded to the API. |
| max_concurrent_requests | Maximum number of concurrent span requests per document. |
| api_kwargs | Extra keyword arguments forwarded to the API client. |
| on_error | The error handling strategy. |
Authors and citation
The eds.llm_markup_extractor component was developed by AP-HP's Data Science team.
Naguib M., Tannier X. and Névéol A., 2024. Few-shot clinical entity recognition in English, French and Spanish: masked language models outperform generative model prompting. 10.18653/v1/2024.findings-emnlp.400