Extractive Question Answering

The eds.extractive_qa component is a trainable extractive question answering component. It can be seen as a Named Entity Recognition (NER) component in which the entity types are not fixed at training time but are provided as prompts (i.e., questions) at inference time.

The eds.extractive_qa component shares many similarities with the eds.ner_crf component, so most of their arguments are the same.

Extractive vs Abstractive Question Answering

Extractive Question Answering differs from Abstractive Question Answering in that the answer is extracted from the text, rather than generated (à la ChatGPT) from scratch. To normalize the answers, you can use the eds.span_linker component in synonym mode and search for the closest synonym in a predefined list.
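To make the distinction concrete, the sketch below represents an extractive answer as a pair of character offsets into the original text, then normalizes it against a predefined synonym list. It is only an illustration: the text, the synonym list and the `normalize` helper are hypothetical, and plain string similarity from the standard library `difflib` module stands in for eds.span_linker, whose actual matching method is not shown here.

```python
import difflib
from typing import Dict, Optional

text = "The patient was admitted for type 2 diabetes and treated with metformin."

# An extractive answer is a slice of the input text, not newly generated text.
start = text.index("type 2 diabetes")
end = start + len("type 2 diabetes")
answer = text[start:end]

# Hypothetical synonym list: lowercase surface form -> normalized concept.
synonyms = {
    "type 2 diabetes": "DIABETES_T2",
    "type ii diabetes mellitus": "DIABETES_T2",
    "high blood pressure": "HYPERTENSION",
}


def normalize(answer: str, synonyms: Dict[str, str], cutoff: float = 0.6) -> Optional[str]:
    """Return the concept of the closest synonym, or None if nothing is close enough."""
    matches = difflib.get_close_matches(
        answer.lower(), list(synonyms), n=1, cutoff=cutoff
    )
    return synonyms[matches[0]] if matches else None


concept = normalize(answer, synonyms)
```

Because the answer is a slice of the document, it can always be traced back to its exact position in the text, which is not guaranteed for a generated answer.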

Examples

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.extractive_qa(
        embedding=eds.transformer(
            model="prajjwal1/bert-tiny",
            window=128,
            stride=96,
        ),
        mode="joint",
        target_span_getter="ner-gold",
        span_setter="ents",
        questions={
            "disease": "What disease does the patient have?",
            "drug": "What drug is the patient taking?",
        },
    ),
    name="qa",
)

To train the model, refer to the Training tutorial.

Once the model is trained, you can either set the questions per document through the questions attribute (see Dynamic Questions below), or change the component's global questions attribute:

nlp.pipes.qa.questions = {
    "disease": "When did the patient get sick?",
}
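Each entity type can map to a single question or to a list of questions (see the `questions` parameter below). The following plain Python sketch shows one way such a mapping could be normalized so that every entity type has a list of questions. The `as_question_lists` helper and the second "drug" question are hypothetical and only illustrate the accepted shapes; this is not the library's own code.

```python
from typing import Dict, List, Union


def as_question_lists(
    questions: Dict[str, Union[str, List[str]]]
) -> Dict[str, List[str]]:
    """Wrap single-string questions in a list, leave lists as they are."""
    return {
        label: [value] if isinstance(value, str) else list(value)
        for label, value in questions.items()
    }


questions = {
    # A single question given as a plain string
    "disease": "What disease does the patient have?",
    # Several phrasings for the same entity type (second one is made up)
    "drug": [
        "What drug is the patient taking?",
        "Which medication was prescribed?",
    ],
}

normalized = as_question_lists(questions)
```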

Dynamic Questions

You can also provide

eds.extractive_qa(..., questions_attribute="questions")

to get the questions dynamically from an attribute on the Doc or Span objects (e.g., doc._.questions). This is useful when you want to have different questions depending on the document.

To provide questions from a dataframe, you can use the following code:

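import pandas as pd

dataframe = pd.DataFrame({"questions": ..., "note_text": ..., "note_id": ...})
stream = edsnlp.data.from_pandas(
    dataframe,
    converter="omop",
    # Copy the "questions" column to the "questions" attribute of each Doc
    doc_attributes={"questions": "questions"},
)
# Reassign the result of each call, since stream operations return a new stream
stream = stream.map_pipeline(nlp)
stream = stream.set_processing(backend="multiprocessing")
out = stream.to_pandas(converter="ents")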

Parameters

PARAMETER	DESCRIPTION
`name`	Name of the component TYPE: `str`
`embedding`	The word embedding component TYPE: `WordEmbeddingComponent`
`questions`	The questions to ask, as a mapping between the entity type and the list of questions to ask for this entity type (or single string if only one question). TYPE: `Dict[str, AsList[str]]` DEFAULT: `{}`
`questions_attribute`	The attribute to use to get the questions dynamically from the Doc or Span objects (as returned by the `context_getter` argument). If None, the questions will be fixed and only taken from the `questions` argument. TYPE: `Optional[str]` DEFAULT: `questions`
`context_getter`	What context to use when computing the span embeddings (defaults to the whole document). For example `{"section": "conclusion"}` to only extract the entities from the conclusion. TYPE: `Optional[SpanGetterArg]` DEFAULT: `None`
`target_span_getter`	Method to call to get the gold spans from a document, for scoring or training. By default, takes all entities in `doc.ents`, but we recommend you specify a given span group name instead. TYPE: `SpanGetterArg` DEFAULT: `None`
`span_setter`	The span setter to use to set the predicted spans on the Doc object. If None, the component will infer the span setter from the target_span_getter config. TYPE: `Optional[SpanSetterArg]` DEFAULT: `None`
`infer_span_setter`	Whether to complete the span setter from the target_span_getter config. False by default, unless the span_setter is None. TYPE: `Optional[bool]` DEFAULT: `None`
`mode`	The CRF mode to use: independent, joint or marginal TYPE: `Literal['independent', 'joint', 'marginal']` DEFAULT: `joint`
`window`	The window size to use for the CRF. If 0, will use the whole document, at the cost of a longer computation time. If 1, this is equivalent to assuming that the tags are independent: the component will be faster, but with degraded performance. Empirically, we found that a window size of 10 or 20 works well. TYPE: `int` DEFAULT: `40`
`stride`	The stride to use for the CRF windows. Defaults to `window // 2`. TYPE: `Optional[int]` DEFAULT: `None`
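
The interaction between `window` and `stride` can be illustrated with a small sketch that lists the token ranges covered by overlapping windows. This is a simplified illustration under stated assumptions, not the component's implementation: the `sliding_windows` helper is hypothetical, and the library's actual windowing may handle boundaries differently. It only mirrors the documented conventions that a window of 0 covers the whole document and that the stride defaults to `window // 2`.

```python
from typing import List, Optional, Tuple


def sliding_windows(
    n_tokens: int, window: int, stride: Optional[int] = None
) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges of overlapping windows."""
    # A window of 0 (or one larger than the document) covers everything.
    if window == 0 or window >= n_tokens:
        return [(0, n_tokens)]
    if stride is None:
        stride = window // 2  # documented default
    stride = max(stride, 1)  # guard against a zero step when window == 1
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans


# With the default window of 40, the stride is 20 and windows overlap by half.
spans = sliding_windows(100, window=40)
```

A smaller stride increases the overlap between consecutive windows, so each token is seen with more surrounding context, at the cost of processing more windows.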