Generative Document Question Answering with HuggingFace

Generating answers about a given document/article using pre-trained models on HuggingFace.
transformer
llm
machine learning
HuggingFace
Author

Stefan Schneider

Published

January 11, 2025

Modified

January 11, 2025

Extractive Question Answering

In a previous blog post, I showed how to answer document-related questions with HuggingFace LLMs in just a few lines of Python code and how to visualize the results in a simple Gradio app.

In that blog post, I used the standard question-answering pipeline from HuggingFace. This pipeline defaults to a DistilBERT model (a smaller BERT model) fine-tuned on the Stanford Question Answering Dataset (SQuAD). This model and dataset are meant for extractive question answering as illustrated in the following example:

%%capture --no-display
%pip install -U pypdf torch transformers
from transformers import pipeline

extractive_qa = pipeline(task="question-answering")

# Abstract from "Attention is all you need" by Vaswani et al.: https://arxiv.org/abs/1706.03762
abstract = """The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task...
"""
question = "What's a transformer?"

extractive_qa(question=question, context=abstract)
No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0
{'score': 0.4559027850627899,
 'start': 287,
 'end': 302,
 'answer': 'the Transformer'}

The pipeline is given a text as input, here parts of the “Attention is all you need” abstract (see arxiv), and a question that should be answered based on the given text/context.

Rather than an answer in natural language, the model outputs an excerpt that is extracted from the original context, given by a start and end index within it. While this allows concise answers with a clear reference to the original source, the answers are not very natural or accurate. The model has no way of combining and merging information from different places in the original text, since it can only return a single contiguous excerpt.
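To illustrate, the returned indices simply point into the given context. A minimal sketch, assuming the pipeline result from above is stored in a variable:

# Store the pipeline result to access its start/end indices.
result = extractive_qa(question=question, context=abstract)
# The answer is just a contiguous slice of the original context.
print(abstract[result["start"]:result["end"]])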

In the example above, I asked what a transformer is and the model simply answered “the Transformer”. Not very helpful! (Note that the answer may be slightly different in the future, since I did not pin a model and model version in the pipeline.)
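To make the results reproducible, the model and revision can be pinned explicitly. A minimal sketch, using the default model and revision hash reported in the pipeline warning above:

# Pin the default model and revision (taken from the warning above) for reproducible results.
extractive_qa_pinned = pipeline(
    task="question-answering",
    model="distilbert/distilbert-base-cased-distilled-squad",
    revision="564e9b5",
)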

Even passing the entire article into the model as context does not improve the answer: it still only outputs “Transformer” as the answer.

# Read PDF
from pathlib import Path
from typing import Union
from pypdf import PdfReader


def get_text_from_pdf(pdf_file: Union[str, Path]) -> str:
    """Read the PDF from the given path and return a string with its entire content."""
    reader = PdfReader(pdf_file)
    # Extract text from all pages
    full_text = ""
    for page in reader.pages:
        full_text += page.extract_text()
    return full_text


# Read in the full article downloaded from https://arxiv.org/abs/1706.03762
full_article = get_text_from_pdf("transformer-paper.pdf")
# Print first few characters of the paper
print(full_article[:300])
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Par
# Try to answer the same question as before with the full article as context
extractive_qa(question=question, context=full_article)
{'score': 0.20687614381313324,
 'start': 22735,
 'end': 22746,
 'answer': 'Transformer'}

Generative Question Answering

As shown above, extractive question answering is about answering a question by providing an excerpt from the given context. In contrast, generative or abstractive question answering (Q&A) provides generated answers that do not directly reference any parts of the original context.

Such generated answers often sound more natural and can be more useful. On the other hand, there is no clear link to the original source and the answer may be just a hallucination of the model.

In the following, I try to build a generative Q&A pipeline. While encoder-only models like BERT are best for extractive Q&A, encoder-decoder or decoder-only models are better suited to generate natural answers for generative Q&A.

Existing Models for Generative Q&A

Let’s use an existing encoder-decoder model from HuggingFace to try generative Q&A, e.g., the FLAN-T5. In comparison to the normal T5 model, the FLAN-T5 was fine-tuned on more downstream tasks:

If you already know T5, FLAN-T5 is just better at everything. For the same number of parameters, these models have been fine-tuned on more than 1000 additional tasks covering also more languages

HuggingFace does not have a pre-defined “generative Q&A” pipeline task. Instead, this belongs to “Text2Text Generation”: the input consists of the context and the question, and the output is the generated answer.

The following code uses the FLAN-T5 model to generate an answer based on the full “Attention is all you need” article for the same question as above: What’s a transformer?

generative_qa_t5 = pipeline(task="text2text-generation", model="google/flan-t5-base")
input_text = f"{full_article} Given this context, please answer the following question. {question}"
generative_qa_t5(input_text)
Device set to use mps:0
Token indices sequence length is longer than the specified maximum sequence length for this model (10385 > 512). Running this sequence through the model will result in indexing errors
[{'generated_text': 'a model architecture relying entirely on self-attention to compute representations of its input and'}]

“a model architecture relying entirely on self-attention to compute representations of its input and”

Not bad! The sentence breaks off abruptly, but the generated answer still makes sense, much more so than the extracted answer above.

Dealing with Limited Sequence Length

While the answer was good, there was a warning in the output of the pipeline above:

Token indices sequence length is longer than the specified maximum sequence length for this model (10385 > 512).

The configured FLAN-T5 model can only handle input sequences of at most 512 tokens. The full research article is much longer (a bit more than 10k tokens).
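This is easy to verify by tokenizing the input explicitly. A minimal sketch, using the tokenizer of the same FLAN-T5 checkpoint:

# Count the tokens of the full input and compare with the model's maximum sequence length.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
num_tokens = len(tokenizer(input_text)["input_ids"])
print(num_tokens, tokenizer.model_max_length)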

Apparently, the HuggingFace pipeline has some built-in mechanism to handle such overly long sequences: the model still produced a sensible answer and did not crash despite the sequence being too long.

Splitting the Sequence into Shorter Parts

A simple approach to handling such overly long sequences is to split them into smaller parts that fit into the model’s maximum sequence length. Let’s split the full text into 20 equally sized parts, so that each part stays roughly within the 512-token limit.

# Split the full text into parts and use them separately for answering the question.
def split_text_into_parts(full_text: str, num_parts: int) -> list[str]:
    """Split the given full text into a list of equally sized parts."""
    len_per_part: int = int(len(full_text) / num_parts)
    return [full_text[i * len_per_part : (i+1) * len_per_part] for i in range(num_parts)]

text_parts = split_text_into_parts(full_article, num_parts=20)
for text_part in text_parts:
    input_text = f"{text_part} Given this context, please answer the following question. {question}"
    print(generative_qa_t5(input_text))
[{'generated_text': 'based solely on attention mechanisms'}]
[{'generated_text': 'tensor2tensor'}]
[{'generated_text': 'a model architecture eschewing recurrence and instead relying entirely on'}]
[{'generated_text': 'first transduction model relying entirely on self-attention to compute representations of its input and'}]
[{'generated_text': 'a decoder'}]
[{'generated_text': 'a single attention head'}]
[{'generated_text': 'encoder-decoder attention mechanisms'}]
[{'generated_text': 'encoder and decoder stacks'}]
[{'generated_text': 'encoder or decoder'}]
[{'generated_text': 'self-attention layer'}]
[{'generated_text': 'regularization'}]
[{'generated_text': 'transformer'}]
[{'generated_text': 'translation'}]
[{'generated_text': 'transformer'}]
[{'generated_text': 'attention-based model'}]
[{'generated_text': 'tensorflow'}]
[{'generated_text': 'LSTM networks'}]
[{'generated_text': 'neural machine translation'}]
[{'generated_text': '[34]'}]
[{'generated_text': 'a syst'}]

Having split the text into 20 parts, we now get 20 answers. Some of them are more useful than others, depending on how much relevant information the corresponding part of the text contains. Answer 4 sounds very similar to the one provided by the pipeline when passing in the whole article: “first transduction model relying entirely on self-attention to compute representations of its input and”

It seems like, under the hood, the HuggingFace pipeline also splits the full text into multiple parts and applies the model to each one. Likely, it uses a more sophisticated splitting strategy with overlapping parts, such that no information is lost at the boundaries between two parts.
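Such an overlapping split could look like the sketch below. This is just my own illustration, not what the pipeline actually does internally, and the chunk size and overlap are arbitrary choices:

# Character-based chunking with overlap, so that text spanning a chunk boundary
# appears in full in at least one chunk (chunk size and overlap chosen arbitrarily).
def split_with_overlap(full_text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split the text into chunks of chunk_size characters that overlap by overlap characters."""
    step = chunk_size - overlap
    return [full_text[i : i + chunk_size] for i in range(0, len(full_text), step)]

overlapping_parts = split_with_overlap(full_article)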

To select the best of all the provided answers, one could compute a score for each answer, for example based on the average per-token score of the generated answer.
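A sketch of such scoring, using the model and tokenizer directly instead of the pipeline (assuming the same FLAN-T5 checkpoint; a higher average log probability means the model is more confident in its answer):

# Score each generated answer by its average per-token log probability.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def answer_with_score(prompt: str) -> tuple[str, float]:
    """Generate an answer and return it together with its average per-token log probability."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, return_dict_in_generate=True, output_scores=True)
    # Log probability of each generated token, normalized over the vocabulary.
    transition_scores = model.compute_transition_scores(
        outputs.sequences, outputs.scores, normalize_logits=True
    )
    answer = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
    return answer, transition_scores.mean().item()

# Pick the text part whose answer has the highest score.
prompts = [f"{part} Given this context, please answer the following question. {question}" for part in text_parts]
print(max((answer_with_score(p) for p in prompts), key=lambda pair: pair[1]))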

Using A Model with Long Sequence Length

An alternative to splitting a long sequence into smaller parts is to simply use another model with a longer supported sequence length, for example the Long-T5 model.

generative_qa_long_t5 = pipeline(task="text2text-generation", model="google/long-t5-local-base")
input_text = f"{full_article} Given this context, please answer the following question. {question}"
generative_qa_long_t5(input_text)
Some weights of LongT5ForConditionalGeneration were not initialized from the model checkpoint at google/long-t5-local-base and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use mps:0
/opt/homebrew/Caskroom/miniforge/base/envs/llm/lib/python3.12/site-packages/torch/nn/functional.py:5096: UserWarning: MPS: The constant padding of more than 3 dimensions is not currently supported natively. It uses View Ops default implementation to run. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Pad.mm:465.)
  return torch._C._nn.pad(input, pad, mode, value)
[{'generated_text': 'formation formation trains trains trains trains rebuild Destin formationpartnered 1941 1941 nouveaux formation formation formation formationassemblée Lin Lin'}]

As you can see, the new model does not complain about the sequence being too long. Instead, it outputs a warning because the model is not fine-tuned for any downstream tasks such as Q&A. As a result, the generated answer is rubbish.

For better results, we should fine-tune the model on a Q&A dataset (such as DuoRC). In addition to Long-T5, there are other models that focus explicitly on long sequence lengths, e.g., the Longformer and its encoder-decoder variant LED (Longformer Encoder-Decoder), which is more useful for generative Q&A.
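As a starting point for such fine-tuning, DuoRC can be loaded from the HuggingFace hub. A minimal sketch, assuming the ibm/duorc dataset id with the SelfRC configuration and requiring the datasets package:

# Load the DuoRC dataset (SelfRC configuration) as a basis for fine-tuning on Q&A.
from datasets import load_dataset

duorc = load_dataset("ibm/duorc", "SelfRC")
print(duorc)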

I plan to dive deeper into long sequence lengths in a future blog post.

What’s Next?