GPT-2 for Question Answering

One of the questions that I have been particularly interested in since the early days of the OpenAI Scholars Program is how reasoning and inference can be improved in Natural Language Understanding (NLU). Existing methods approach reasoning with various forms of neural network models or ensemble learning, and they are mainly evaluated on Question Answering (QA), because the QA task provides a quantifiable way to test how well a system can reason about a question and predict its correct answer. During the last month, I have explored how we could enhance reasoning by taking advantage of highly successful language models, such as GPT-2, and fine-tuning them for QA, a task on which GPT-2 has only rudimentary results in its zero-shot setting. I see this as a humble step towards a better understanding of reasoning. I fine-tuned the first public release of GPT-2 (117M) on the Stanford Question Answering Dataset (SQuAD) 2.0. I took a research-focused approach: the project was inspired by questions such as how we can make the most of unsupervised language models’ potential, and whether GPT-2’s performance in language modeling can generalize to downstream tasks, such as QA, with minimal or no adaptation.

Language Models and Semi-Supervised Learning

With the recent success of pre-trained language models, NLP has gained the ability to take advantage of transfer learning, a machine learning technique that has been instrumental in the advancement of computer vision research for years. Language models, such as GPT-2, have outperformed methods built on earlier NLP techniques, such as word embeddings. Through generative pre-training, such models let us apply unsupervised learning to large corpora of text that would be impractical, if not impossible, to use in a supervised manner. A semi-supervised learning approach, consisting of unsupervised pre-training paired with supervised fine-tuning, has demonstrated considerable success in many NLP tasks. Consequently, it has become common practice in NLP research to fine-tune a pre-trained language model for a specific task in search of the best possible accuracy.

Much of this work focuses on models that have produced successful results by utilizing semi-supervised learning. The aim is to design sophisticated neural network architectures, consisting of either a single model or an ensemble of a few, and to train them on labeled data on top of the pre-trained model. This approach has allowed researchers to utilize the power of language models while also letting them contemplate the possibility of attaining more general systems in a zero-shot setting, wherein models do not need labeled data to train. Until we are able to generalize the high performance of language models, the intersection between unsupervised and supervised learning methods presents an opportunity for useful experiments that not only help advance the field of NLP research but also let us understand language models better.

GPT-2 and SQuAD

GPT-2, a Transformer-based language model and the successor to GPT, has shown unprecedented performance in language modeling, primarily because it has over an order of magnitude more parameters than its predecessor. While GPT-2’s performance on QA with no task-specific training is embryonic, it indicates that an unsupervised language model could contribute to the performance of QA systems through fine-tuning.

Figure 1: GPT-2 F1 score for reading comprehension in the zero-shot setting
Image source: “Better Language Models and Their Implications”

The pre-training task for GPT-2 is language modeling, and unlike GPT, it does not have any task-specific fine-tuning. Instead, downstream tasks are expressed as conditional probabilities over text. For the model to perform a QA task, for example, it is given the context along with pairs of questions and answers about it.
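As a rough illustration of how such a conditioning text might be assembled, here is a minimal sketch. The function name `build_qa_prompt` and the `Q:` / `A:` markers are assumptions made for this example, not a format prescribed by GPT-2 or used in my experiments. The idea is only that the prompt ends on an answer cue, so that whatever the model generates next can be read as its answer.

```python
def build_qa_prompt(context, examples, question):
    """Format a passage, solved Q/A pairs, and a new question as one prompt.

    Hypothetical helper: the "Q:" / "A:" layout is an illustrative assumption.
    """
    lines = [context.strip(), ""]
    for q, a in examples:
        lines.append(f"Q: {q.strip()}")
        lines.append(f"A: {a.strip()}")
    lines.append(f"Q: {question.strip()}")
    # End on the answer cue so the model's continuation is read as the answer.
    lines.append("A:")
    return "\n".join(lines)


prompt = build_qa_prompt(
    context="The Eiffel Tower was completed in 1889 in Paris.",
    examples=[("Where is the Eiffel Tower?", "Paris")],
    question="When was the Eiffel Tower completed?",
)
print(prompt)
```

The resulting string would then be tokenized and passed to the language model, with generation stopped at the first line break.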

SQuAD 2.0, a reading comprehension dataset, consists of questions on Wikipedia articles, where the answer is a span of text extracted from the passage that answers the question in a logical and cogent manner. Unlike version 1.0, SQuAD 2.0 includes more than 50,000 unanswerable questions written adversarially to look similar to answerable ones. Thus, a system needs to determine when no answer is supported by the paragraph and abstain from answering. Since SQuAD is a closed dataset in which the answer to every answerable question is contained in its context, a system applied to this dataset needs to learn to answer factoid-style questions. Many models that have performed well on the dataset so far have used various forms of attention flow mechanisms to match the questions to strings in the text. Features that cannot be extracted easily from a given context, such as common sense and reasoning, are woven into the questions and answers that were annotated by humans.


I fine-tuned the first public release of GPT-2 (117M) by adding a linear classifier layer that takes the output of the pre-trained model as its input. I worked in PyTorch, using Hugging Face’s PyTorch implementation of GPT-2, and based my experiment on their BERT question answering model, modified to run on GPT-2.
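In the BERT-style question answering setup, the linear layer maps each token’s hidden state to two numbers: a score for being the start of the answer and a score for being its end. The sketch below shows one way such scores could be decoded into an answer span, including abstention for unanswerable questions. It is plain Python rather than my actual PyTorch code. The function name `best_span`, the `max_answer_len` limit, and the choice of reserving position 0 as the "no answer" slot (as BERT-style setups do with the `[CLS]` token) are all assumptions made for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30, null_index=0):
    """Return (start, end) of the best-scoring span, or None to abstain.

    Hypothetical decoder: position `null_index` is assumed to stand for
    "no answer", and a span's score is start score + end score.
    """
    null_score = start_logits[null_index] + end_logits[null_index]
    best_score, best = float("-inf"), None
    n = len(start_logits)
    for s in range(n):
        if s == null_index:
            continue
        # Only consider spans that end at or after the start, within the limit.
        for e in range(s, min(n, s + max_answer_len)):
            if e == null_index:
                continue
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    # Abstain when "no answer" scores at least as high as the best real span.
    if best is None or null_score >= best_score:
        return None
    return best


answerable = best_span([0.1, 2.0, 0.3, 0.0], [0.1, 0.2, 3.0, 0.0])
unanswerable = best_span([5.0, 1.0, 0.0], [5.0, 0.0, 1.0])
print(answerable, unanswerable)
```

In practice the comparison against the null score is often done with a tuned threshold rather than a strict comparison; the version above keeps the simplest possible rule.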

After dissecting the inner mechanisms of GPT-2 in order to design a suitable fine-tuning system, I experimented with a few different variations of the fine-tuning layer, including a BiLSTM, a naive effort to circumvent the unidirectionality of GPT-2’s architecture and smooth out the decoding of its outputs. However, most of these attempts produced only a trivial increase in the Exact Match (EM) score and a slight decrease in the F1 score compared to the linear model. The initial fine-tuned model did not perform well on the QA task on SQuAD: after three epochs, it reached a score of 51 on both the EM and F1 metrics. After trying variations of tokenization, tuning hyperparameters, and adding more epochs, I was able to improve the results only slightly.
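For readers unfamiliar with the two metrics, here is a minimal sketch of how EM and F1 can be computed for a single prediction. It follows the usual SQuAD-style normalization (lowercasing, removing punctuation and English articles) as I understand it, but it is a simplification: the official evaluation also takes the best score over several reference answers, which is omitted here. The function names are my own for this example.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    # An empty string stands for "no answer": credit only if both are empty.
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower!", "eiffel tower"))
print(f1_score("in Paris, France", "Paris"))
```

EM rewards only exact agreement after normalization, while F1 gives partial credit for overlapping tokens, which is why the two scores can move in different directions when the fine-tuning layer changes.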

These results are unremarkable compared to those of the more advanced models listed on the SQuAD leaderboard. However, with more refined fine-tuning, more compute and training, and the very recently released GPT-2 345M, which I have not yet had the chance to experiment with, GPT-2’s potential to perform well on QA could be represented better. The results that I have attained so far are indicative of the weaknesses of the model I implemented and, potentially, of the size of the language model I used. As this is an ongoing experiment, I am excited to explore the possibilities of improving performance on the QA task and on reasoning by using more complex datasets, such as HotpotQA, in the coming weeks.

In my experiments, I focused more on identifying the reasons behind the relatively mediocre performance on a factoid-style question answering task than on the scores themselves. Based on my observations across each iteration of changes to my model, these reasons likely stem from the architectural mechanisms of GPT-2, the nature of the QA task, and the structure of SQuAD. The contrast between the low results of the fine-tuned model and the remarkable results of the language model itself is informative about which future improvements to Transformer-based language modeling could raise accuracy on downstream tasks with minimal or even no task-specific adaptation. A more sophisticated fine-tuning experiment with a better extraction system and more compute would very likely generate better results.

Future Research

In my future research, I am interested in experimenting with ways of improving reasoning in Natural Language Understanding (NLU) through pre-trained language models, with the goal of approaching human-level reasoning. Using the recent release of GPT-2 345M, I plan to build systems and test them on datasets that require multi-step and implicit reasoning, rather than on factoid-style question answering datasets such as SQuAD.

With datasets that reflect human nature and intelligence to a greater degree, models powered by supervised and unsupervised learning might allow us to improve reasoning in detecting answers, sentiments, biases, and adversariality among other tasks and paradigms while the search for more generalized architectures continues. I plan to continue working on NLU research in my future endeavors.


I am grateful for the wonderful experience the OpenAI Scholars Program has provided me. I would like to extend my sincere thanks to my mentor Jonathan Raiman, who has been very supportive and whose pertinent guidance has been invaluable for my learning throughout the program. I also want to thank my cohort, whose talent, work and ideas have been truly inspiring and who have sincerely shown their support and friendship; the OpenAI team members that I had the chance to meet and whose work I learned from as they generously shared their knowledge; and Maddie Hall, who has been truly amazing in organizing the program and in her effective communication with the scholars.
