Neural Networks Intuitions: 18. Generative Pretrained Transformer(GPT) Series

Raghul Asokan
20 min read · Jun 2, 2024


Hello All!

It is great to be back with another article in my series “Neural Networks Intuitions” and for the first time, I am deviating from Computer Vision and diving into Natural Language Processing — you all know why :D

The past 1–1.5 years have been crazy with the release of GPT-3.5 and GPT-4 and their fantastic ability to solve a wide range of natural language tasks, right from question answering, summarization and visual QA to even code generation.

And anyone who has followed this recent trend would know about the talk regarding the emergence of AGI, which has led to two schools of thought: one set of people truly believes that language models can lead us to AGI, while the other set sharply denies this premise, arguing that although there are a few signs of reasoning, there is no true extrapolation as such and all of it comes down to the scale of the training data. The second group's criticism is that unless there is visibility into the training data distribution, a test set that is not contaminated by the training data, and a proper evaluation methodology, it is unfair to say that these LLMs truly lead us to AGI.

Now in order to truly validate some of these closed-source GPT models (such as GPT-4, Gemini, Claude, etc.), we would need to know the training data distribution, which is highly unlikely. But another way to understand whether LLMs can actually generalize to novel distributions is to dig deep into the evolution of these architectures/algorithms, more specifically the evolution of the GPT series.

In this article, I will be going over a series of GPT papers — GPT-1, GPT-2, GPT-3 and InstructGPT, and try to understand what makes these models truly generalizable or universal unlike the other architectures.

Note: The article is quite detailed and if you would want a quick overview, you can skip to Summary — Evolution of GPTs and My Take on LLMs.

1. Improving Language Understanding by Generative Pre-Training — GPT-1

Paper: GPT-1

To start with, let's go back to 2017–18, when NLP tasks were being solved much more effectively after the groundbreaking 2017 paper "Attention is All You Need" introduced Transformers.

It was the time when everyone started adopting Transformers for pretty much every NLP task, right from document generation and question answering to language translation and many more. Although there was now an effective neural network to solve the downstream tasks, the dependence on data was still a long-standing problem, i.e. every task required a good amount of labeled data in order for the neural networks to perform accurately.

Problem:

Given large amounts of unlabeled text data but scarce labeled text data, how can we effectively learn representations from raw text that transfer well to downstream finetuning tasks and reduce the dependency on labeled data?

Solution:

The solution that the authors propose in the paper is as follows:

  1. unsupervised pretraining using language modeling objective
  2. followed by supervised finetuning on the downstream task.

1. Unsupervised Pretraining using Language Models:

In this step, the authors propose to train a language model using Transformers on a large raw, unlabeled text corpus, which can then be transferred for downstream finetuning.

But before we go deep into the pretraining part, let's understand what a language model is and why it is necessary for learning effective transferable representations.

Language Model(LM):

In plain terms, a language model learns to predict the next token in a sentence given all the tokens preceding it.

It tries to learn a representation for every token (or word) that is a function of the entire input sentence or of the tokens preceding it (depending on whether the model is bidirectional or unidirectional).

Mathematically, for a corpus of tokens U = {u_1, …, u_n}, the objective from the original GPT paper is to maximize the likelihood

L1(U) = Σ_i log P(u_i | u_{i−k}, …, u_{i−1}; Θ)

where k is the size of the context window and the conditional probability P is modeled by a neural network with parameters Θ.
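To make the objective concrete, here is a minimal PyTorch sketch of next-token prediction training, assuming a `model` that maps a token-id tensor to per-position logits over the vocabulary (this is an illustration, not the paper's code):

```python
import torch.nn.functional as F

def language_modeling_loss(model, token_ids):
    # token_ids: (batch, seq_len) integer tensor over a raw text batch
    logits = model(token_ids[:, :-1])          # predictions for positions 2..n
    targets = token_ids[:, 1:]                 # the "next token" at every position
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (batch * (seq_len - 1), vocab)
        targets.reshape(-1),                   # (batch * (seq_len - 1),)
    )
```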

Where does this language modeling objective help?

LMs can be used for auto-complete, for text completion, as well as for open-ended text generation.

In general, language models (LMs) learn a probability distribution over the set of tokens in the output vocabulary, and with the help of Transformers they are able to model long-range dependencies much better (enabling larger context windows).

What is a context window?

The context window can be viewed as the receptive field covering the set of preceding tokens on which the current token is conditioned. The larger the context window, the more context the LM has for generating the next token accurately.

One forward pass of a language model can be seen as below:

Provide an input word, and the LM generates the next word -> feed both the first word and the generated word together as input again, and the LM generates the next word conditioned on the two input words -> and this goes on until a stop token is generated.

As you can see, a language model conditions on the preceding tokens or words to produce the next token/word, which is why it is often termed a generative model.
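Here is what that loop looks like in code: a purely illustrative sketch using the Hugging Face transformers library with a GPT-2 checkpoint (not the original GPT-1 implementation), doing greedy decoding.

```python
# Greedy autoregressive decoding: at each step, feed the whole sequence back in
# and append the single most likely next token, until the stop (EOS) token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The language model predicts", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                # generate at most 20 new tokens
        logits = model(input_ids).logits               # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1)      # condition on all preceding tokens
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:   # stop token reached
            break

print(tokenizer.decode(input_ids[0]))
```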

Coming back to the unsupervised pretraining, the authors train a language model on BooksCorpus dataset which contains over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance. This dataset also contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information.

2. Supervised finetuning with an auxiliary LM objective:

In order to effectively make use of the pretrained generative model, the authors reformulate the input of each downstream task into a sequence of tokens that can be processed by the pretrained model, followed by a linear layer that is finetuned.

GPT-1 — Supervised Finetuning for various tasks

The figure above shows the task-specific input transformations performed for various downstream tasks such as text classification, entailment, similarity and question answering. For all of these tasks, the inputs are structured as a sequence with the necessary delimiter tokens, fed to the LM, and a task-specific head is finetuned on top to produce the output.
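To make these input transformations concrete, here is a rough sketch (the special token names below are placeholders, not the paper's exact strings) of how different tasks are serialized into a single token sequence:

```python
# GPT-1 style task serialization: every downstream task becomes one sequence
# wrapped with start / delimiter / extract tokens, fed to the pretrained LM,
# and a linear head on top produces the task output.
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

def classification_input(text):
    return f"{START} {text} {EXTRACT}"

def entailment_input(premise, hypothesis):
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def similarity_inputs(text_a, text_b):
    # similarity has no natural ordering, so both orderings are scored and combined
    return [
        f"{START} {text_a} {DELIM} {text_b} {EXTRACT}",
        f"{START} {text_b} {DELIM} {text_a} {EXTRACT}",
    ]

print(entailment_input("A man is playing a guitar.", "A man is playing an instrument."))
```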

Results:

The paper shows significant performance improvements over previous task-specific architectures (with ensembles) on a variety of tasks such as QA, common-sense reasoning, text classification, similarity, natural language inference, etc.

GPT-1 Results

As stated in the paper,

Overall, our approach achieves new state-of-the-art results in 9 out of the 12 datasets we evaluate on, outperforming ensembles in many cases. Our results also indicate that our approach works well across datasets of different sizes, from smaller datasets such as STS-B (≈5.7k training examples) — to the largest one — SNLI (≈550k training examples).

However I would like to point out one key aspect here,

If you look closely at the Table 2 results for the SNLI dataset, which contains ~550K training samples, the improvement is quite small (0.6 points), unlike the large gains observed for downstream tasks with relatively little labeled data.

This shows that unsupervised pretraining helps most where labeled data is scarce rather than at large labeled-data scale, which roughly aligns with self-supervised learning (SSL) methods such as SimCLR/BYOL, which report similar findings for CV tasks.

How does generative pretraining help in improving accuracy for downstream finetuning?

Similar to SSL, a Generative Pretrained Transformer (GPT) provides a really good initialization point, through effective transferable representations, for the downstream model to learn and converge faster.

Zero-Shot Behaviours of GPT:

Why do language models learn effective representations?

A hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability and that the more structured attentional memory of the transformer assists in transfer compared to LSTMs.

With the help of some heuristic solutions, the authors use the underlying pretrained generative model to perform various NLP tasks without supervised finetuning, in order to evaluate GPT's zero-shot ability.

GPT-1’s Zero Shot Behaviour

They show that the performance of these heuristics is stable and steadily increases over training, suggesting that generative pretraining supports the learning of a wide variety of task-relevant functionality.

Key takeaways from GPT:

  1. Language Modeling objective is key to learning transferable representations.
  2. Transformer architecture(its inductive bias) is much more effective than LSTMs(or other recurrent networks) to assist this transfer.
  3. GPT shows hints of zero-shot abilities(without any supervised finetuning).

2. Language Models are Unsupervised Multitask Learners — GPT-2

Paper: GPT-2

Now that language models (LMs) exhibit steady zero-shot performance (not by a large margin, but promising enough), the researchers at OpenAI continued to double down on this trend by studying what happens when a sufficiently large language model is trained on large-scale web data.

Problem:

How can we enable language models to learn a wide range of NLP tasks without explicit supervision?

Approach:

Since the problem is to optimize a language modeling objective but then use the trained LM to perform many other tasks, the key question now is: how do we enable the pretrained LM to perform these tasks?

In other words, how do we make these language models understand whether the given task is summarization vs. question answering vs. language translation? Task conditioning is often implemented at an architectural level or at an algorithmic level, but the authors of GPT-2 make use of language itself to specify the task.

Language provides a flexible way to specify tasks, inputs, and outputs all as a sequence of symbols.

For example, a translation training example can be written as the sequence (translate to french, english text, french text). Likewise, a reading comprehension training example can be written as (answer the question, document, question, answer).

The above is a key aspect that provides LMs the flexibility to perform different tasks as part of the input prompt itself.

Training dataset:

Motivation: to create a text dataset large enough to help the language model learn a wide range of natural language tasks from naturally occurring (text) demonstrations.

The authors discuss the CommonCrawl dataset, a source of nearly unlimited and diverse text scraped from the Internet. They also note that even though these archives are many orders of magnitude larger than current language modeling datasets, they have significant data quality issues.

Note: The original GPT was trained on the BooksCorpus dataset.

Owing to CommonCrawl's data quality issues, the authors only scrape web pages which have been curated/filtered by humans. Since manually filtering a full web scrape would be exceptionally expensive, as a starting point they scraped all outbound links from Reddit that received at least 3 karma. This can be thought of as a heuristic indicator of whether other users found the link interesting, educational, or just funny.

The resulting dataset, WebText, contains the text subset of these 45 million links; after deduplication and heuristic-based cleaning, it comes to slightly over 8 million documents for a total of 40 GB of text.

As stated in the paper, the authors have also removed all Wikipedia documents from WebText since it is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks.

The model used is a Transformer, and Byte Pair Encoding (BPE) is used as the input representation.
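As a toy illustration of what a single BPE merge step does (GPT-2 actually runs BPE over raw bytes with a learned merge table, so this word-level version is only a sketch):

```python
# One byte-pair-encoding merge step: find the most frequent adjacent symbol
# pair in the vocabulary and merge it into a new single symbol.
from collections import Counter

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, vocab):
    spaced, joined = " ".join(pair), "".join(pair)
    return {word.replace(spaced, joined): freq for word, freq in vocab.items()}

vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
pair = most_frequent_pair(vocab)   # the most frequent adjacent symbol pair
vocab = merge_pair(pair, vocab)    # the merged pair now behaves as one symbol
print(pair, vocab)
```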

The authors trained 4 different models, ranging from 117M, 345M and 762M up to 1542M (1.5B) parameters.

Language Modeling on WebText and Zero-Shot Transfer on LM benchmarks:

The authors evaluate GPT-2 on a wide range of LM datasets and show that it transfers well to 7 out of the 8 datasets in a zero-shot setting, even surpassing the SOTA.

GPT-2 Zero Shot Results on LM Benchmarks

A natural question that may arise: GPT-2 is a large language model trained on a huge web corpus, so what is really significant about its zero-shot transfer to the language modeling task?

This makes complete sense, but the next section shows some really interesting results indicating that these LMs can also generalize to tasks they were not trained for.

Zero shot Transfer on other NLP tasks:

The authors now evaluate GPT-2 on the following unseen tasks:

  1. Children’s Book Test — on nouns and named entities.
  2. LAMBADA dataset — for sentence completion which models long range dependencies.
  3. Winograd Schema Challenge — for common sense reasoning
  4. Reading Comprehension
  5. Summarization
  6. Language Translation
  7. Question Answering

GPT-2 surpassed the supervised SOTA baselines on (1) and (3), and notably matched or exceeded 3 out of 4 baselines on reading comprehension without using any labeled data, while the results on the rest of the tasks are mixed.

GPT-2 Zero Shot Transfer to other NLP tasks

But how are these tasks performed with the help of a language model? What does the input-output formulation look like here?

Well, that is a valid question, and the answer/recipe differs from task to task.

For text summarization,

To induce summarization behavior in GPT-2, they add the text TL;DR: after the article.

For language translation,

To help it infer that it is a translation task, they condition the language model on a context of example pairs of the format english sentence = french sentence and then after a final prompt of english sentence = they sample from the model with greedy decoding and use the first generated sentence as the translation.

For question answering,

Similar to translation, the context of the language model is seeded with example question answer pairs which helps the model infer the short answer style of the dataset.
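Put together, these recipes are nothing more than different ways of writing the prompt. A rough sketch is below (the exact formatting is my own assumption for illustration, paraphrased from the paper's descriptions):

```python
def summarization_prompt(article):
    # appending "TL;DR:" after the article induces summarization behaviour
    return f"{article}\nTL;DR:"

def translation_prompt(example_pairs, english_sentence):
    # example_pairs: list of (english, french) demonstrations placed in the context
    context = "\n".join(f"{en} = {fr}" for en, fr in example_pairs)
    return f"{context}\n{english_sentence} ="

def qa_prompt(example_pairs, question):
    # example_pairs: list of (question, answer) demonstrations placed in the context
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in example_pairs)
    return f"{context}\nQ: {question}\nA:"
```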

Hints of Generalization:

  1. In translation, they show that French to English translation performance is better than previous unsupervised baselines.
  2. In question answering, GPT-2 answered some of the questions that never appeared in the WebText dataset.

Generalization vs Memorization:

This section addresses the critics who attribute the results to training data contamination, i.e. who say that the apparent generalization comes from LMs being evaluated on data they were trained on.

Since GPT-2 is trained on a web-scale dataset (WebText in this case), there is a high chance of overlap between the train and test data.

In order to analyze the overlap between WebText and the evaluation datasets, the authors use Bloom filters to calculate the percentage of 8-grams from a given eval dataset that also appear in the WebText training set.
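A toy version of that overlap check might look like the following (exact set membership and whitespace tokenization instead of the Bloom filters the paper uses for scale):

```python
def word_ngrams(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_percentage(train_text, eval_text, n=8):
    # percentage of the eval text's 8-grams that also occur in the training text
    train_set = word_ngrams(train_text, n)
    eval_set = word_ngrams(eval_text, n)
    if not eval_set:
        return 0.0
    return 100.0 * len(eval_set & train_set) / len(eval_set)
```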

The table below shows the overlap of WebText with the evaluation datasets, and of the evaluation datasets with their own respective train sets.

WebText Train overlap percentage with other evaluation datasets.

The authors re-evaluate the LMs on the eval datasets with the overlapping data excluded from the test set, and observe only a minimal drop in accuracy/perplexity across the evaluations.

Overall, their analysis suggests that data overlap between WebText training data and specific evaluation datasets provides a small but consistent benefit to reported results. However, for most datasets we do not notice significantly larger overlaps than those already existing between standard training and test sets.

Another way to test the generalization vs. memorization hypothesis is to test the LMs on a held-out WebText test set. The graph below clearly shows that train and test set perplexities are very similar and both improve as the model scale increases.

GPT-2 performance on Webtext train and test set

The graph also shows that GPT-2 still underfits the WebText data.

Key takeaways from GPT-2:

If one remembers, the major takeaway from GPT-1 was that LMs show signs of zero-shot performance through effective transferable representations. Continuing on the same trend, we see the move from a language model (LM) to a large language model (LLM):

When a large language model (LLM) is trained on a sufficiently large and diverse dataset, it is able to perform well across a wide range of tasks in a zero-shot setting, which suggests that high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising amount of tasks without the need for explicit supervision.

3. Language Models are Few-Shot Learners — GPT-3

Paper: GPT-3

1. Motivation:

Continuing from GPT-2, there have been two promising directions on the path to building general-purpose language systems.

Firstly, GPT-2 is able to perform a wide range of tasks given just natural language instructions, e.g. (translate to french, english text, french text). This is what the authors of GPT-3 call "in-context learning", where a language model performs various tasks given a directive and natural language demonstrations.

In-context learning that was first introduced in GPT-2

Secondly, there is a promising trend where scaling the model parameters and the dataset has brought improvements in various NLP tasks. More specifically, adoption of Transformers for all tasks(instead of task specific architectures) has driven a general solution.

One major point the authors make is that humans do not require lots of examples to solve a particular task; a brief instruction in natural language with a few examples usually suffices.

Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale.

This serves as the motivation for GPT-3.

2. Approach

The authors train a large language model (LLM), GPT-3, in a similar way to GPT-2, except that GPT-3 is roughly 10x larger (175B parameters) and is trained on internet-scale data (CommonCrawl, which is much larger than WebText).

The objective here is to use this pretrained LLM for downstream tasks without any finetuning as such.

And how exactly can the LLM be used for downstream tasks effectively with competitive or SOTA accuracy?

As mentioned in the previous section, GPT's in-context learning in combination with a much larger scale can prove effective here; the answer will become clear shortly :)

3. Training Dataset

In GPT-2, we saw that WebText was used instead of CommonCrawl because CommonCrawl, despite its huge size, is prone to data quality issues. In GPT-3, the authors make use of a filtered version of CommonCrawl with additional cleaning and deduplication, together with several high-quality curated text datasets.

Datasets used to train GPT-3

One major concern with using such huge internet-scale data, however, is data contamination: there can be a high overlap between test and train data, which would undermine any claim of OOD generalization by LLMs.

And in order to have a fair evaluation of these models, the authors develop heuristics to find train-test overlap and remove the overlapping examples from the test set during evaluation.

The GPT-3 model variants range from the smallest at ~125M parameters all the way up to 175B, roughly 10x more than any large language model at that time.

GPT-3 — List of variants.

Few Shot Learning Abilities of GPT-3:

The hypothesis behind training GPT-3 was that the in-context learning ability of an LLM might improve drastically as the model and data are scaled up (as per neural scaling laws).

This proved to be right, as the authors show some remarkable performance in few-shot, one-shot and sometimes zero-shot settings (all without any gradient updates or finetuning).

What makes GPT-3 so good at few shot learning though?

As seen in GPT-2, the model was able to understand the task specification from natural language alone. Doubling down on this aspect (coupled with model/data scale), GPT-3 is able to gain enough information from the context provided within the input prompt, i.e. the task specification along with a few example demonstrations, to answer questions correctly.
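Concretely, a few-shot prompt is just the task description, K demonstrations and the final query packed into a single input. The demonstrations below are the translation examples from the GPT-3 paper; the exact formatting is my own:

```python
# In-context learning: the "learning" happens only through what is placed in
# the prompt; there are no gradient updates at inference time.
def few_shot_prompt(task_description, demonstrations, query):
    lines = [task_description]
    for example_input, example_output in demonstrations:
        lines.append(f"{example_input} => {example_output}")
    lines.append(f"{query} =>")
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("plush girafe", "girafe peluche")],
    "cheese",
)
print(prompt)
```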

GPT-3 results on various tasks and datasets:

Overall, the authors report competitive performance across a wide range of NLP tasks, right from translation, question answering and common-sense reasoning to reading comprehension and many more.

Below is good evidence of why increasing model scale (and data) can yield remarkable accuracy, sometimes beating finetuned SOTA.

GPT-3 Performance on TriviaQA

There are obviously some datasets and tasks on which GPT-3 still lags behind by a lot, but with the help of in-context learning and few-shot demonstrations, the model performs decently compared to its unsupervised counterparts.

Test data contamination — is GPT-3 only memorizing?

We saw that in GPT-2, data contamination did not play a significant role in inflating the results on held-out evaluation data.

However, it need not be the same for GPT-3, since the model is trained on large-scale internet data along with high-quality curated data.

Now, in order to check whether the model is being evaluated on data it was trained on, the authors follow a similar process to GPT-2: they measure the amount of train-test overlap, remove the overlapping data, and re-evaluate the models on the clean subset.

The graph below shows the change in accuracy for all eval datasets; for most datasets the change in performance tends towards zero, indicating that test data contamination may not have a serious effect on the evaluation.

GPT-3 Performance Change when evaluated on a clean subset without train-test overlap

As stated in the paper, an important limitation of this contamination analysis is that it is not certain that the clean subset is drawn from the same distribution as the original dataset. It remains possible that memorization inflates results but is precisely counteracted by some statistical bias making the clean subset easier. However, the sheer number of shifts close to zero suggests this is unlikely, and the authors also observed no noticeable difference in the shifts for small models, which are unlikely to be memorizing.

One more piece of evidence the authors of GPT-3 provide to show that these LLMs are not simply memorizing the training data is the validation loss: the model is still underfitting (and so is unlikely to be overfitting).

Key takeaways from GPT-3

Scaling language models proves to be more and more effective, and as a result of in-context learning within these LLMs, a new path to few-shot learning is unlocked, where natural language along with a few example demonstrations helps LLMs solve novel tasks.

4. Training language models to follow instructions with human feedback — InstructGPT

Paper: InstructGPT/GPT3.5

Generative Pretrained Transformers have shown tremendous progress in text generation, whether by serving as an initialization for task-specific finetuning or through few-shot learning by just providing natural language instructions, i.e. "prompting".

But there is one fundamental problem with LLMs:

LLMs are generally not good at following user instructions; they can generate biased or toxic text and make up facts (hallucinate). This is because these models are trained to predict the next token on a webpage from the internet, which is a different objective from "follow the user's instructions helpfully and safely".

Hence, these language models are misaligned.

Problem:

How do we align these large language models? In other words, how do we make them follow user instructions and provide answers that are relevant to the user's conversation?

Solution:

A straightforward way to make LLMs follow instructions and provide useful answers is to finetune them on a human-labeled dataset of natural language instructions.

In simple terms,

A question or prompt is the input to the LLM, and the answer is the output that the LLM is supposed to generate.
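In code, one common way to set this up (an assumed format for illustration, not OpenAI's pipeline) is to concatenate the prompt and the human-written answer into a single sequence and compute the LM loss only on the answer tokens:

```python
def build_sft_example(tokenizer, prompt, answer):
    # `tokenizer` is assumed to be a Hugging Face style tokenizer.
    # -100 is the conventional "ignore" label for PyTorch cross-entropy,
    # so prompt tokens do not contribute to the loss.
    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(answer) + [tokenizer.eos_token_id]
    input_ids = prompt_ids + answer_ids
    labels = [-100] * len(prompt_ids) + answer_ids
    return {"input_ids": input_ids, "labels": labels}
```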

The authors extend this idea and propose the following approach using reinforcement learning from human feedback:

  1. construct a natural language instruction dataset with the help of expert human labelers and finetune GPT-3 on this dataset.
  2. next, collect the finetuned model's responses to OpenAI API prompts, have humans compare and rank outputs for the same prompt (yielding a comparison dataset over various prompts), and train a reward model (RM) to predict the human-preferred output (see the sketch below).
  3. finally, use this RM as the reward function and finetune the supervised baseline from (1) to maximize this reward using the PPO algorithm, i.e. Reinforcement Learning from Human Feedback (RLHF).
Steps involved in aligning LLMs
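Here is a minimal sketch of the reward-model objective from step (2), assuming a `reward_model` that maps a (prompt, response) pair to a scalar score (illustrative only, not OpenAI's code):

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, preferred, rejected):
    # Pairwise comparison loss: push the score of the human-preferred response
    # above the score of the rejected one.
    r_preferred = reward_model(prompt, preferred)   # scalar tensor
    r_rejected = reward_model(prompt, rejected)     # scalar tensor
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```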

Note: I won’t be diving deep into the InstructGPT paper(especially RLHF) and the evaluation of these models — I will try to provide an overview and intuitions that can help in grasping the concept.

Training Dataset:

The prompt dataset consists primarily of text prompts submitted to the OpenAI API, specifically those using an earlier version of the InstructGPT models (trained via supervised learning on a subset of demonstration data) on the Playground interface.

The dataset consists of prompts submitted through the API as well as prompts written by human labelers, along with their respective labeled responses.

InstructGPT vs GPT-3:

When evaluated on a wide range of NLP tasks, InstructGPT produces much better results than GPT-3 in terms of human preference and instruction following, and it hallucinates less, making it better aligned.

An instruction-finetuned LLM (i.e. InstructGPT) also seems to show promising generalization to instructions outside of the RLHF finetuning distribution.

Key takeaway from InstructGPT:

Large language models, when finetuned on human-crafted natural language instructions with the help of RLHF to output preferred responses, follow user instructions better, produce less harmful content, and are much more truthful.

Summary — Evolution of GPTs:

  1. GPT-1: An unsupervised pretrained language model (LM) using Transformers serves as a better initialization for supervised finetuning on downstream NLP tasks.
  2. GPT-2: Scaling these language models and training them on large-scale raw text, along with specially crafted instructions to specify various tasks, helps them learn a wide range of NLP tasks via in-context learning.
  3. GPT-3: Increasing the parameters of these LMs even further shows that LLMs are capable of performing a given task with just a few examples shown, involving no finetuning at all.
  4. InstructGPT: In order to make the responses of GPTs more useful and truthful and to follow the user's intent during conversations, finetuning these LLMs under a reinforcement learning regime (RLHF) makes them more aligned and produces human-preferred responses.

My take on LLMs:

The aspects that really stand out for me are:

  1. A good initialization point from large-scale unsupervised or self-supervised pretraining is very important for learning generalizable representations, be it in NLP or Computer Vision.
  2. Although LMs are trained to learn a probability distribution over the set of all tokens, they are still able to solve a variety of tasks beyond language modeling, such as question answering, comprehension and translation, despite not being explicitly trained for such tasks (with input-output pairs). And this clever usage of LLMs to solve different NLP tasks is a pretty cool thing, IMO!
  3. In-context learning (ICL) in GPTs shows that there is more to these LLMs than meets the eye. Although some argue that LLMs are just doing approximate retrieval and require the data to be part of their training set, some of these few-shot or zero-shot abilities are truly fascinating and make us wonder whether they are really extrapolating or just interpolating in the end.

That aside, based on the evolution of LLMs, one thing is very clear,

that scaling is the way to go. Scaling both models and data seems to yield good returns, but the question is: how far can we go by scaling alone?

Making use of these LLMs for real-time applications is also going to be a challenging task.

Also, applying them to Computer Vision tasks seems like overkill, as the task formulation is turned into a question answering one. But ICL could be a differentiator here.

Conclusion: There are definitely signs of OOD generalization in GPTs, but one has to understand that this property emerges out of learning patterns from large-scale data. Especially given that there is not much visibility into how GPT-4 was trained, be it the algorithm or the data, it is safe to say that both the scale of data and the supervised finetuning (with RLHF) could play a major role in these AGI-level capabilities.

Nevertheless, LLMs have advanced the SOTA on various NLP tasks, despite several people calling the scaling approach brute force. It will be interesting to see how the industry applies these LLMs to real-world problems and what awaits us in the realm of LLMs :)
