Generative Pre-trained Transformer 3 (GPT-3)

#gpt3 #deeplearning #transformers #ai #nlp

Harsh Goyal Nov 18 2020 · 8 min read
Share this


Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive language model that uses deep learning to produce human-like text. It is the third-generation language prediction model in the GPT-n series created by OpenAI, a San Francisco-based artificial intelligence research laboratory. GPT-3's full version has a capacity of 175 billion machine learning parameters. GPT-3, which was introduced in May 2020, and is in beta testing as of July 2020, is part of a trend in natural language processing (NLP) systems of pre-trained language representations. Before the release of GPT-3, the largest language model was Microsoft's Turing NLG, introduced in February 2020, with a capacity of 17 billion parameters or less than 10 percent compared to GPT-3.

The quality of the text generated by GPT-3 is so high that it is difficult to distinguish from that written by a human, which has both benefits and risks. Thirty-one OpenAI researchers and engineers presented the original May 28, 2020 paper introducing GPT-3. In their paper, they warned of GPT-3's potential dangers and called for research to mitigate risk. David Chalmers, an Australian philosopher, described GPT-3 as "one of the most interesting and important AI systems ever produced."

Microsoft announced on September 22, 2020 that it had licensed "exclusive" use of GPT-3; others can still use the public API to receive output, but only Microsoft has control of the source code.

GPT-3 is a very large language model. Given some input text, it can probabilistically determine what tokens from a known vocabulary will come next. Before we go ahead and see what makes GPT-3 so special, lets first understand what is a language model? 

What are Language Models?

Simply put, language models are statistical tools to predict the next word(s) in a sequence. In other words, language models are probability distribution over a sequence of words. Language models have many applications like:

  • Part of Speech (PoS) Tagging
  • Machine Translation
  • Text Classification
  • Speech Recognition
  • Information Retrieval
  • News Article Generation
  • Question Answering, etc.
  • A popular encoding method used in NLP is Word2Vec which was developed in 2014. The real boost to language models came in 2019 with the arrival of the “transformer”.

    What Makes OpenAI GPT-3 Different?

    The first thing that GPT-3 overwhelms with is its sheer size of trainable parameters which is 10x more than any previous model out there. 

    In general, the more parameters a model has, the more data is required to train the model. As per the creators, the OpenAI GPT-3 model has been trained about 45 TB text data from multiple sources which include Wikipedia and books. The multiple datasets used to train the model are shown below:

    Common Crawl corpus contains petabytes of data collected over 8 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts with light filtering. 

    WebText2 is the text of web pages from all outbound Reddit links from posts with 3+ upvotes. 

    Books1 & Books2 are two internet-based books corpora. 

    Wikipedia pages in the English language are also part of the training corpus. 

    The third column in the table “Weight in training mix” refers to the fraction of examples during training that is drawn from a given dataset. 

    A major issue in training models and in particular such large training models with so much data from the internet is that these models have the capacity to memorise the content and then contaminate downstream tasks like testing as they might have already seen the data. Though the creators of GPT-3 took some measures to avoid the training and test data overlaps but a bug in the filtering caused some of the data to leak. As mentioned in the paper, the team could not retrain the model due to the high cost associated with the training. 

    OpenAI GPT-3 Architecture

    The GPT-3 is not one single model but a family of models. Each model in the family has a different number of trainable parameters. The following table shows each model, architecture  and its corresponding parameters:

    Sizes, architectures, and learning hyper-parameters of the GPT-3 models

    In fact, the OpenAI GPT-3 family of models is based on the same transformer-based architecture of the GPT-2 model including the modified initialisation, pre-normalisation, reverse tokenisation, with the exception that it uses alternating dense and sparse attention patterns. 

    The largest version GPT-3 175B or “GPT-3” has 175 B Parameters, 96 attention layers and 3.2 M batch size. 

    Original Transformer Architecture

     Shown in the figure above is the original transformer architecture. As mentioned before, OpenAI GPT-3 is based on a similar architecture, just that it is quite larger. While language models like BERT use the Encoder to generate embeddings from the raw text which can be used in other machine learning applications, the GPT family use the Decoder half, so they take in embeddings and produce text.   

    Accuracy / Performance / Numbers of OpenAI GPT-3

    The various tasks that any language model can perform depend on how it is fine-tuned/updated. With GPT-3 many of the NLP tasks discussed earlier can be done without any fine-tuning, gradient or parameter updates which makes this model Task-Agnostic. So OpenAI GPT-3 can perform tasks with very few or no examples/demonstration (or shots as they are better known). Before we dive into the numbers lets first understand the concept of Zero/One/Few shot tasks with respect to the model and see how one can interact with the model using a few examples. 

    Above figure shows the three settings in which GPT-3 can perform the task of translating from English to French. 

    The Few-shot (FS) setting is kind of similar to how we go about training a machine learning model where we give some inputs and corresponding outputs to a model and then expect the model to perform on an unseen input. However, the difference here is that unlike a normal ML algorithm, the model does not do any weight updates. It just infers on the basis of the “shots” that it has been fed. One typically feeds in between 10-100 shots for one such setting (as per the paper). 

    One-Shot (1S) setting is the same as FS except that only one example/demo/context is fed to the model in addition to the last context(which is the task). 

    Zero-Shot (0S) is when there is no context allowed except for the last (which is the task). This kind of setting is “unfairly hard” as it could be difficult for even humans to understand what the task is with no example or demonstration. 

    The above image shows the accuracy of the OpenAI GPT-3 model while performing the Zero-shot, One-shot and Few-shots tasks along with the number of parameters and shots for a simple task to remove random symbols from a word. Now, let’s have a look at how the models (175B params to 125M params) perform at some well known (benchmarked) tasks. All results cited (quoted) from the paper. 

    OpenAI GPT-3: Closed Book Question Answering

    This task tests the ability of OpenAI GPT-3 to answer questions about broad factual knowledge. GPT-3 was tested on three different QA datasets. The results for the same are shown in the table below:

    The figure above shows GPT-3’s performance on the TriviaQA dataset. It can be observed how the performance grows with size and how 1S and FS settings beat 0S and match + exceed the SOTA on the task.

    OpenAI GPT-3: Language Translation

    Although GPT-3’s training data comprised of > 90% English text it did include some foreign language text. Following graph (taken from the paper) summarises the performance of GPT-3 on the language translation task. 

    GPT-3 translation performance in FS setting on 6 language pairs

    “For the three input languages studied, GPT-3 significantly outperforms prior unsupervised NMT work when translating into English but underperforms when translating in the other direction.”

    Winograd-Style Tasks

    The Winograd Schemas Challenge involves determining which word a pronoun refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human. 

    “On Winograd GPT-3 achieves 88.3%, 89.7%, and 88.6% in the zero-shot, one-shot, and few-shot settings respectively, showing no clear in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human performance.”

    Common Sense Reasoning

    Three datasets were considered for this task. The first dataset PhysicalQA (PIQA) asks common sense questions about how the physical world works and is intended as a probe of grounded understanding of the world. “GPT-3 achieves 81.0% accuracy zero-shot, 80.5% accuracy one-shot, and 82.8% accuracy few-shot (the last measured on PIQA’s test server). This compares favourably to the 79.4% accuracy prior to the state-of-the-art of a fine-tuned RoBERTa.”

    There are few more results mentioned in the paper for tasks like reading comprehension, SuperGLUE, NLI, synthetic and qualitative tasks (arithmetic, word scrambling and manipulation, SAT analogies, News article generation, learning and using novel words, correcting English grammar). Let’s pick up the most interesting task of News Article Generation. 

    News Article Generation

    The release of GPT-2’s largest model was briefly on hold due to the controversy of it being capable of generating fake news. GPT-3 model was able to generate news articles that are practically indistinguishable from the real ones. One of the experiments showed that for the 175B model, humans were able to distinguish fake articles with only 52% accuracy. 

    Here are some of the examples of the fake news article generated by GPT-3 along with the accuracy that the human participants achieved in being able to distinguish it. 

    Article for which human’s had lowest accuracy in identifying (accuracy 12%), Source: paper
    Article generated by GPT-3 that was most easy for human participants to identify (accuracy 61%), Source: paper

    The plot below shows the human ability to detect model generated fake news articles. 

    It can be observed from the plot above that the ability to distinguish fake article decreases as the model size increases. 

    How Can We Get Our Hands on the Model?

    You can’t simply download the model or train it on your own even if you have the infrastructure. OpenAI has built an API which is accessible through a waiting list. You can visit their site and join the waiting list. In fact, you can go to the demo section of and try out some demos yourself to get a fair idea of how some of the use-cases work. 

    Demo section at

    If you select the Q&A task and click on “See cached response” button, you will get the following result:

    So if you were to do a task like the one shown above, you would need to write a code similar to the one shown below:
    As you can observe in the code snippet above, the API is provided with 5 contexts and the last Q is the task that the model needs to complete. It needs to predict what words will follow ‘A:’.

    Limitations of OpenAI GPT-3

    The creators of GPT-3 themselves accept that the model has its weaknesses and does commit silly mistakes. In particular, it does not perform well on text synthesis tasks like repetitions, contradictions, coherence loss over long passages, etc. However, this is not too different from other language models. The architecture also introduces a fundamental limitation on the model. The GPT-3 model is an autoregressive language model and not a bidirectional one (like BERT). So GPT-3 is more suited for tasks which are “in-context” learning-based and not the ones which depend on “fine-tuning”. 

    Shown below are the accuracy results of GPT-3 models on arithmetic tasks. It can be seen how smaller models perform poorly on simple tasks of even single-digit or double-digit arithmetic and accuracy on 4-digit (and above) arithmetic is low. 

    GPT-3 results on Arithmetic tasks with FS setting, Source: paper


    To summarise:

  • GPT-3 is a very large language model (the largest till date) with about 175B parameters.
  • It is trained on about 45TB of text data from different datasets.
  • As such the model itself has no knowledge, it is just good at predicting the next word(s) in the sequence. It is not designed to store or retrieve facts. 
  • It produces more fluent and human-like text outputs.
  • You don’t need task-specific datasets to accomplish a task using GPT-3. It is “Task-Agnostic”.
  • You cannot download or retrain the model. You need an API key (can get by joining the waitlist). It has “closed-API” access.
  • It is good mostly for English language tasks.
  • Longer outputs from the model tend to degrade.
  • The outputs can be biased and abusive. 
  • There are known contaminations in the benchmark experiments which have been called out clearly in paper. 
  • Even with the API still in the closed-beta state and a long waiting list, the AI community is quite excited about the potential and power of the model and how artificial general intelligence (AGI) is evolving. However, if we are to learn from the issues associated with GPT-2 we need to be more careful and responsible with what we create using this model. 


    This 72-page research paper describes in great detail the features, capabilities, performance and limitations of the model.   

    Read next