Summary
Introduced the generative pre-trained model (GPT), a variant of the decoder of the Vaswani, et al. transformer model. See linked pages for details.
Tasks from OmniFocus
- Radford 2018 @parallel(true) @autodone(false) General idea: unsupervised pretraining + supervised task-specific fine-tuning.
- Generative pre-training @parallel(false) @autodone(false)
- Discriminative fine-tuning @parallel(false) @autodone(false)
- Task-specific fine-tuning in Radford 2018 @parallel(false) @autodone(false)
- Multiple choice question answering NLP tasks @parallel(false) @autodone(false) Tasks involve a context document, a question, and a set of possible answers. Create separate training examples with binary labels for each possible answer. Make separate inferences for each answer, then softmax across all answers to obtain a prediction.
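A minimal sketch of this inference step, assuming a hypothetical `score(context, question, answer)` callable that returns the fine-tuned model's scalar logit for one concatenated input:

```python
import torch

def predict_answer(score, context, question, answers):
    # One forward pass per candidate answer, each formatted as its own input.
    logits = torch.tensor([score(context, question, a) for a in answers])
    # Softmax across the candidate answers (not across classes) to get a
    # distribution over the answer set; the argmax is the predicted answer.
    probs = torch.softmax(logits, dim=0)
    return int(probs.argmax()), probs
```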
- Similarity NLP tasks @parallel(false) @autodone(false) Concatenate the two sentences for comparison, separated by a delimiter token. Since order is not meaningful, include both orderings in the training corpus (see the formatting sketch after the entailment item).
- Entailment NLP tasks @parallel(false) @autodone(false) Prepare training examples by concatenating the premise and hypothesis, separated by a special delimiter token.
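A sketch of the delimiter-based input construction for the entailment and similarity formats described above, following the notes' description; the special-token strings are placeholders, not the actual vocabulary entries used in Radford 2018:

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"  # placeholder special tokens

def entailment_example(premise, hypothesis, label):
    # Single ordering: premise, delimiter, hypothesis.
    return [(f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}", label)]

def similarity_examples(sentence_a, sentence_b, label):
    # Order is not meaningful, so emit both orderings as training examples.
    return [
        (f"{START} {sentence_a} {DELIM} {sentence_b} {EXTRACT}", label),
        (f"{START} {sentence_b} {DELIM} {sentence_a} {EXTRACT}", label),
    ]
```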
- Classification NLP tasks @parallel(false) @autodone(false)
- Unsupervised pre-training @parallel(false) @autodone(false) Ordinarily, a neural network’s parameters are initialized randomly or to zero. In unsupervised pretraining, the model is trained such that it provides a good starting point for task-specific fine-tuning.
This is a form of transfer learning. However, unlike feature extraction transfer learning, the foundation model’s layers are not frozen; rather, the derived model continues to train from a starting point that encodes generally relevant priors about its broader domain.
Erhan, et al. (2010) observes that pretraining behaves like a form of regularization, improving generalization of neural networks.
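A toy PyTorch sketch of the contrast, with a hypothetical backbone and task head; the point is only that fine-tuning continues to update every parameter rather than freezing the pretrained layers:

```python
import torch
import torch.nn as nn

# Hypothetical pretrained backbone (weights would come from a checkpoint) and a
# freshly initialized task-specific head.
backbone = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
head = nn.Linear(128, 3)

# Feature-extraction transfer learning would freeze the backbone:
# for p in backbone.parameters():
#     p.requires_grad = False

# Fine-tuning in the Radford 2018 sense keeps everything trainable; the
# pretrained weights simply serve as the starting point for SGD.
model = nn.Sequential(backbone, head)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```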
- Likelihood @parallel(false) @autodone(false) The likelihood $L(\theta)$ of a parameter set $\theta$ is the probability $p(x \mid \theta)$ of observing the data $x$ given those parameters.
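Written out for i.i.d. observations $x_1, \dots, x_N$ (standard notation, not specific to Radford 2018):

$$L(\theta) = p(x_1, \dots, x_N \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$$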
- Negative log likelihood (NLL) loss @parallel(false) @autodone(false) Recall that the likelihood of the true label $y$ is the probability of $y$ given the parameters $\theta$ and the data $x$, i.e., $p(y \mid x; \theta)$. The negative log likelihood is exactly what it sounds like: $-\log p(y \mid x; \theta)$. It’s used when your objective is to explicitly maximize the likelihood of the true label. For single-label classification, NLL is equivalent to cross-entropy loss. Hence, any NLL loss can be expressed as a CE loss. However, CE loss cannot be expressed as an NLL in cases where the true distribution is not a one-hot vector.
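A quick PyTorch check of the single-label equivalence (PyTorch's `cross_entropy` takes raw logits, while `nll_loss` expects log-probabilities):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)            # batch of 4 examples, 10 classes
targets = torch.tensor([1, 0, 7, 3])   # true class indices (one-hot true distributions)

ce = F.cross_entropy(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
assert torch.allclose(ce, nll)  # identical for single-label classification
```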
- Radford unsupervised objective: “causal language modeling” @parallel(false) @autodone(false) Negative log likelihood of the data within a specified context window, given the parameters of the model. This unidirectional prediction task is called “causal language modeling.”
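Concretely, for a token corpus $\mathcal{U} = \{u_1, \dots, u_n\}$, context window size $k$, and model parameters $\Theta$, Radford 2018 write this objective as:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$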
- Liu et al. 2018 @parallel(false) @autodone(false) “Generating Wikipedia by Summarizing Long Sequences” introduced the decoder-only transformer stack, which was then used for GPT-1 (Radford 2018).
- Auxiliary objective @parallel(false) @autodone(false) An auxiliary objective is a separate objective function that is included (via a weighted sum) in the loss function used during SGD. It allows the model to take other goals into consideration.
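In code this is just a weighted sum of scalar losses before the backward pass (placeholder values, hypothetical weighting `lam`):

```python
import torch

task_loss = torch.tensor(0.7)  # e.g., CE loss for the fine-tuning task (placeholder)
aux_loss = torch.tensor(2.3)   # e.g., language-modeling NLL (placeholder)
lam = 0.5                      # weight on the auxiliary objective

total_loss = task_loss + lam * aux_loss  # single scalar used for backprop during SGD
```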
- Radford supervised objectives (incl. secondary language modeling objective) @parallel(false) @autodone(false) See discussion with ChatGPT entitled “Objective function clarification”:
https://chatgpt.com/share/1a55cd8d-4e63-4163-89d6-8c839d8c040e
In short, they look to maximize the likelihood of the prediction from the linear layer (i.e., CE loss) with an auxiliary goal of language modeling (i.e., the sum of NLL over all $k$-length subsequences).
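In the paper’s notation, with labeled corpus $\mathcal{C}$, supervised objective $L_2$, language-modeling objective $L_1$, and weight $\lambda$:

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$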
- Decoder-only transformer @parallel(false) @autodone(false) Encoder-decoder models encode a structural prior that the source and target sequences are semantically equivalent but differently encoded. That is, they are optimized for translation.
Liu, et al. (2018) recognized that many language tasks, such as next-sentence prediction or question answering, do not conform to this assumption. To this end, they removed the encoder stack, resulting in a decoder-only transformer. (Note that, although Radford 2018 was published after Liu 2018, the former cites the latter as the source of the idea.)
The decoder-only transformer has no cross-attention mechanism; as such, it is essentially an alternating series of self-attention and positionwise feedforward modules. The self-attention may be bidirectional (as in BERT); or, by using masks, it may be unidirectional.
Decoder-only transformers using masked self-attention encode a structural prior that the $i$-th token is caused by the preceding tokens. Applications that conform to this prior belief are collectively called “causal language modeling.”
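A minimal sketch of the unidirectional masking, using the common additive-mask convention (0 where attention is allowed, $-\infty$ where a position would attend to the future):

```python
import torch

def causal_mask(seq_len):
    # -inf strictly above the diagonal blocks attention to future tokens;
    # zeros elsewhere leave attention to current and earlier tokens untouched.
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

scores = torch.randn(5, 5)                                 # raw self-attention scores
weights = torch.softmax(scores + causal_mask(5), dim=-1)   # row i attends only to positions <= i
```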
- Special tokens @parallel(false) @autodone(false) Add note that these are often randomly initialized, such as in Radford 2018.
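A sketch of that random initialization, with placeholder sizes:

```python
import torch
import torch.nn as nn

vocab_size, d_model, n_special = 40000, 768, 3   # placeholder sizes
embedding = nn.Embedding(vocab_size + n_special, d_model)

# Ordinary-token rows would be loaded from the pretrained checkpoint; the rows
# appended for the new special tokens are simply given a fresh random
# initialization and learned during fine-tuning.
with torch.no_grad():
    embedding.weight[-n_special:].normal_(mean=0.0, std=0.02)
```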