Generative pre-training refers to the practice of pre-training a sequence model by having it reconstruct (“generate”) sequences from its training data.
-
The term “generative pre-training” originated with Radford et al. (2018), where it referred specifically to next-word prediction. This is essentially the same task used by GPT-2 and GPT-3; they just use much more data and much larger models. (We don’t know exactly how GPT-4 was trained, and GPT-4o, being natively multimodal, must be doing something beyond text-only next-word prediction.)
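Concretely, the objective is just cross-entropy on shifted copies of the same sequence. Below is a minimal sketch in PyTorch; the `model` here is a placeholder for any autoregressive sequence model, not GPT’s actual code.

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Next-word (next-token) prediction loss.

    tokens: LongTensor of shape (batch, seq_len) holding token ids.
    model:  placeholder for any autoregressive model mapping
            (batch, seq_len) -> logits of shape (batch, seq_len, vocab_size).
    """
    inputs = tokens[:, :-1]    # the model sees positions 0 .. n-2
    targets = tokens[:, 1:]    # and must predict positions 1 .. n-1
    logits = model(inputs)     # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # token-level cross-entropy
        targets.reshape(-1),
    )
```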
-
Although they didn’t use the term, Merity et al. (2017) used the same training objective as GPT. The main difference was the architecture: where Radford et al. used a Transformer, Merity et al. used an LSTM.
-
T5 uses a generative denoising objective called “span corruption,” in which contiguous spans of tokens are dropped from the input, each replaced by a sentinel token, and the model learns to generate the missing spans from the surrounding context.
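Here is a toy version of span corruption over a list of string tokens. The sentinel naming follows T5’s `<extra_id_N>` convention, but the span-sampling below is simplified relative to the paper’s exact procedure.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3):
    """Toy T5-style span corruption over a non-empty list of string tokens."""
    n_to_mask = max(1, int(len(tokens) * corruption_rate))
    masked = set()
    while len(masked) < n_to_mask:
        start = random.randrange(len(tokens))
        length = max(1, round(random.expovariate(1 / mean_span_len)))
        masked.update(range(start, min(start + length, len(tokens))))

    inputs, targets, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if i in masked:
            # one sentinel stands in for the whole contiguous masked span
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and i in masked:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets
```

T5 itself corrupts roughly 15% of tokens with a mean span length of 3, which the defaults above mirror.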
-
iGPT treats images as sequences of pixels, with its task being to predict the next pixel in the sequence. (Note that this is quite different from, e.g., ViT: the original ViT was pre-trained with supervised classification, and its masked-patch-prediction variant is much closer to the MLM objective used by BERT.)
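Under that framing, the loss is the same shift-and-predict cross-entropy as above, just applied to a flattened pixel sequence. The sketch below uses raw per-channel 8-bit intensities as tokens; the actual iGPT instead clusters RGB values into a reduced color palette so each pixel is a single token, and `model` is again a placeholder.

```python
import torch.nn.functional as F

def next_pixel_loss(model, image):
    """image: uint8 tensor of shape (H, W, C); model is a placeholder mapping
    a token sequence to logits over the 256 possible intensity values."""
    seq = image.reshape(-1).long().unsqueeze(0)  # raster-order pixel "tokens", shape (1, H*W*C)
    inputs, targets = seq[:, :-1], seq[:, 1:]    # same shift as next-word prediction
    logits = model(inputs)                       # (1, H*W*C - 1, 256)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```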