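Next-word prediction is a task in natural language processing in which a model estimates the probability of each token given the tokens that precede it. While it has gained newfound prominence due to its role in generative pre-training, it dates back to Claude Shannon’s seminal papers on information theory. It is defined in terms of the likelihood of a corpus $x_1, \dots, x_n$ with $x_i \in V$:

$$L(\theta) = \prod_{i=1}^{n} P(x_i \mid x_{i-k}, \dots, x_{i-1}; \theta)$$

where

  • $V$ is the set of all tokens;
  • $k$ is the context length; and
  • $\theta$ is a parameter vector.

Taking the negative logarithm expresses this as a log loss:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{n} \log P(x_i \mid x_{i-k}, \dots, x_{i-1}; \theta)$$

Note that this representation is independent of the model architecture that is used to predict the probability of $x_i$.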
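
As a rough illustration of that architecture independence, the log loss can be sketched in Python with the model passed in as an arbitrary function. All names here (`next_token_log_loss`, `prob`, `uniform_prob`, `VOCAB`) are hypothetical and chosen for this example only. The sketch also assumes that tokens near the start of the corpus are simply conditioned on a shorter context, which is one possible convention among several.

```python
import math
from typing import Callable, Sequence, Tuple


def next_token_log_loss(
    corpus: Sequence[str],
    k: int,
    prob: Callable[[str, Tuple[str, ...]], float],
) -> float:
    """Sum of negative log-probabilities of each token given up to k previous tokens.

    `prob(token, context)` stands in for any model; it must return a value in (0, 1].
    """
    total = 0.0
    for i, token in enumerate(corpus):
        # Assumption: early positions use a truncated context instead of padding.
        context = tuple(corpus[max(0, i - k):i])
        total -= math.log(prob(token, context))
    return total


# Toy vocabulary and a toy "model" that ignores its context entirely.
VOCAB = ("a", "b", "c", "d")


def uniform_prob(token: str, context: Tuple[str, ...]) -> float:
    return 1.0 / len(VOCAB)


corpus = ["a", "b", "a", "c", "d"]
loss = next_token_log_loss(corpus, k=2, prob=uniform_prob)
print(loss)
```

Because the uniform model assigns probability 1/4 to every token, the loss for this five-token corpus is five times the natural logarithm of 4. Swapping `uniform_prob` for an n-gram table or a neural network changes only the function passed in, not the loss computation.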