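Next-word prediction is a task in natural language processing: given the preceding tokens of a sequence, predict the token that follows. While it has gained newfound prominence due to its role in generative pre-training, it dates back to Claude Shannon’s seminal papers on information theory. It is defined in terms of a likelihood
$$L(\theta) = \prod_{i} P(x_i \mid x_{i-k}, \ldots, x_{i-1}; \theta)$$
where each token $x_i$ is drawn from $V$, the set of all tokens; $k$ is the context length; and $\theta$ is a parameter vector.
This can be expressed as a log loss
$$\mathcal{L}(\theta) = -\sum_{i} \log P(x_i \mid x_{i-k}, \ldots, x_{i-1}; \theta)$$
Note that this representation is independent of the model architecture that is used to predict the probability of $x_i$.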
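To make the architecture independence concrete, the following is a minimal sketch of the log loss as a sum of negative log-probabilities, one per token, each conditioned on a bounded window of preceding tokens. The names `log_loss`, `uniform_model`, and `repeat_biased_model` are hypothetical and chosen for illustration only. The sketch also assumes that the first few tokens are scored with whatever shorter context is available, a convention the definition above does not specify.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical interface: a "model" is any callable that maps a context
# (the preceding tokens) to a probability distribution over the vocabulary.
Model = Callable[[Sequence[str]], Dict[str, float]]


def log_loss(tokens: Sequence[str], model: Model, context_length: int) -> float:
    """Negative log-likelihood of `tokens` under `model`.

    Each token is scored given at most `context_length` preceding tokens.
    Assumption: tokens near the start use a truncated (possibly empty) context.
    """
    total = 0.0
    for i, token in enumerate(tokens):
        context = tuple(tokens[max(0, i - context_length):i])
        probs = model(context)
        total -= math.log(probs[token])
    return total


def uniform_model(vocab: Sequence[str]) -> Model:
    """Toy model that ignores the context and spreads mass evenly."""
    p = 1.0 / len(vocab)

    def predict(context: Sequence[str]) -> Dict[str, float]:
        return {t: p for t in vocab}

    return predict


def repeat_biased_model(vocab: Sequence[str], p_repeat: float = 0.5) -> Model:
    """Toy model that favours repeating the most recent token."""

    def predict(context: Sequence[str]) -> Dict[str, float]:
        if not context:
            return {t: 1.0 / len(vocab) for t in vocab}
        last = context[-1]
        rest = (1.0 - p_repeat) / (len(vocab) - 1)
        return {t: (p_repeat if t == last else rest) for t in vocab}

    return predict


vocab = ["a", "b", "c", "d"]
tokens = ["a", "b", "a", "c"]

# The same loss function scores two unrelated toy models.
loss_uniform = log_loss(tokens, uniform_model(vocab), context_length=2)
loss_repeat = log_loss(tokens, repeat_biased_model(vocab), context_length=2)
```

The loss function only ever calls the model to obtain a distribution over tokens, so a count-based table, a recurrent network, or a transformer could be substituted without changing `log_loss` itself.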