Recall that language models make predictions based on a sequence of terms from a vocabulary. Often the prediction is another sequence of terms from a (possibly different) vocabulary.
In this case, we can train such a model by summing the cross-entropy loss between each one-hot encoded element of the ground-truth sequence and the corresponding predicted distribution over the vocabulary:

$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{v=1}^{|V|} y_{t,v} \log \hat{y}_{t,v}$$

where $\mathcal{L}$ is the loss function; $T$ is the length of the ground-truth sequence $y$; $y_{t,v}$ is the one-hot encoding of the ground-truth token at position $t$; $\hat{y}_{t,v}$ is the predicted probability of token $v$ at position $t$; and $|V|$ is the size of the vocabulary $V$.
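To make the notation concrete, here is a minimal sketch of this loss, assuming the model has already produced a probability distribution over the vocabulary at each position. The function name and toy numbers are illustrative, not taken from any particular framework:

```python
import numpy as np

def sequence_cross_entropy(probs: np.ndarray, targets: np.ndarray) -> float:
    """Sum the per-position cross-entropy over a single sequence.

    probs:   (T, |V|) predicted probability distributions, one row per position.
    targets: (T,) ground-truth token indices (each row's one-hot encoding).
    """
    T = targets.shape[0]
    # Dotting each one-hot row with the prediction just picks out the
    # probability assigned to the correct token at that position.
    correct_token_probs = probs[np.arange(T), targets]
    return float(-np.sum(np.log(correct_token_probs)))

# Toy example: vocabulary of size 4, ground-truth sequence of length 3.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.6, 0.1, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
targets = np.array([0, 1, 3])
print(sequence_cross_entropy(probs, targets))  # ~1.22
```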
Notice, though, that the predicted sequence will not, in general, have the same length as the ground-truth sequence.
Dealing with length mismatches
In practice, a predicted sequence that is shorter than the ground truth is padded out to the same length with a <PAD> token. The <PAD> tokens are ignored for the purpose of loss. Hence, by default, early termination would result only in the loss associated with mispredicting the <STOP> token. This could lead to inappropriately low loss in some situations, especially if the sequence is terminated very early.
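For instance, in a PyTorch-style training step (one possible setup, not necessarily the one assumed here), this masking is commonly implemented with the `ignore_index` argument to the cross-entropy function. In this sketch the pads appear in the target sequence, as in teacher-forced training, and the token indices are made up:

```python
import torch
import torch.nn.functional as F

PAD = 0  # hypothetical index of the <PAD> token in the vocabulary

# logits: (T, |V|) unnormalized scores from the model; targets: (T,) token
# indices, where the trailing positions have been filled with PAD.
logits = torch.randn(6, 1000)
targets = torch.tensor([42, 7, 99, 3, PAD, PAD])

# Positions whose target is PAD contribute nothing to the loss.
loss = F.cross_entropy(logits, targets, ignore_index=PAD, reduction="sum")
print(loss.item())
```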
In practice, this is often not a big problem, since minimizing the loss for all tokens also minimizes it for the <STOP> token. For LLMs in particular, alignment tuning will usually remedy this problem while also solving many others. In less involved training environments, one could introduce an ad hoc penalty relating the predicted length to the ground-truth length.
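One sketch of such a penalty follows; the linear form and the `weight` hyperparameter are assumptions made for illustration, not something the text prescribes:

```python
def length_penalty(predicted_len: int, true_len: int, weight: float = 0.1) -> float:
    # One possible form: penalize the absolute length mismatch, scaled by a
    # hypothetical hyperparameter `weight`.
    return weight * abs(predicted_len - true_len)

# Toy usage: the model emitted <STOP> after 3 tokens, but the ground truth had 12.
token_loss = 0.4  # loss accumulated on the tokens actually predicted
total_loss = token_loss + length_penalty(predicted_len=3, true_len=12)
print(total_loss)  # 0.4 + 0.1 * 9 = 1.3
```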
Why does this work?
At first blush, it seems surprising that simply calculating next-token loss would be enough to perform initial training of even the most intelligent large language models. But as far back as 1948, Claude Shannon had observed that sampling from the distribution of English, conditioned on progressively more of the preceding text, produces output that reads more and more like genuine English.
When you consider that transformers in particular use increasingly abstract features as they progress through their layers, “next-word prediction” is actually quite semantically rich. To quote Ilya Sutskever during a “fireside chat”:
Suppose you read a detective novel. It [has] a complicated plot, a storyline, different characters, lots of events, mysteries [and] clues. It’s unclear. Then, let’s say that at the last page of the book, the detective has gathered all the clues, gathered all the people, and [says] “Okay, I’m going to reveal the identity of whoever committed the crime. And that person’s name is…”