Recall that language models make predictions based on a sequence of terms from a vocabulary. Often the prediction is another sequence of terms from a (possibly different) vocabulary.

In this case, we can train such a model by summing the cross-entropy loss between each one-hot encoded element $\mathbf{y}_t$ in the ground-truth output sequence and the predicted probability distribution $\hat{\mathbf{y}}_t$:

$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{i=1}^{|V|} y_{t,i} \log \hat{y}_{t,i}$$

where

  • $\mathcal{L}$ is the loss function;
  • $T$ is the length of the ground-truth sequence $\mathbf{y}$; and
  • $|V|$ is the size of the vocabulary $V$.

Notice, though, that $y_{t,i} = 0$ for all $i$ except for the one ground-truth label. So in practice, this can be simplified to

$$\mathcal{L} = -\sum_{t=1}^{T} \log \hat{y}_{t,\,c_t}$$

where $c_t$ is the index of the $t$-th ground-truth label $\mathbf{y}_t$.
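As a concrete sketch (NumPy, with a made-up toy distribution over a five-word vocabulary), the full double sum and the simplified index-picking form give the same value:

```python
import numpy as np

# Toy example: T = 3 ground-truth tokens over a vocabulary of size |V| = 5.
# y_true holds the index c_t of the correct token at each position t.
y_true = np.array([2, 0, 4])

# y_hat holds the model's predicted distribution over the vocabulary
# at each position (rows sum to 1).
y_hat = np.array([
    [0.05, 0.10, 0.70, 0.10, 0.05],
    [0.60, 0.10, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.10, 0.70],
])

# Full double sum over positions and vocabulary entries,
# using one-hot ground-truth vectors.
one_hot = np.eye(y_hat.shape[1])[y_true]
loss_full = -np.sum(one_hot * np.log(y_hat))

# Simplified form: only the log-probability of the true index survives.
loss_simple = -np.sum(np.log(y_hat[np.arange(len(y_true)), y_true]))

assert np.isclose(loss_full, loss_simple)
print(loss_full)  # ≈ 1.224
```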

Dealing with length mismatches

In practice, the predicted sequence $\hat{\mathbf{y}}$ is unlikely to exactly match the length of the ground-truth sequence $\mathbf{y}$, especially as $T$ becomes large. This is typically handled through padding, where the shorter of the two is filled in with a special <PAD> token. The <PAD> tokens are ignored for the purpose of loss. Hence, by default, early termination would result only in the loss associated with mispredicting the <STOP> token. This could lead to inappropriately low loss in some situations, especially if the sequence is terminated very early.
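As an illustrative sketch of excluding <PAD> positions from the loss (assuming a PyTorch-style setup and a hypothetical PAD_IDX for the padding token), one common approach is the ignore_index argument of cross_entropy:

```python
import torch
import torch.nn.functional as F

PAD_IDX = 0  # hypothetical vocabulary index reserved for the <PAD> token

def sequence_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (batch, T, |V|) unnormalized scores; targets: (batch, T) token
    # indices, with positions past the true sequence end filled with PAD_IDX.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch * T, |V|)
        targets.reshape(-1),                  # flatten to (batch * T,)
        ignore_index=PAD_IDX,                 # <PAD> positions contribute no loss
        reduction="sum",                      # sum over positions, as in the formula above
    )
```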

In practice, this is often not a big problem, since minimizing the loss for all tokens also minimizes it for the <STOP> token. For LLMs in particular, alignment tuning will usually remedy this problem while also solving many others. In less involved training environments, one could introduce an ad-hoc penalty relating the prediction length to the ground truth length.

Why does this work?

At first blush, it seems surprising that simply calculating next-token loss would be enough to perform initial training of even the most intelligent large language models. But as far back as 1948, Claude Shannon had observed that sampling from the distribution of English $n$-grams produces intelligible English text up to $n$ words out.
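To make Shannon's idea concrete, here is a minimal sketch (pure Python, with a made-up toy corpus) of sampling from an empirical bigram distribution; Shannon worked from frequency tables of actual English text:

```python
import random
from collections import defaultdict

# Hypothetical toy corpus standing in for a large body of English text.
corpus = "the detective gathered the clues and the detective revealed the name".split()

# Empirical bigram table: for each word, the words observed to follow it.
bigrams = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev].append(nxt)

# Sample a short sequence by repeatedly drawing the next word from the
# distribution conditioned on the previous word.
word = random.choice(corpus)
sample = [word]
for _ in range(8):
    if word not in bigrams:
        break
    word = random.choice(bigrams[word])
    sample.append(word)

print(" ".join(sample))
```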

When you consider that transformers in particular use increasingly abstract features as they progress through their layers, “next-word prediction” is actually quite semantically rich. To quote Ilya Sutskever during a “fireside chat”:

Suppose you read a detective novel. It [has] a complicated plot, a storyline, different characters, lots of events, mysteries [and] clues. It’s unclear. Then, let’s say that at the last page of the book, the detective has gathered all the clues, gathered all the people, and [says] “Okay, I’m going to reveal the identity of whoever committed the crime. And that person’s name is…”