At first blush, it seems surprising that language models would need a special token representing the start of a sequence. After all, it’s obvious to an observer that the sequence is starting if and only if it doesn’t have any tokens yet. In fact, such a token serves at least two jobs that are nearly universal:

  1. When it is encoded in [[N-gram|n-grams]] (or longer-range positional correlations), it provides an anchor for positional information about how certain words are used, even in the absence of explicit positional encoding (see the bigram sketch after this list).
  2. For encoder-decoder models, it is fed to the decoder as its first input, seeding generation before any output tokens exist.
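
To make the first point concrete, the sketch below counts bigrams over a toy corpus. The `<s>` marker and the example sentences are placeholders chosen only for illustration; the point is that once every sentence is prefixed with the start token, the bigrams beginning with it record which words occur in first position, which is positional information recovered from co-occurrence alone.

```python
from collections import Counter

BOS = "<s>"  # placeholder start-of-sequence marker

# Toy corpus, purely for illustration.
corpus = [
    "the cat sat",
    "the dog ran",
    "a cat ran",
]

bigrams = Counter()
for sentence in corpus:
    tokens = [BOS] + sentence.split()
    bigrams.update(zip(tokens, tokens[1:]))

# Bigrams whose first element is BOS carry purely positional information:
# they count which words appear at the very start of a sequence.
starters = {word: count for (prev, word), count in bigrams.items() if prev == BOS}
print(starters)  # {'the': 2, 'a': 1}
```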

In addition, the start-of-sequence token serves at least two situational (but very common) roles:

  1. An empty input sequence is represented by just the start token, which lets autoregressive models process empty input without hard-coding a special case.
  2. Certain models encode multiple sequences in a single input (for example, in few-shot prompting). A start-of-sequence token in front of each example lets the model tell where one example ends and the next begins (both roles are sketched in the code below).
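
Both situational roles fit in one small sketch. The snippet below assumes nothing about any particular model or tokenizer: `BOS_ID`, `EOS_ID`, `next_token`, and the example token ids are hypothetical placeholders standing in for a real vocabulary and a real model.

```python
from typing import Callable, List

BOS_ID = 0  # hypothetical id for the start-of-sequence token
EOS_ID = 1  # hypothetical id for the end-of-sequence token

def generate(prompt_ids: List[int],
             next_token: Callable[[List[int]], int],
             max_new_tokens: int = 8) -> List[int]:
    # Role 1: an empty prompt becomes just [BOS_ID], so the autoregressive
    # loop below needs no special case for "no input yet".
    ids = [BOS_ID] + prompt_ids
    for _ in range(max_new_tokens):
        tok = next_token(ids)
        ids.append(tok)
        if tok == EOS_ID:
            break
    return ids

def pack_few_shot(examples: List[List[int]]) -> List[int]:
    # Role 2: several examples packed into one input, with a fresh
    # start token marking where each example begins.
    packed: List[int] = []
    for example in examples:
        packed.append(BOS_ID)
        packed.extend(example)
    return packed

# Dummy "model" that always predicts EOS, just to show the loop runs.
print(generate([], next_token=lambda ids: EOS_ID))  # [0, 1]
print(pack_few_shot([[5, 6], [7, 8, 9]]))           # [0, 5, 6, 0, 7, 8, 9]
```

Conventions differ between tokenizers (some prepend the start token automatically, others expect the caller to do it), but the shape of the code stays the same: the start token is the one piece of input that is always present.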