At first blush, it seems surprising that language models would need a special token representing the start of a sequence. After all, it’s obvious to an observer that the sequence is starting if and only if it doesn’t have any tokens yet. In fact, such a token serves at least two jobs that are nearly universal:
- By encoding it in [[N-gram|n-grams]] (or longer-range positional correlations), it provides an anchor for positional information about how certain words are used, even in the absence of explicit positional encoding (see the bigram sketch after this list).
- For encoder-decoder models, it represents the initial hidden state of the decoder.
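To make the first point concrete, here is a minimal sketch of a toy bigram model (the corpus, the `<s>` token string, and the `bigram_counts` helper are all hypothetical, not tied to any particular library): once the start token is prepended, "which words begin a sequence" becomes an ordinary bigram statistic, with no positional machinery required.

```python
from collections import Counter, defaultdict

BOS = "<s>"  # hypothetical start-of-sequence token

def bigram_counts(sentences):
    """Count bigrams, prepending BOS so that "starts a sequence"
    is captured as an ordinary bigram rather than a special case."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = [BOS] + sentence.split()
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return counts

corpus = ["the cat sat", "the dog ran", "cats sleep"]
counts = bigram_counts(corpus)

# The distribution over first words is just the row for BOS:
print(counts[BOS])  # Counter({'the': 2, 'cats': 1})
```

The row for `BOS` is exactly the positional anchor described above: it records which tokens tend to appear at the start, without any explicit positional encoding.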
In addition, the start-of-sequence token serves at least two situational (but very common) roles:
- An empty input sequence is represented by just the start token, so autoregressive models can handle empty input without a hard-coded special case.
- Certain models may encode multiple sequences in a single input (for example, in few-shot prompting). The start-of-sequence token enables the model to distinguish between the examples in the prompt, as in the sketch below.
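Both situational roles are easy to see with a toy encoder (a sketch only; the vocabulary, the ids, and the `encode` helper are made up, and real tokenizers differ in whether and where they insert the start token):

```python
BOS_ID = 1  # hypothetical id for the start-of-sequence token
VOCAB = {"<s>": 1, "Q:": 2, "2+2?": 3, "4": 4, "3+5?": 5, "8": 6}

def encode(text):
    """Toy encoder: prepend the start token, then map whitespace-split tokens to ids."""
    return [BOS_ID] + [VOCAB[t] for t in text.split()]

# An empty prompt is still a valid, non-empty model input:
print(encode(""))  # [1]

# Few-shot examples packed into one input, each opened by the start token,
# so the model can tell where one example ends and the next begins:
examples = ["Q: 2+2? 4", "Q: 3+5? 8"]
packed = [tok for ex in examples for tok in encode(ex)]
print(packed)  # [1, 2, 3, 4, 1, 2, 5, 6]
```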