Language models represent the input sequence (and often the output sequence) as indices into a vector of known tokens called a vocabulary. Often, there are important positional concepts that are not captured as explicit words, such as the end of the sequence. In these cases, the concept is encoded as a special token in the vocabulary. Examples of common special tokens include:
- <PAD>: indicates a meaning-free token that was added only to facilitate computation.
- <START> or <BOS>: indicates the beginning of a sequence.
- <STOP> or <EOS>: indicates the end of a sequence.
- <UNK>: indicates an unknown token.
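To make the role of these tokens concrete, here is a minimal sketch (the class and method names are illustrative, not a specific library's API) of a vocabulary that reserves indices for the special tokens above, maps out-of-vocabulary words to <UNK>, and pads sequences to a fixed length:

```python
PAD, BOS, EOS, UNK = "<PAD>", "<BOS>", "<EOS>", "<UNK>"

class Vocabulary:
    def __init__(self, words):
        # Special tokens occupy the first indices, followed by the known words.
        self.tokens = [PAD, BOS, EOS, UNK] + list(words)
        self.index = {tok: i for i, tok in enumerate(self.tokens)}

    def encode(self, words, max_len):
        # <BOS>/<EOS> mark the sequence boundaries, unknown words map
        # to <UNK>, and <PAD> fills the sequence out to max_len so that
        # sequences of different lengths can be batched together.
        ids = [self.index[BOS]]
        ids += [self.index.get(w, self.index[UNK]) for w in words]
        ids.append(self.index[EOS])
        ids += [self.index[PAD]] * (max_len - len(ids))
        return ids

    def decode(self, ids):
        return [self.tokens[i] for i in ids]

vocab = Vocabulary(["the", "cat", "sat"])
ids = vocab.encode(["the", "dog", "sat"], max_len=8)
# "dog" is not in the vocabulary, so it becomes <UNK>:
# ids decodes to <BOS> the <UNK> sat <EOS> <PAD> <PAD> <PAD>
```

Reserving the special tokens at the start of the index space is a common convention; in particular, giving <PAD> index 0 makes padded positions easy to mask out during loss computation.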