BERT introduced the concept of a classification token ([CLS]). The [CLS] token is technically a start-of-sequence token, like those found in many models that take sequences as input. However, the architecture of the BERT model gives it a powerful semantic significance.
Since BERT is a transformer encoder, the first embedding in its output corresponds to its start-of-sequence token, [CLS]. And because BERT’s attention is bidirectional, that token can attend to every position in the sequence.
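As a quick illustration of where [CLS] sits, here is a minimal sketch using the Hugging Face transformers tokenizer; the bert-base-uncased checkpoint and the example sentence are just illustrative choices:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any standard BERT tokenizer behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer("The cat sat on the mat.")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]']
# The tokenizer prepends [CLS], so it occupies position 0 of the sequence
# and therefore position 0 of the encoder's output.
```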
As a result, the [CLS] token can be seen as an embedding of the entire sequence. Compare this to a causal (unidirectional) model like GPT, where each element in the input sequence can attend only to itself and the elements before it. By definition, the start-of-sequence token has no preceding tokens, so it cannot attend to any other position.
By the same reasoning, in a GPT model, an end-of-sequence token (which has no subsequent tokens) could attend to the rest of the sequence, but no other positions in the sequence could attend to it.
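To make the contrast concrete, here is a minimal PyTorch sketch of the two attention-mask patterns; the tensor names and sequence length are illustrative, not taken from any particular implementation:

```python
import torch

seq_len = 5  # illustrative sequence length

# Bidirectional (BERT-style) attention: every position may attend to every
# other position, so the [CLS] position sees the whole sequence.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Causal (GPT-style) attention: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask.long())
# tensor([[1, 0, 0, 0, 0],   <- the start-of-sequence token attends only to itself
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])  <- the last token attends to the whole sequence, but
#                                no earlier position attends to it (last column)
```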
Because of the [CLS] token, BERT can serve as both a word embedding model and a sentence embedding model: its other output positions embed individual tokens in context, while [CLS] embeds the sequence as a whole.
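The following sketch shows one way to use the [CLS] output as a sentence embedding with the Hugging Face transformers library; the helper name cls_embedding and the example sentences are illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def cls_embedding(text: str) -> torch.Tensor:
    """Return the final-layer [CLS] vector as a sentence-level embedding."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Position 0 of the last hidden state is the [CLS] token; the remaining
    # positions are contextual word (token) embeddings.
    return outputs.last_hidden_state[0, 0]

a = cls_embedding("The cat sat on the mat.")
b = cls_embedding("A cat was sitting on a rug.")
print(torch.cosine_similarity(a, b, dim=0).item())
```

In practice, the [CLS] vector of a pretrained model is usually fine-tuned on a downstream task (classification, sentence-pair similarity, and so on) before it becomes a reliable sentence representation.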
The [CLS] token is one of the most closely studied aspects of BERT’s behavior, and a major topic in so-called BERTology.