• BERT sequences consist of one or two “sentences”
    • Sentences are also called “segments”
    • Sentences are typically labeled as “segment A” and “segment B”
    • In two-sentence sequences, the sentences are separated by a special [SEP] token (and every sequence also ends with a final [SEP])
  • Every sequence starts with a classification token ([CLS])
    • Because its final hidden state aggregates information about the entire sequence, the output at [CLS] can be used as a sequence embedding (see the sketch below)
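
As a concrete illustration (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which the notes above prescribe), a two-sentence sequence is laid out like this:

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")

# Two-"sentence" sequence: [CLS] segment A [SEP] segment B [SEP]
enc = tok("the cat sat", "it was tired")
print(tok.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'was', 'tired', '[SEP]']
```
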
  • Uses a WordPiece tokenizer for sub-word tokenization, illustrated below
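
Under the same bert-base-uncased assumption, a quick sketch of WordPiece splitting; the exact pieces depend on the checkpoint's vocabulary:

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")

# Words missing from the vocabulary are split into known sub-word
# pieces; continuation pieces are marked with a "##" prefix.
print(tok.tokenize("embeddings"))
# ['em', '##bed', '##ding', '##s']
```
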
  • Three separate embeddings are learned and then summed element-wise (see the sketch after this list):
    • A segment embedding
      • One each for the first and second “sentence”
      • Sequences with only one “sentence” have just one segment embedding
    • A position embedding
      • Depends only on the token’s position in the sequence
    • A token embedding
      • Depends only on the token’s identity, i.e. its index in the WordPiece vocabulary
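
A minimal PyTorch sketch of the element-wise sum; the dimensions match bert-base, the ids below are toy values, and a real implementation would load pretrained weights and apply LayerNorm and dropout after the sum:

```python
import torch
import torch.nn as nn

vocab_size, max_len, num_segments, hidden = 30522, 512, 2, 768

token_emb    = nn.Embedding(vocab_size, hidden)    # depends only on the token id
position_emb = nn.Embedding(max_len, hidden)       # depends only on the position
segment_emb  = nn.Embedding(num_segments, hidden)  # segment A = 0, segment B = 1

input_ids      = torch.tensor([[101, 1996, 4937, 102, 2009, 102]])  # toy ids
token_type_ids = torch.tensor([[0,   0,    0,    0,   1,    1]])    # A A A A B B
positions      = torch.arange(input_ids.size(1)).unsqueeze(0)       # 0, 1, ..., 5

# Element-wise sum of the three embeddings forms the input representation
x = token_emb(input_ids) + position_emb(positions) + segment_emb(token_type_ids)
print(x.shape)  # torch.Size([1, 6, 768])
```
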
  • The classification token [CLS] receives these three embeddings as well
    • For two-sentence sequences, [CLS] is given the segment embedding of the first sentence (“segment A”)
    • For one-sentence sequences, every token (including [CLS]) gets the segment A embedding, as the example below shows
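
To make the segment assignment concrete (again assuming Hugging Face transformers), the token_type_ids field holds the segment ids, and position 0, the [CLS] slot, is segment A in both cases:

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")

one = tok("the cat sat")                  # [CLS] the cat sat [SEP]
two = tok("the cat sat", "it was tired")  # [CLS] ... [SEP] ... [SEP]

# Segment id 0 = segment A, 1 = segment B; [CLS] gets 0 in both cases
print(one["token_type_ids"])  # [0, 0, 0, 0, 0]
print(two["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```
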