- BERT sequences consist of one or two “sentences”
- Sentences are also called “segments”
- Sentences are typically labeled as “segment A” and “segment B”
- Two-sentence sequences are separated by a special [SEP] token
- Every sequence starts with a classification token ([CLS])
- Since the classification token encodes information about the entire sequence, its output representation can be used as a sequence embedding (see the sketch below)
- Uses WordPiece tokenizer for sub-word tokenization
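As a concrete illustration, here is a minimal sketch assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (neither is prescribed by these notes). It shows the [CLS]/[SEP] layout and the WordPiece sub-word splits, and extracts the [CLS] hidden state as a sequence embedding:

```python
# Minimal sketch using Hugging Face `transformers`
# (an assumed toolkit; these notes do not prescribe one).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Encode a two-sentence ("segment A" / "segment B") sequence.
enc = tokenizer("The embeddings are summed.",
                "Tokenization uses WordPiece.",
                return_tensors="pt")

# [CLS] opens the sequence, [SEP] closes each sentence, and rare words are
# split into "##"-prefixed WordPiece units. Exact splits depend on the
# vocabulary, but the output looks roughly like:
# ['[CLS]', 'the', 'em', '##bed', '##ding', '##s', 'are', 'summed', '.',
#  '[SEP]', 'token', '##ization', 'uses', 'word', '##piece', '.', '[SEP]']
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()))

# The hidden state at position 0 belongs to [CLS] and can serve as an
# embedding of the whole sequence.
with torch.no_grad():
    out = model(**enc)
cls_embedding = out.last_hidden_state[:, 0, :]  # shape: (1, 768)
```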
- Three separate embeddings are learned and then summed element-wise (see the sketch after this list):
  - A segment embedding
    - One each for the first and second “sentence”
    - Sequences with only one “sentence” have just one segment embedding
  - A position embedding
    - Depends only on the token’s position in the sequence
  - A token embedding
    - Depends only on the token’s identity (its index in the vocabulary)
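A minimal sketch of this element-wise sum in PyTorch; the layer sizes are the standard bert-base values but should be read as illustrative assumptions, and the real model additionally applies LayerNorm and dropout after the sum (omitted here):

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sketch of BERT's input layer: token + position + segment embeddings."""
    def __init__(self, vocab_size=30522, max_len=512, num_segments=2, dim=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, dim)      # depends only on the token
        self.position = nn.Embedding(max_len, dim)      # depends only on the position
        self.segment = nn.Embedding(num_segments, dim)  # segment A = 0, segment B = 1

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # Element-wise sum of the three learned embeddings.
        return (self.token(token_ids)
                + self.position(positions)   # broadcasts over the batch
                + self.segment(segment_ids))

emb = BertInputEmbeddings()
token_ids = torch.tensor([[101, 7592, 102]])  # [CLS] hello [SEP] (assumed IDs)
segment_ids = torch.zeros_like(token_ids)     # one sentence: all segment A
print(emb(token_ids, segment_ids).shape)      # torch.Size([1, 3, 768])
```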
- The classification token [CLS] gets these three parts as well
- For two-sentence sequences, [CLS] is usually given the segment embedding for the first sentence (“segment A”)
- For one-sentence sequences, everything (including [CLS]) gets the segment A embedding (see the sketch below)
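This segment assignment shows up directly in the `token_type_ids` that a tokenizer emits; again assuming the Hugging Face tokenizer:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Two sentences: [CLS] and segment A get ID 0, segment B gets ID 1.
pair = tokenizer("First sentence.", "Second sentence.")
print(pair["token_type_ids"])    # [0, 0, 0, 0, 0, 1, 1, 1, 1]

# One sentence: everything, including [CLS], gets the segment A ID (0).
single = tokenizer("Just one sentence.")
print(single["token_type_ids"])  # [0, 0, 0, 0, 0, 0]
```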