• BERT sequences consist of one or two “sentences”
    • Sentences are also called “segments”
    • Sentences are typically labeled as “segment A” and “segment B”
    • In two-sentence sequences, the sentences are separated by a special [SEP] token (and every sequence also ends with a final [SEP])
  • Every sequence starts with a classification token ([CLS])
    • Because its final hidden state aggregates information about the entire sequence, the output at [CLS] can be used as a sequence embedding (see the sketch below)
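
As a concrete illustration (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which the notes above prescribe), a two-sentence sequence is laid out like this:

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")

# Two-"sentence" sequence: [CLS] segment A [SEP] segment B [SEP]
enc = tok("the cat sat", "it was tired")
print(tok.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'was', 'tired', '[SEP]']
```
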
  • Uses a WordPiece tokenizer for sub-word tokenization, illustrated below
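
Under the same bert-base-uncased assumption, a quick sketch of WordPiece splitting; the exact pieces depend on the checkpoint's vocabulary:

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")

# Words missing from the vocabulary are split into known sub-word
# pieces; continuation pieces are marked with a "##" prefix.
print(tok.tokenize("embeddings"))
# ['em', '##bed', '##ding', '##s']
```
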
  • Three separate embeddings are learned and then summed element-wise (see the sketch after this list):
    • A segment embedding
      • One each for the first and second “sentence”
      • Sequences with only one “sentence” have just one segment embedding
    • A position embedding
      • Depends only on the token’s position in the sequence
    • A token embedding
      • Depends only on the token’s identity, i.e. its index in the WordPiece vocabulary
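
A minimal PyTorch sketch of the element-wise sum; the dimensions match bert-base, the ids below are toy values, and a real implementation would load pretrained weights and apply LayerNorm and dropout after the sum:

```python
import torch
import torch.nn as nn

vocab_size, max_len, num_segments, hidden = 30522, 512, 2, 768

token_emb    = nn.Embedding(vocab_size, hidden)    # depends only on the token id
position_emb = nn.Embedding(max_len, hidden)       # depends only on the position
segment_emb  = nn.Embedding(num_segments, hidden)  # segment A = 0, segment B = 1

input_ids      = torch.tensor([[101, 1996, 4937, 102, 2009, 102]])  # toy ids
token_type_ids = torch.tensor([[0,   0,    0,    0,   1,    1]])    # A A A A B B
positions      = torch.arange(input_ids.size(1)).unsqueeze(0)       # 0, 1, ..., 5

# Element-wise sum of the three embeddings forms the input representation
x = token_emb(input_ids) + position_emb(positions) + segment_emb(token_type_ids)
print(x.shape)  # torch.Size([1, 6, 768])
```
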
  • The classification token [CLS] receives these three embeddings as well
    • For two-sentence sequences, [CLS] is given the segment embedding of the first sentence (“segment A”)
    • For one-sentence sequences, every token (including [CLS]) gets the segment A embedding, as the example below shows
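
To make the segment assignment concrete (again assuming Hugging Face transformers), the token_type_ids field holds the segment ids, and position 0, the [CLS] slot, is segment A in both cases:

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")

one = tok("the cat sat")                  # [CLS] the cat sat [SEP]
two = tok("the cat sat", "it was tired")  # [CLS] ... [SEP] ... [SEP]

# Segment id 0 = segment A, 1 = segment B; [CLS] gets 0 in both cases
print(one["token_type_ids"])  # [0, 0, 0, 0, 0]
print(two["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```
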