SBERT (“Sentence BERT”) is a variant of BERT that has been fine-tuned to generate “sentence” (document) embeddings. It is optimized for tasks that require directly comparing documents, such as retrieval.
The original BERT does provide a single embedding, the [[Classification token (CLS)|[CLS] embedding]], that represents the entire sequence; however, it also produces an embedding for every other token in the sequence. When comparing sentences, prior approaches generally fell into two categories:
- The fast way is to somehow combine the token embeddings into a pooled embedding, and then compare these pooled embeddings. This is fast, but (relatively) low-fidelity (see the sketch after this list). Methods include:
    - Mean token pooling (used in the SBERT paper)
    - Using just the `[CLS]` token
- The slow way is to perform a full token-wise comparison using cross-attention. BERT provides a way to do this directly: concatenate the two sentences, separated by the `[SEP]` token, and pass the pair through the model. It is also possible to obtain separate embeddings, concatenate them, and pass them to a task head.
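For concreteness, here is a minimal sketch of the fast way, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (neither is specified above; both are illustrative choices). Token embeddings are mean-pooled under the attention mask and compared with cosine similarity:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; any BERT-style encoder works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def mean_pooled_embeddings(sentences: list[str]) -> torch.Tensor:
    """Encode sentences and mean-pool their token embeddings (the fast way)."""
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state  # (batch, seq, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1).float()     # ignore padding positions
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

u, v = mean_pooled_embeddings(["A man is playing guitar.", "Someone plays music."])
print(torch.nn.functional.cosine_similarity(u.unsqueeze(0), v.unsqueeze(0)).item())
```

The slow way would instead tokenize the pair jointly, e.g. `tokenizer(sentence_a, sentence_b, return_tensors="pt")`, which inserts the `[SEP]` separator so cross-attention sees both sentences at once.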
As noted, SBERT essentially fine-tunes BERT to be better at “the fast way.” To accomplish this, the authors fine-tuned BERT to classify pairs of sentences as entailment, contradiction, or neither (neutral), training on the SNLI dataset.
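A rough sketch of that classification objective, assuming pooled embeddings `u` and `v` produced by the same BERT encoder (with gradients enabled during training, unlike the inference-only helper above), and using the (u, v, |u - v|) feature concatenation described in the SBERT paper; the layer sizes and label order here are illustrative:

```python
import torch
import torch.nn as nn

HIDDEN = 768      # BERT-base hidden size
NUM_LABELS = 3    # entailment, contradiction, neutral

# Softmax classification head over (u, v, |u - v|), trained jointly with the encoder.
classifier = nn.Linear(3 * HIDDEN, NUM_LABELS)
loss_fn = nn.CrossEntropyLoss()

def nli_loss(u: torch.Tensor, v: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """u, v: pooled sentence embeddings, shape (batch, HIDDEN); labels: shape (batch,)."""
    features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
    return loss_fn(classifier(features), labels)
```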
Sentence BERT can be implemented as a Siamese or triplet network, though in practice this essentially boils down to making multiple passes through a single network in parallel.
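In practice, this Siamese-encoding pattern is what the `sentence-transformers` package from the SBERT authors provides; a minimal usage sketch, assuming that library is installed and using an illustrative checkpoint name:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative checkpoint; any SBERT-style model is used the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["A man is playing guitar.", "Someone plays music."]
# One shared encoder, one pass per sentence (the "Siamese" part).
embeddings = model.encode(sentences, convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]))
```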