Embeddings of longer natural language passages can be very useful for efficiently establishing the similarity of text documents. Given this use case, the most common approach as of this writing is to employ transformer models that have been fine-tuned with a contrastive objective, such that documents known to be similar lie closer together in the latent space than documents that are not.
A prominent example of such a model is Sentence-BERT (SBERT). Note, however, that the classification ([CLS]) token makes BERT itself capable of producing a kind of “sentence” embedding, and it is often used as such.
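To make that concrete, here is a minimal sketch of transformer-based document similarity using the sentence-transformers library; the checkpoint name and example texts are my own illustrative choices, not anything prescribed above.

```python
from sentence_transformers import SentenceTransformer, util

# Any contrastively fine-tuned sentence-embedding checkpoint works here;
# "all-MiniLM-L6-v2" is simply a small, commonly used example.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The central bank raised interest rates to curb inflation.",
    "Rate hikes were announced to fight rising prices.",
    "My cat refuses to eat anything but tuna.",
]

# Encode each document into a fixed-length vector.
embeddings = model.encode(docs, convert_to_tensor=True)

# Pairwise cosine similarities; the first two documents should score
# much higher with each other than with the third.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```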
Prior to the advent of transformers, it was more common to use static document embeddings from models like doc2vec (also known as paragraph2vec, or the Paragraph Vector model). These models generalize the skip-gram and CBOW approaches of word2vec to produce analogous embeddings for a longer passage. They essentially made it possible to use a much larger context window than word2vec’s CBOW, and hence to embed somewhat richer semantics. Nevertheless, each resulting embedding was still based on the content of a single document.
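For reference, a rough sketch of training paragraph vectors with gensim’s Doc2Vec (gensim 4.x API assumed; the toy corpus and hyperparameters are purely illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "a fast auburn fox leapt over a sleepy hound",
    "stock prices fell sharply after the earnings report",
]

# Each document gets a tag; its vector is learned jointly with the word vectors.
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=50, window=5, min_count=1, epochs=40)

# Infer an embedding for an unseen passage from that passage's content alone.
vec = model.infer_vector("the sly fox evaded the hounds".split())
print(vec.shape)  # (50,)
```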
By comparison, GloVe produces word embeddings based on word co-occurrence statistics aggregated across the whole corpus. As such, GloVe word embeddings captured somewhat different features than doc2vec (though training across the corpus would capture some of the same information). GloVe was also often faster at inference time, and generally required less training data, than doc2vec. Hence it was not uncommon, especially in resource-constrained environments, to compute GloVe embeddings for all of the words in a document and then take their TF-IDF-weighted average as the document embedding.
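A sketch of that recipe, assuming we already have a word-to-vector lookup; the tiny random dictionary below is a stand-in for a real GloVe file such as glove.6B.300d.txt:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the fox jumps over the lazy dog",
    "the dog sleeps in the warm sun",
]

# Stand-in word vectors (4-dimensional here; real GloVe vectors are 50-300d).
rng = np.random.default_rng(0)
vocab_words = "the fox jumps over lazy dog sleeps in warm sun".split()
glove = {w: rng.normal(size=4) for w in vocab_words}
dim = 4

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)            # shape: (n_docs, n_terms)
terms = vectorizer.get_feature_names_out()

doc_embeddings = np.zeros((len(docs), dim))
for i in range(len(docs)):
    weights = tfidf[i].toarray().ravel()
    total = 0.0
    for j, word in enumerate(terms):
        if weights[j] > 0 and word in glove:
            doc_embeddings[i] += weights[j] * glove[word]
            total += weights[j]
    if total > 0:
        doc_embeddings[i] /= total                # TF-IDF-weighted average

print(doc_embeddings.shape)  # (2, 4)
```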
There are a number of other approaches to creating document embeddings using neural networks, such as autoencoders and RNNs, but their contemporary use (outside of research) has become niche since the rise of transformers.
Technically, latent Dirichlet allocation (LDA) also produces document embeddings, in the form of per-document topic distributions, though I didn’t think of them as such back then. That said, I certainly used it for comparing documents via cosine similarity, which is something I’d do today with a state-of-the-art BERT or SBERT encoding.
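For completeness, a sketch of that LDA-plus-cosine-similarity workflow using scikit-learn; the corpus and the number of topics are illustrative:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the central bank raised interest rates to curb inflation",
    "the federal reserve announced another rate hike",
    "my cat refuses to eat anything but tuna",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row is a per-document topic distribution -- the "embedding".
topic_dists = lda.fit_transform(counts)

# Pairwise cosine similarities over topic distributions, just as one
# would compute today over dense transformer embeddings.
print(cosine_similarity(topic_dists))
```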