word2vec (usually written in all lowercase) is a shallow neural network model for learning static word embeddings based on their local context. The input and output are both vocabulary-length vectors, with a single hidden embedding layer between them. There are two variants, representing opposite prediction directions (sketched after the list below):

  • Continuous bag of words (CBOW): Take the words to either side of a central word, form an input vector from the sum (or average) of these neighbors' vectors, and predict the central word.
  • Skip-gram: Take a single word and predict each of the words (i.e., the continuous bag of words) within a window around it.
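
As a minimal sketch of the two prediction directions, the following plain-Python snippet (the sentence and window size are arbitrary illustrative choices) builds the training examples each variant would see from the same text:

    # Sketch: how one window of text becomes training examples under each variant.
    # The sentence and window size here are illustrative placeholders.
    tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
    window = 2

    cbow_examples = []       # (context words, target word)
    skipgram_examples = []   # (target word, one context word)

    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # CBOW: the (summed or averaged) context predicts the central word.
        cbow_examples.append((context, center))
        # Skip-gram: the central word predicts each context word in turn.
        for c in context:
            skipgram_examples.append((center, c))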

CBOW vs skip-gram embeddings

In practice, people rarely use word2vec for the objectives on which it is actually trained. Hence the key question is how the embeddings differ between the two variants.

CBOW embeddings

For most real-world applications of word2vec, we are looking to embed single words. Skip-gram's training objective looks a lot more like that task, and it usually outperforms CBOW on offline ML metrics for real-world tasks. CBOW's primary strength is training efficiency: it can be trained on fewer examples over fewer epochs, and each training step takes less time. Hence CBOW is preferable when training resources are limited, especially if the vocabulary is large. Note, though, that CBOW does poorly on rare words, because its averaged input vectors cause it to learn a smoother embedding landscape.
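
For concreteness, a CBOW model might be trained with the gensim library roughly as follows; this is a sketch assuming gensim 4.x, and the toy corpus and hyperparameter values are placeholders rather than recommendations. Here sg=0 selects CBOW, and a higher min_count drops the rare words CBOW handles poorly.

    # Minimal CBOW training sketch using gensim (assumes gensim 4.x).
    from gensim.models import Word2Vec

    corpus = [["the", "quick", "brown", "fox"], ["the", "lazy", "dog"]]  # toy corpus
    model = Word2Vec(
        sentences=corpus,
        vector_size=100,   # embedding dimension
        window=5,          # context window to either side
        sg=0,              # 0 = CBOW, 1 = skip-gram
        min_count=1,       # raise this to drop rare words, which CBOW handles poorly
        epochs=5,
    )
    vector = model.wv["fox"]  # the learned static embedding for a single word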

Because the CBOW model is trained on an entire window of words, it can also be used in a pinch to provide an embedding representing an entire sentence or even paragraph. Note, though, that it is not optimized for this use case, and there are also models in this family designed specifically for this purpose, such as paragraph2vec and document2vec.
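
Continuing the hypothetical gensim model above, a quick-and-dirty sentence embedding can be obtained by averaging the word vectors, in the same spirit as CBOW's averaged input:

    # Rough sentence embedding: average the vectors of in-vocabulary words.
    # This mirrors CBOW's averaged input, but is not what the model was optimized for.
    import numpy as np

    def sentence_embedding(model, words):
        vectors = [model.wv[w] for w in words if w in model.wv]
        if not vectors:
            return np.zeros(model.wv.vector_size)
        return np.mean(vectors, axis=0)

    emb = sentence_embedding(model, ["the", "quick", "brown", "fox"])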

Skip-gram embeddings

The skip-gram model is specifically optimized to embed single words to match their context. Hence it is ideal for single-word embedding tasks, and is where the famous “king - man + woman = queen” outcome arose.
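
Using gensim's KeyedVectors, that analogy can be probed directly via vector arithmetic; the result is only meaningful for a model trained on a large real-world corpus, not the toy examples above:

    # Query the classic analogy: vector("king") - vector("man") + vector("woman").
    result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
    # Typically [("queen", <similarity>)] for a well-trained English model;
    # raises KeyError if a query word is missing from the training vocabulary.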

Training word2vec

As an embedding model, word2vec has an unusually high dimensionality at its input and output layers. While a sparse input layer is unavoidable, it is highly desirable to reduce the number of predictions that must be made at the output layer.

The most common strategy for training word2vec is to use negative sampling. For CBOW, this involves making predictions for the correct word and a small number of sampled incorrect ones; for skip-gram, it involves making predictions for each of the true context words in addition to a few negative samples.
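
To make the skip-gram case concrete, here is a minimal numpy sketch of the negative-sampling loss for a single (center, context) pair with k sampled negatives; the random vectors stand in for learned embeddings and a real negative-sampling step:

    # Skip-gram negative-sampling loss for one (center, context) pair.
    # v_center: input ("center") embedding; u_context: output embedding of the true
    # context word; u_negatives: output embeddings of k sampled negative words.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_loss(v_center, u_context, u_negatives):
        positive = np.log(sigmoid(np.dot(u_context, v_center)))      # true pair: push together
        negative = np.sum(np.log(sigmoid(-u_negatives @ v_center)))  # sampled pairs: push apart
        return -(positive + negative)  # minimize the negative log-likelihood

    # Placeholder vectors (dimension 100, k = 5 negatives) drawn at random.
    rng = np.random.default_rng(0)
    loss = sgns_loss(rng.normal(size=100), rng.normal(size=100), rng.normal(size=(5, 100)))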

An alternative is to use hierarchical softmax, which computes an exact probability for a given word, normalized over the entire vocabulary (rather than a sampled approximation), in O(log |V|) time, where |V| is the vocabulary size. This is generally used for very large vocabularies, where negative sampling (whose cost is linear in the number of samples k) needs a large k for adequate training.
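
As a sketch of the idea (not word2vec's exact implementation), each word is assigned a path through a binary tree over the vocabulary, and its probability is the product of roughly log2(|V|) binary sigmoid decisions along that path:

    # Hierarchical softmax sketch: the probability of a word is a product of binary
    # decisions along its path in a tree over the vocabulary (~log2(|V|) nodes).
    # `path` is a list of (node_vector, direction) pairs with direction in {+1, -1};
    # the tree construction (usually a Huffman tree) is omitted here.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hierarchical_softmax_probability(hidden, path):
        prob = 1.0
        for node_vector, direction in path:
            prob *= sigmoid(direction * np.dot(node_vector, hidden))
        return prob  # exact, normalized probability of the word given `hidden`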