Masked language modeling (MLM) is a self-supervised learning task used to train language models such as BERT. It consists of predicting missing (masked) words in a sequence given their surrounding context. In BERT, roughly 15% of the input tokens are selected for prediction, and each selected token is handled as follows (a code sketch follows the list):

  • In 80% of cases, the selected token is replaced with the [MASK] token.
  • In 10% of cases, the selected token is replaced with a random token from the vocabulary.
  • In the remaining 10% of cases, the selected token is left unchanged.

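In all three cases, the training objective is to recover the original token at the selected position. Below is a minimal PyTorch sketch of this selection-and-replacement step, assuming token IDs are already available; the function name `mask_tokens` and its arguments are illustrative rather than BERT's actual implementation, and the exclusion of special tokens such as [CLS] and [SEP] from masking is omitted for brevity.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: select ~15% of tokens, then apply the 80/10/10 rule."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select ~15% of positions as prediction targets.
    selected = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~selected] = -100  # unselected positions are ignored by the loss

    # 80% of the selected positions are replaced with [MASK].
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # Half of the remaining selected positions (10% overall) get a random token.
    random_repl = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~masked
    input_ids[random_repl] = torch.randint(vocab_size, labels.shape)[random_repl]

    # The final 10% of selected positions keep their original token.
    return input_ids, labels
```

The -100 label value is the default ignore index of PyTorch's cross-entropy loss, so only the selected ~15% of positions contribute to the MLM objective.
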
MLM recalls an earlier, unordered context-prediction task used by word2vec: continuous bag-of-words (CBOW). In both cases, learning to predict a word from its context forces the model to capture the latent semantics of the missing word, so both can be viewed as a form of denoising.
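
For comparison, here is a toy CBOW-style sketch, assuming a plain full-softmax classifier rather than word2vec's actual training tricks (negative sampling or hierarchical softmax); the class name `CBOW` and all sizes are illustrative. The context embeddings are averaged, so word order plays no role.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Toy continuous bag-of-words: predict a center word from the average
    of its (unordered) context word embeddings."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size); averaging discards word order
        ctx = self.embed(context_ids).mean(dim=1)  # (batch, embed_dim)
        return self.out(ctx)                       # logits over the vocabulary

# Predict a "missing" center word from a window of four context words.
model = CBOW(vocab_size=10_000, embed_dim=64)
context = torch.randint(10_000, (8, 4))  # batch of 8 context windows
target = torch.randint(10_000, (8,))     # the center words to recover
loss = nn.functional.cross_entropy(model(context), target)
```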