Pre-training

BERT is pre-trained simultaneously on two tasks:

  1. Masked language modeling, in which some tokens are masked out of the text and the model predicts them from the surrounding context. About 15% of the input tokens are selected for prediction; of these, 80% are replaced with [MASK], 10% are replaced by a random token, and the remaining 10% are left unchanged (a masking sketch follows this list).

  2. Next-sentence prediction, in which the model is given two sentences and must predict whether they are contiguous in the original text. Half of the training pairs are contiguous; for the rest, the second sentence is chosen at random from the entire training corpus (a pair-construction sketch also follows this list).
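
The masking step can be sketched in a few lines of Python. The constants below (the [MASK] token id and the vocabulary size) are assumptions matching the bert-base-uncased vocabulary, and the function name is illustrative rather than part of any BERT implementation.

    import random

    MASK_ID = 103       # assumed id of [MASK] in the bert-base-uncased vocabulary
    VOCAB_SIZE = 30522  # assumed size of the bert-base vocabulary

    def mask_tokens(token_ids, select_prob=0.15):
        """Select ~15% of positions for prediction; of those, replace 80%
        with [MASK], 10% with a random token, and leave 10% unchanged."""
        corrupted = list(token_ids)
        targets = {}                              # position -> original id to predict
        for i, tid in enumerate(token_ids):
            if random.random() >= select_prob:
                continue                          # position not selected
            targets[i] = tid
            r = random.random()
            if r < 0.8:
                corrupted[i] = MASK_ID            # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the token as-is, but it is still predicted
        return corrupted, targets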

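Pair construction for next-sentence prediction can be sketched similarly; the corpus layout (a list of documents, each a list of sentences) and the 1/0 label convention are illustrative assumptions.

    import random

    def make_nsp_example(corpus, doc_idx, sent_idx):
        """Build one (sentence_a, sentence_b, label) example; assumes the
        corpus is a list of documents, each a list of sentences."""
        document = corpus[doc_idx]
        sentence_a = document[sent_idx]
        if random.random() < 0.5 and sent_idx + 1 < len(document):
            return sentence_a, document[sent_idx + 1], 1      # contiguous pair
        # Otherwise pick a sentence from a different document
        # (assumes the corpus contains more than one document).
        other_docs = corpus[:doc_idx] + corpus[doc_idx + 1:]
        return sentence_a, random.choice(random.choice(other_docs)), 0
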
Both tasks are trained with an ordinary cross-entropy loss, and the two losses are summed to form the pre-training objective.
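
For concreteness, the two losses might be combined as follows in PyTorch, with random tensors standing in for real model outputs; the batch size, sequence length, and the -100 ignore label are illustrative choices, not values fixed by BERT.

    import torch
    import torch.nn as nn

    vocab_size = 30522                            # assumed bert-base vocabulary size
    mlm_logits = torch.randn(2, 8, vocab_size)    # per-token scores, batch of 2 x 8 tokens
    mlm_labels = torch.randint(0, vocab_size, (2, 8))
    mlm_labels[:, 2:] = -100                      # only the selected positions are scored

    nsp_logits = torch.randn(2, 2)                # "is next" vs "not next" per pair
    nsp_labels = torch.tensor([1, 0])

    ce = nn.CrossEntropyLoss(ignore_index=-100)
    mlm_loss = ce(mlm_logits.view(-1, vocab_size), mlm_labels.view(-1))
    nsp_loss = ce(nsp_logits, nsp_labels)
    loss = mlm_loss + nsp_loss                    # both objectives are optimized together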

Fine-tuning

To fine-tune BERT, one replaces the pre-training output heads with a task-specific head and then trains the entire model on the downstream task. In this way, BERT acts both as a feature extractor and as a weight initialization for a task-specific model; i.e., it is a hybrid of the two major paradigms of transfer learning.
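
A minimal fine-tuning sketch, assuming the Hugging Face Transformers library and a two-class sentence-classification task; the model name, the example sentence, and its label are illustrative.

    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    # Load the pre-trained encoder and attach a fresh two-class classification head.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    inputs = tokenizer("An example sentence to classify.", return_tensors="pt")
    labels = torch.tensor([1])                    # illustrative gold label

    outputs = model(**inputs, labels=labels)      # cross-entropy loss over the 2 classes
    outputs.loss.backward()                       # gradients flow through the whole model,
                                                  # not just the new head

In practice, all parameters are then updated for a few epochs with a small learning rate; the original paper fine-tunes with learning rates between 2e-5 and 5e-5.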