Pre-training
BERT is pre-trained simultaneously on two tasks:
- Masked language modeling, in which a fraction (15%) of the tokens in the text are selected and the model predicts each selected token from its context. Of the selected tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged (see the sketch after this list).
- Next-sentence prediction, in which the model is given two sentences and must predict whether they are contiguous. The non-contiguous sentence is chosen randomly from the entire training corpus.
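The 80/10/10 corruption rule can be made concrete with a short sketch. The following is a minimal Python illustration, not taken from the original BERT codebase; the token strings, the tiny vocabulary, and the helper name are placeholders standing in for the real WordPiece vocabulary and preprocessing.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "dog", "ran", "fast", "today"]  # stand-in for the real WordPiece vocabulary

def corrupt_for_mlm(tokens, select_prob=0.15):
    """Select ~15% of tokens as prediction targets; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged."""
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < select_prob:
            targets.append(tok)                         # the model must recover this token
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK)                  # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.choice(VOCAB))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                   # 10%: keep the original token
        else:
            corrupted.append(tok)
            targets.append(None)                        # not a prediction target
    return corrupted, targets

print(corrupt_for_mlm("the dog ran fast today".split()))
```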
Both tasks are trained with the ordinary cross-entropy loss, and the two losses are summed to give the total pre-training objective.
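As an illustration of how the two cross-entropy losses are typically combined, here is a PyTorch sketch with random placeholder logits; the shapes and the -100 "ignore" convention for unmasked positions are assumptions made for the example, not details from the original implementation.

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab_size = 8, 128, 30522             # illustrative shapes
mlm_logits = torch.randn(batch, seq_len, vocab_size)   # per-position vocabulary logits
nsp_logits = torch.randn(batch, 2)                     # contiguous vs. non-contiguous

# Positions that were not selected for masking are marked -100 and ignored by the loss.
mlm_labels = torch.full((batch, seq_len), -100)
mlm_labels[:, 5] = 1037                                 # pretend one masked position per sequence
nsp_labels = torch.randint(0, 2, (batch,))

mlm_loss = F.cross_entropy(mlm_logits.view(-1, vocab_size),
                           mlm_labels.view(-1), ignore_index=-100)
nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
loss = mlm_loss + nsp_loss                              # the two objectives are optimized jointly
```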
Fine-tuning
To fine-tune BERT, one replaces the final layer of the model with a task-specific head and then trains the entire model on the downstream task. In this way, BERT serves both as a feature extractor and as a weight initialization for the task-specific model; i.e., fine-tuning combines the two major paradigms of transfer learning.
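As a minimal sketch of this procedure using the Hugging Face transformers library (the library, checkpoint name, and toy inputs are assumptions chosen for illustration, not part of the text above): the pre-trained encoder is loaded, a freshly initialized classification head is placed on top, and a single training step updates both the head and the encoder weights.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained encoder with a new, randomly initialized classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["a great movie", "a dull movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # all weights are trainable
outputs = model(**batch, labels=labels)                     # cross-entropy loss from the head
outputs.loss.backward()                                     # gradients reach the entire encoder
optimizer.step()
```

Because the optimizer is given all of the model's parameters, the pre-trained weights are updated along with the new head rather than being frozen, which is what distinguishes fine-tuning from pure feature extraction.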