REALM
REALM predates both DPR and RAG, and essentially attempts to add a nonparametric memory to BERT. REALM learns three BERT models: two for its retriever (one for the query and one for the documents), and one for its encoder. The documents are embedded and stored in a vector index. During a forward pass, the model embeds the query and does a top-k MIPS against the documents. For each retrieved document, it concatenates the query and the document, then passes this through the third BERT with either a Cloze task head (pre-training) or an extractive QA task head (fine-tuning).
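A minimal sketch of that retrieve-then-read step is below. The names `query_encoder`, `reader`, `doc_index`, and `doc_texts` are hypothetical stand-ins for the three BERT models and the vector index described above, not REALM's actual API.

```python
import torch

def realm_forward(query_tokens, query_encoder, reader, doc_index, doc_texts, k=5):
    # 1. Embed the query with the query-side BERT.
    q = query_encoder(query_tokens)                   # shape: (dim,)

    # 2. Top-k maximum inner product search against the (possibly stale) index.
    scores = doc_index @ q                            # shape: (num_docs,)
    top_scores, top_ids = torch.topk(scores, k)

    # 3. Retrieval distribution over the k retrieved documents.
    p_doc = torch.softmax(top_scores, dim=0)

    # 4. Run the third BERT on [query; doc] for each retrieved document;
    #    its Cloze / QA head output can then be weighted by p_doc.
    outputs = []
    for doc_id in top_ids:
        joint_input = torch.cat([query_tokens, doc_texts[doc_id]])
        outputs.append(reader(joint_input))           # per-document logits
    return torch.stack(outputs), p_doc
```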
REALM retrieves from the index during every forward pass, even during training. As a result, the embeddings in the index grow gradually stale until the index can be refreshed (asynchronously). This is not as crazy as it sounds, as it’s somewhat like how asynchronous parameter servers work. Still, REALM is quite complex; DPR greatly simplified it.
DPR
DPR skips the independent pre-training on the Cloze task, instead simultaneously training two embedding models (one for questions, one for passages) to maximize the likelihood of the correct passage against the other passages in the batch, using a novel softmaxed negative log-likelihood loss over inner-product similarities. As part of their experimental setup, the authors coupled DPR to BERT for extractive QA by simply concatenating the selected passage to the query, separated by a special token.
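A minimal sketch of that in-batch objective, assuming hypothetical question/passage encoders that each produce one embedding per input: for a batch of B matched pairs, every question's positive passage doubles as a negative for the other B-1 questions, and the loss is the negative log likelihood of the correct passage under a softmax over the similarity scores.

```python
import torch
import torch.nn.functional as F

def dpr_loss(question_embs, passage_embs):
    # question_embs, passage_embs: (B, dim); row i of each is a matched pair.
    sim = question_embs @ passage_embs.T        # (B, B) inner-product similarities
    targets = torch.arange(sim.size(0))         # correct passage sits on the diagonal
    # cross_entropy = softmax + negative log likelihood of the positive passage
    # against the in-batch negatives.
    return F.cross_entropy(sim, targets)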
RAG
RAG uses DPR (the retriever itself, not the extractive-QA experiment with BERT) to select the top k most relevant documents, and then feeds the concatenation of the query and each retrieved document to a seq2seq generator to produce tokens. There are two (or arguably three) variants in the paper, but they boil down to doing a beam search conditioned on each of the top k documents having been selected, weighted by the probability that the document was the correct one in the first place.
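A minimal sketch of that weighting, in the spirit of the RAG-Sequence variant: the score of a candidate output marginalizes the generator's likelihood over the retrieved documents, with each document's contribution weighted by its retrieval probability. `retriever` and `generator` are hypothetical stand-ins for DPR and the seq2seq model, where `generator` returns log p(y | query, doc) for a candidate y.

```python
import torch

def rag_sequence_log_prob(query, candidate, retriever, generator, k=5):
    docs, doc_scores = retriever(query, k)              # top-k documents and scores
    log_p_doc = torch.log_softmax(doc_scores, dim=0)    # log p(doc | query)

    # log p(y | query, doc) for each retrieved document.
    log_p_y_given_doc = torch.stack(
        [generator(query, doc, candidate) for doc in docs]
    )

    # Marginalize over documents: p(y | x) = sum_d p(d | x) * p(y | x, d).
    return torch.logsumexp(log_p_doc + log_p_y_given_doc, dim=0)
```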