BLIP (Bootstrapping Language-Image Pre-training) is a multimodal vision-language model. It consists of four stacks, sketched schematically in the code after the list:
- An image encoder based on a vision transformer (ViT);
- A unimodal text encoder based on a BERT-style text transformer;
- An “image-grounded text encoder” that employs cross-modal attention; and
- An “image-grounded text decoder” that employs the same cross-modal attention, but with causal rather than bidirectional self-attention so that it can generate text.
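The division of labor among the four stacks is easier to see in a schematic. The PyTorch sketch below is illustrative only, under assumed sizes and module names; the real BLIP implementation uses a BERT-style text transformer, shares most parameters across its three text stacks, and includes layer norms and token/patch embeddings omitted here.

```python
# Schematic sketch of the four BLIP stacks. Illustrative assumptions only:
# sizes, layer counts, and names are made up; layer norms and embeddings omitted.
import torch.nn as nn

DIM, HEADS, LAYERS, VOCAB = 768, 12, 12, 30522  # assumed sizes

class CrossModalBlock(nn.Module):
    """Transformer block whose text tokens also cross-attend to image tokens."""
    def __init__(self):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(DIM, HEADS, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(DIM, HEADS, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(DIM, 4 * DIM), nn.GELU(), nn.Linear(4 * DIM, DIM))

    def forward(self, text, image, causal_mask=None):
        text = text + self.self_attn(text, text, text, attn_mask=causal_mask)[0]
        text = text + self.cross_attn(text, image, image)[0]  # queries: text; keys/values: image
        return text + self.ffn(text)

class BlipSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # 1. Image encoder: a ViT over patch embeddings.
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(DIM, HEADS, 4 * DIM, batch_first=True), LAYERS)
        # 2. Unimodal text encoder: a BERT-style transformer.
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(DIM, HEADS, 4 * DIM, batch_first=True), LAYERS)
        # 3. Image-grounded text encoder: bidirectional self-attention + cross-attention.
        self.grounded_encoder = nn.ModuleList([CrossModalBlock() for _ in range(LAYERS)])
        # 4. Image-grounded text decoder: causal self-attention + cross-attention + LM head.
        self.grounded_decoder = nn.ModuleList([CrossModalBlock() for _ in range(LAYERS)])
        self.lm_head = nn.Linear(DIM, VOCAB)  # projects decoder states onto the vocabulary

    def forward(self, image_tokens, text_tokens, causal_mask=None):
        # Inputs are already-embedded token sequences of shape (batch, length, DIM).
        img = self.image_encoder(image_tokens)   # image embeddings
        txt = self.text_encoder(text_tokens)     # unimodal text embeddings, aligned with img
        fused, dec = text_tokens, text_tokens
        for block in self.grounded_encoder:      # fused states, used for image-text matching
            fused = block(fused, img)
        for block in self.grounded_decoder:      # causal, image-conditioned decoding
            dec = block(dec, img, causal_mask)
        return img, txt, fused, self.lm_head(dec)
```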
The unimodal encoders produce its basic output: a pair of aligned image and text embeddings. The image-grounded text decoder adds a language-modeling head (a projection onto the vocabulary), so the model can also generate text autoregressively, much as GPT does, but conditioned on the image and any preceding text.
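As a concrete illustration, both kinds of output are exposed by the Hugging Face `transformers` port of BLIP. The snippet below is a minimal sketch assuming the publicly released `Salesforce/blip-image-captioning-base` checkpoint and a placeholder image file.

```python
# Minimal sketch: aligned embeddings vs. image-conditioned caption generation
# with the Hugging Face BLIP port ("example.jpg" is a placeholder image path).
from PIL import Image
from transformers import BlipProcessor, BlipModel, BlipForConditionalGeneration

checkpoint = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(checkpoint)
image = Image.open("example.jpg")

# Aligned embeddings from the unimodal image and text encoders.
model = BlipModel.from_pretrained(checkpoint)
inputs = processor(images=image, text="a photo of a dog", return_tensors="pt")
image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
text_embeds = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# Text generation from the image-grounded decoder (captioning).
captioner = BlipForConditionalGeneration.from_pretrained(checkpoint)
caption_ids = captioner.generate(**processor(images=image, return_tensors="pt"))
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```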
Unlike BLIP-2, which bridges a frozen image encoder and a frozen, interchangeable LLM with a lightweight Q-Former, BLIP trains its text stacks end-to-end, so you cannot simply swap out the language model it employs.
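For contrast, a minimal sketch of the BLIP-2 side of that claim: on the Hugging Face hub, each BLIP-2 checkpoint pairs the same Q-Former recipe with a particular frozen LLM (OPT-2.7B and Flan-T5-XL in the example below), so changing the language model is just a matter of loading a different checkpoint.

```python
# The frozen LLM behind BLIP-2 is chosen by the checkpoint you load.
# Both checkpoints below are publicly released; the downloads are large.
from transformers import Blip2ForConditionalGeneration

blip2_opt = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2_t5 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")
```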