BLIP (Bootstrapping Language-Image Pre-training) is a multimodal vision-language model. It consists of four stacks, sketched schematically in the code after the list:
- An image encoder based on a vision transformer (ViT);
- A unimodal text encoder based on a BERT-style text transformer;
- An “image-grounded text encoder” that employs cross-modal attention; and
- An “image-grounded text decoder” that employs the same cross-modal attention, but with causal rather than bidirectional self-attention so that it can generate text.
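The division of labor among the four stacks is easier to see in a schematic. The PyTorch sketch below is illustrative only, under assumed sizes and module names; the real BLIP implementation uses a BERT-style text transformer, shares most parameters across its three text stacks, and includes layer norms and token/patch embeddings omitted here.

```python
# Schematic sketch of the four BLIP stacks. Illustrative assumptions only:
# sizes, layer counts, and names are made up; layer norms and embeddings omitted.
import torch.nn as nn

DIM, HEADS, LAYERS, VOCAB = 768, 12, 12, 30522  # assumed sizes

class CrossModalBlock(nn.Module):
    """Transformer block whose text tokens also cross-attend to image tokens."""
    def __init__(self):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(DIM, HEADS, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(DIM, HEADS, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(DIM, 4 * DIM), nn.GELU(), nn.Linear(4 * DIM, DIM))

    def forward(self, text, image, causal_mask=None):
        text = text + self.self_attn(text, text, text, attn_mask=causal_mask)[0]
        text = text + self.cross_attn(text, image, image)[0]  # queries: text; keys/values: image
        return text + self.ffn(text)

class BlipSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # 1. Image encoder: a ViT over patch embeddings.
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(DIM, HEADS, 4 * DIM, batch_first=True), LAYERS)
        # 2. Unimodal text encoder: a BERT-style transformer.
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(DIM, HEADS, 4 * DIM, batch_first=True), LAYERS)
        # 3. Image-grounded text encoder: bidirectional self-attention + cross-attention.
        self.grounded_encoder = nn.ModuleList([CrossModalBlock() for _ in range(LAYERS)])
        # 4. Image-grounded text decoder: causal self-attention + cross-attention + LM head.
        self.grounded_decoder = nn.ModuleList([CrossModalBlock() for _ in range(LAYERS)])
        self.lm_head = nn.Linear(DIM, VOCAB)  # projects decoder states onto the vocabulary

    def forward(self, image_tokens, text_tokens, causal_mask=None):
        # Inputs are already-embedded token sequences of shape (batch, length, DIM).
        img = self.image_encoder(image_tokens)   # image embeddings
        txt = self.text_encoder(text_tokens)     # unimodal text embeddings, aligned with img
        fused, dec = text_tokens, text_tokens
        for block in self.grounded_encoder:      # fused states, used for image-text matching
            fused = block(fused, img)
        for block in self.grounded_decoder:      # causal, image-conditioned decoding
            dec = block(dec, img, causal_mask)
        return img, txt, fused, self.lm_head(dec)
```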
The unimodal encoders produce its basic output: a pair of aligned image and text embeddings. The image-grounded text decoder adds a language-modeling head (a projection onto the vocabulary), so the model can also generate text autoregressively, much as GPT does, but conditioned on the image and any preceding text.
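As a concrete illustration, both kinds of output are exposed by the Hugging Face `transformers` port of BLIP. The snippet below is a minimal sketch assuming the publicly released `Salesforce/blip-image-captioning-base` checkpoint and a placeholder image file.

```python
# Minimal sketch: aligned embeddings vs. image-conditioned caption generation
# with the Hugging Face BLIP port ("example.jpg" is a placeholder image path).
from PIL import Image
from transformers import BlipProcessor, BlipModel, BlipForConditionalGeneration

checkpoint = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(checkpoint)
image = Image.open("example.jpg")

# Aligned embeddings from the unimodal image and text encoders.
model = BlipModel.from_pretrained(checkpoint)
inputs = processor(images=image, text="a photo of a dog", return_tensors="pt")
image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
text_embeds = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# Text generation from the image-grounded decoder (captioning).
captioner = BlipForConditionalGeneration.from_pretrained(checkpoint)
caption_ids = captioner.generate(**processor(images=image, return_tensors="pt"))
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```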
Unlike BLIP-2, which bridges a frozen image encoder and a frozen, interchangeable LLM with a lightweight Q-Former, BLIP trains its text stacks end-to-end, so you cannot simply swap out the language model it employs.
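For contrast, a minimal sketch of the BLIP-2 side of that claim: on the Hugging Face hub, each BLIP-2 checkpoint pairs the same Q-Former recipe with a particular frozen LLM (OPT-2.7B and Flan-T5-XL in the example below), so changing the language model is just a matter of loading a different checkpoint.

```python
# The frozen LLM behind BLIP-2 is chosen by the checkpoint you load.
# Both checkpoints below are publicly released; the downloads are large.
from transformers import Blip2ForConditionalGeneration

blip2_opt = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2_t5 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")
```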