BLIP-2 is a multimodal large language model that, despite its name, works quite differently from BLIP. Like BLIP, though, it can produce aligned image-text embedding pairs or, through a projection layer, be used to generate text.
BLIP-2 consists of three units:
- A frozen image embedding model;
- A novel cross-modal attention transformer called a “Q-Former” (short for “Querying Transformer”; a toy sketch follows this list); and
- A large language model.
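To make the data flow concrete, here is a deliberately toy PyTorch sketch of the Q-Former idea: a small set of learned query tokens self-attends, cross-attends to the frozen image features, and is then projected into the LLM's embedding space. The class name, the single attention block, and the dimensions below are illustrative assumptions; the real Q-Former is a full BERT-base-style transformer with many such layers.

```python
import torch
import torch.nn as nn

class ToyQFormer(nn.Module):
    """Toy stand-in for BLIP-2's Q-Former: learned query tokens that
    cross-attend to frozen image features and are then projected into
    the LLM's embedding space. One block only, for intuition."""

    def __init__(self, num_queries=32, dim=768, llm_dim=2560):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)
        self.proj = nn.Linear(dim, llm_dim)  # maps queries into the LLM's token-embedding space

    def forward(self, image_feats):  # image_feats: (batch, num_patches, dim) from the frozen encoder
        q = self.queries.expand(image_feats.size(0), -1, -1)
        q, _ = self.self_attn(q, q, q)                       # queries exchange information
        q, _ = self.cross_attn(q, image_feats, image_feats)  # queries pull in visual features
        return self.proj(q)                                  # (batch, num_queries, llm_dim)

image_feats = torch.randn(2, 257, 768)   # e.g. a frozen ViT's patch embeddings for 2 images
prefix = ToyQFormer()(image_feats)
print(prefix.shape)                      # torch.Size([2, 32, 2560])
```

The projected query tokens act as a soft visual prompt that gets prepended to the text embeddings fed to the frozen LLM.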
The only thing that gets trained is the Q-Former; everything else is left frozen.
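In practice that amounts to turning off gradients for everything except the Q-Former, its query tokens, and the projection into the LLM. A minimal sketch using the Hugging Face implementation (the vision_model and language_model attribute names are how transformers currently exposes the frozen pieces, so worth verifying against your version):

```python
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Freeze the image encoder and the LLM; the Q-Former, its learned query
# tokens, and the language projection stay trainable.
model.vision_model.requires_grad_(False)
model.language_model.requires_grad_(False)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,} ({trainable / total:.1%})")
```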
Although the image and text embeddings are produced sequentially rather than jointly, the Q-Former's cross-modal attention keeps them aligned. Hence, they can still be used as image-text embedding pairs.
The embedding model and the large language model are both modular; you can swap in any that you like.
In practice, one would typically use the pre-trained image embedding tower from CLIP (or a model trained with a similar contrastive objective); the BLIP-2 paper itself uses CLIP ViT-L/14 and EVA-CLIP ViT-g/14. Starting from CLIP ensures that the image embedding model already has features optimized for image-text alignment.
The choice of LLM is a matter of convenience; Hugging Face’s BLIP-2 collection features instances built with various sizes of OPT and Flan-T5.
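For example, captioning an image with the blip2-opt-2.7b checkpoint takes only a few lines via transformers. This is a minimal sketch: the checkpoint name, the sample COCO image URL, and the availability of a CUDA device are assumptions.

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# With no text prompt, the LLM conditions only on the Q-Former's projected
# query tokens and free-runs a caption.
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```

Swapping in one of the Flan-T5 variants only changes the checkpoint name; the calling code stays the same.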