BLIP-2 is a multimodal large language model that, despite its name, works quite differently from BLIP. Like BLIP, it can generate aligned image-text embedding pairs or, through a projection layer, be used to generate text.

BLIP-2 consists of three units:

  1. A frozen image embedding model;
  2. A novel cross-modal attention transformer called the “Q-Former” (short for Querying Transformer); and
  3. A large language model.
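
As a rough sketch of how these pieces surface in code: in the Hugging Face transformers implementation (assuming its `Blip2ForConditionalGeneration` class, which exposes the three units as separate sub-modules), you can inspect them directly:

```python
from transformers import Blip2ForConditionalGeneration

# Load a BLIP-2 checkpoint; the three units live as separate sub-modules.
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

print(type(model.vision_model).__name__)    # 1. the frozen image embedding model (a ViT)
print(type(model.qformer).__name__)         # 2. the Q-Former
print(type(model.language_model).__name__)  # 3. the large language model (OPT-2.7B here)
```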

The only component that gets trained is the Q-Former; everything else is left frozen.
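
A minimal sketch of that setup, continuing with the model loaded above and again assuming the Hugging Face implementation (where the learned query tokens and the projection into the LLM sit alongside the Q-Former as `query_tokens` and `language_projection`):

```python
# Freeze the image encoder and the LLM.
for param in model.vision_model.parameters():
    param.requires_grad = False
for param in model.language_model.parameters():
    param.requires_grad = False

# Leave the Q-Former trainable, along with its learned query tokens
# and the projection that feeds the Q-Former's output into the LLM.
for param in model.qformer.parameters():
    param.requires_grad = True
model.query_tokens.requires_grad = True
for param in model.language_projection.parameters():
    param.requires_grad = True
```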

Although the image and text embeddings are created sequentially rather than jointly, the Q-Former’s cross-modal attention keeps them aligned. Hence, they can still be used as image-text embedding pairs.
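
For illustration, once you have an image embedding and some candidate text embeddings in hand (the tensors below are random stand-ins, not the output of any particular API), using them as an aligned pair reduces to a cosine-similarity lookup:

```python
import torch
import torch.nn.functional as F

# Stand-in embeddings: one image vector and three candidate caption vectors.
image_embed = torch.randn(1, 256)   # hypothetical image embedding
text_embeds = torch.randn(3, 256)   # hypothetical text embeddings

# Cosine similarity between L2-normalized embeddings; the highest-scoring
# caption is the best match for the image.
sims = F.normalize(image_embed, dim=-1) @ F.normalize(text_embeds, dim=-1).T
best_match = sims.argmax(dim=-1)
print(sims, best_match)
```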

The embedding model and the large language model are both modular; you can swap in any that you like.

In practice, one would typically use the pre-trained image embedding tower from CLIP (or a model trained through a similar process). Starting from CLIP ensures that the image embedding model already has features optimized for image-text alignment.
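
As a sketch of that modularity, assuming the Hugging Face transformers API (whose `Blip2Config` can be assembled from separate vision, Q-Former, and text configs), you could wire up a randomly initialized BLIP-2 around whatever LLM configuration you prefer:

```python
from transformers import (
    Blip2Config,
    Blip2ForConditionalGeneration,
    Blip2QFormerConfig,
    Blip2VisionConfig,
    OPTConfig,
)

# Assemble a BLIP-2 from separate vision / Q-Former / text configs.
# Any causal-LM config could stand in for the OPT config used here.
config = Blip2Config.from_vision_qformer_text_configs(
    vision_config=Blip2VisionConfig(),
    qformer_config=Blip2QFormerConfig(),
    text_config=OPTConfig(),
)
model = Blip2ForConditionalGeneration(config)  # randomly initialized, for illustration
```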

The choice of LLM is a matter of convenience; Hugging Face’s BLIP-2 collection features instances built with various flavors of OPT and FLAN-T5.
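
For example, a minimal captioning run with one of those checkpoints (here the OPT-2.7B variant; the image URL is just a placeholder, and any RGB image would do) might look like:

```python
import requests
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Fetch an example image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# With no text prompt, the model produces an unconditional caption.
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(generated_ids[0], skip_special_tokens=True).strip()
print(caption)
```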