tl;dr: these models are very different; they just happen to have unfortunately similar names. But let’s sort it out.

  1. CLIP is a two-tower model that produces aligned text and image embeddings.
  2. BLIP uses cross-modal attention to align text and image embeddings.
  3. CLIP cannot be used to generate text.
  4. BLIP can be used to generate text: its text decoder ends in a projection layer (a language-modeling head) that produces a probability for each token in the vocabulary (see the sketch after this list).
  5. BLIP can also be used like CLIP, to produce embeddings.
  6. BLIP-2, despite the name, works quite differently from BLIP:
     a. It encodes the image on its own, using the frozen pretrained image embedding stack from CLIP.
     b. It then passes this image embedding, along with the text query, to a new kind of cross-modal attention transformer called a “Q-Former”.
     c. The output sequence from the Q-Former is then passed to a frozen LLM.
     d. The only component that gets trained is the Q-Former.
     e. Even though the image and text embeddings are created sequentially, the cross-modal attention keeps them aligned, so they can still be used as image-text embedding pairs.
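To make items 4 and 6 concrete, here’s a minimal sketch of both generation paths using the Hugging Face transformers library. The checkpoint names and the image path are my own placeholders, not something from the comparison above: BLIP captions the image with its own text decoder, while BLIP-2 routes the frozen image encoder’s output through the Q-Former into a frozen LLM.

```python
# A minimal sketch (assumed Hugging Face checkpoints, hypothetical image path).
from PIL import Image
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    Blip2Processor, Blip2ForConditionalGeneration,
)

image = Image.open("photo.jpg")  # hypothetical local image

# BLIP: the text decoder's LM head produces a distribution over the vocabulary,
# so it can caption an image directly.
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
blip_inputs = blip_processor(images=image, return_tensors="pt")
caption_ids = blip.generate(**blip_inputs, max_new_tokens=30)
print(blip_processor.decode(caption_ids[0], skip_special_tokens=True))

# BLIP-2: frozen image encoder -> Q-Former -> frozen LLM (an OPT model in this checkpoint).
blip2_processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2_inputs = blip2_processor(
    images=image,
    text="Question: what is shown in the photo? Answer:",
    return_tensors="pt",
)
answer_ids = blip2.generate(**blip2_inputs, max_new_tokens=30)
print(blip2_processor.decode(answer_ids[0], skip_special_tokens=True).strip())
```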
Here’s why they aren’t completely unrelated:
  • The use of the pre-trained image encoder from CLIP means that BLIP-2’s image embeddings already represent a set of features suitable for image-text alignment.
  • BLIP and BLIP-2 can each be used as a drop-in replacement for CLIP, though not vice versa (see the sketch below).
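And a minimal sketch of the embedding / “drop-in” point, again assuming the Hugging Face CLIPModel and CLIPProcessor interfaces (checkpoint names and the image path are placeholders): each tower embeds its modality independently, and cosine similarity between the outputs does image-text matching. To my understanding, BlipModel exposes the same get_image_features / get_text_features methods, which is what makes the swap possible; CLIP, having no decoder, can’t go the other way.

```python
# A minimal sketch of CLIP-style image-text matching (assumed Hugging Face APIs).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical local image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # The two towers run independently; alignment comes from training, not wiring.
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embeds = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Cosine similarity between the independently computed embeddings.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
print(image_embeds @ text_embeds.T)  # 1 x 2 matrix of image-text similarities

# To my understanding, swapping in BlipModel/BlipProcessor with a BLIP checkpoint
# (e.g. "Salesforce/blip-image-captioning-base") keeps this code working, which is
# the "drop-in replacement" claim above.
```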