tl;dr: these models work very differently and just happen to have unfortunately similar names. But let’s sort it out (rough code sketches for each follow the list).
- CLIP is a two-tower model: a text encoder and an image encoder are trained contrastively so that their embeddings land in one shared, aligned space (see the first sketch at the end).
- BLIP instead aligns text and images with cross-modal attention: its text transformer attends directly to the image features.
- CLIP cannot be used to generate text.
- BLIP can generate text: its image-grounded text decoder ends in a language-modeling head (a projection onto the vocabulary) that produces a probability for each token, so it can decode captions autoregressively (see the captioning sketch at the end).
- BLIP can also be used like CLIP, to produce embeddings.
- BLIP-2, despite the name, works quite differently from BLIP (see the BLIP-2 sketch at the end):
  a. It encodes the image on its own, using the frozen, pretrained image-embedding stack from CLIP.
  b. It then passes this image embedding, along with the text query, to a new kind of cross-modal attention transformer called the Q-Former.
  c. The Q-Former’s output sequence is then handed to a frozen LLM.
  d. The only thing that gets trained is the Q-Former (plus a small projection into the LLM’s input space).
  e. Even though the image and text embeddings are produced sequentially rather than in parallel towers, the cross-modal attention keeps them aligned, so they can still be used as image-text embedding pairs.

Here’s why the models aren’t completely unrelated:
- Because BLIP-2 reuses the pretrained CLIP image encoder, its image embeddings already represent features well suited to image-text alignment.
- BLIP and BLIP-2 can be used as drop-in replacements for CLIP, though not vice versa (see the last sketch at the end).
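To make the CLIP bullet concrete, here is a minimal sketch of the two-tower setup via the Hugging Face `transformers` API. The checkpoint name and sample image are just illustrative choices, not anything from the discussion above.

```python
# Minimal CLIP sketch: two independent towers, alignment only in the shared space.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = ["a photo of two cats", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# The towers never attend to each other; similarity is just a (scaled) dot
# product between the image embedding and each text embedding.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)  # higher probability for the caption that matches the image
```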
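Here is the BLIP captioning sketch referenced above: the image-grounded decoder produces tokens one at a time through its vocabulary projection. The specific checkpoint is again just one convenient, assumed choice.

```python
# Minimal BLIP captioning sketch: encoder-decoder with an LM head over the vocabulary.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# The image-grounded text decoder attends to the image features and decodes
# autoregressively; the LM head projects each hidden state onto vocabulary logits.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```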
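The BLIP-2 sketch looks almost identical from the outside, even though internally it is the frozen-encoder → Q-Former → frozen-LLM pipeline described in the list. It assumes the `Salesforce/blip2-opt-2.7b` checkpoint, which is large to download and run.

```python
# Minimal BLIP-2 sketch: frozen ViT -> Q-Former -> frozen OPT language model.
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
prompt = "Question: how many cats are in the picture? Answer:"

# Only the Q-Former (plus a small projection into the LLM's input space) was
# trained; it turns the frozen vision features into "soft prompt" tokens that
# the frozen language model conditions on alongside the text prompt.
inputs = processor(images=image, text=prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```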
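Finally, a rough sketch of the “drop-in replacement” point: BLIP’s retrieval model can score an image-text pair CLIP-style. This assumes the `Salesforce/blip-itm-base-coco` checkpoint and that passing `use_itm_head=False` returns the contrastive cosine similarity between the projected embeddings (rather than the ITM classifier score), which is how recent `transformers` versions behave.

```python
# Rough sketch: using BLIP in place of CLIP for image-text similarity scoring.
import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(images=image, text="two cats lying on a couch", return_tensors="pt")
with torch.no_grad():
    # use_itm_head=False is assumed to skip the cross-attention ITM classifier
    # and return the CLIP-like cosine similarity of the projected embeddings.
    score = model(**inputs, use_itm_head=False).itm_score
print(score)
```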