BLIP is a multimodal vision-language model. It consists of four stacks:

- an image encoder (a ViT) that turns the image into a sequence of patch embeddings;
- a unimodal text encoder, trained contrastively so that its text embeddings align with the image embeddings;
- an image-grounded text encoder, which fuses the two modalities through cross-attention and scores image-text matches;
- an image-grounded text decoder, which generates text autoregressively while cross-attending to the image.
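
To make the wiring concrete, here is a toy PyTorch sketch of the four stacks. Every name, size, and layer is illustrative rather than BLIP's actual implementation; in particular, the real model shares most parameters across the three text stacks.

```python
import torch
import torch.nn as nn

dim, heads = 256, 4  # illustrative sizes, not BLIP's

class GroundedBlock(nn.Module):
    """One transformer block: self-attention over text, cross-attention into image features."""
    def __init__(self, causal: bool):
        super().__init__()
        self.causal = causal
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text, image):
        mask = None
        if self.causal:  # decoder mode: each token attends only to earlier tokens
            n = text.size(1)
            mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        text = text + self.self_attn(text, text, text, attn_mask=mask)[0]
        text = text + self.cross_attn(text, image, image)[0]  # image grounding
        return text + self.ffn(text)

image_encoder = nn.Linear(768, dim)                                      # stand-in for the ViT
text_encoder = nn.TransformerEncoderLayer(dim, heads, batch_first=True)  # unimodal text stack
grounded_encoder = GroundedBlock(causal=False)                           # scores image-text matches
grounded_decoder = GroundedBlock(causal=True)                            # generates captions

img = image_encoder(torch.randn(1, 196, 768))  # 196 ViT patch features
txt = text_encoder(torch.randn(1, 12, dim))    # 12 token embeddings
fused = grounded_encoder(txt, img)             # would feed an image-text-matching head
step = grounded_decoder(txt, img)              # would feed a language-modeling head
```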

Its basic outputs are a pair of aligned embeddings: the image and text encoders each feed a projection layer that maps them into a shared space, where matching pairs land close together. In addition, the image-grounded text decoder can generate text autoregressively, as GPT does, but conditioned on the image (and, optionally, a text prompt).
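
A minimal sketch of both outputs, assuming the Hugging Face transformers BLIP integration; the checkpoint name, test image URL, and candidate caption are illustrative choices, not prescribed by BLIP itself.

```python
import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipModel, BlipForConditionalGeneration

ckpt = "Salesforce/blip-image-captioning-base"  # assumed checkpoint
processor = BlipProcessor.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example COCO image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# 1) Aligned embeddings: image and text are projected into a shared space.
model = BlipModel.from_pretrained(ckpt)
with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(text=["two cats on a couch"], return_tensors="pt"))
print(torch.nn.functional.cosine_similarity(img_emb, txt_emb))  # alignment score

# 2) Text generation: the image-grounded decoder produces a caption token by token.
captioner = BlipForConditionalGeneration.from_pretrained(ckpt)
with torch.no_grad():
    out = captioner.generate(**processor(images=image, return_tensors="pt"), max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```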

Unlike BLIP-2, which bridges a frozen image encoder and a frozen off-the-shelf LLM with a lightweight Q-Former, BLIP trains its text transformer jointly with the rest of the model, so you cannot swap out the language model it employs.
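
For contrast, here is a sketch of BLIP-2 usage with Hugging Face transformers, where choosing a different checkpoint effectively swaps the frozen LLM behind the same Q-Former interface; the checkpoint names, image URL, and prompt are illustrative.

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Picking a checkpoint picks the frozen LLM: OPT-2.7B here,
# or e.g. "Salesforce/blip2-flan-t5-xl" for a Flan-T5 backend.
ckpt = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(ckpt)
model = Blip2ForConditionalGeneration.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example COCO image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=image,
                   text="Question: how many cats are in the picture? Answer:",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```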