CLIP is a two-tower neural network that pairs a Vision Transformer (ViT) image encoder with a text transformer, aligning the two with a symmetric contrastive loss. Like other two-tower models, it produces embeddings in a shared space, so images and text can be compared directly by similarity.
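The contrastive objective can be sketched as follows. This is a minimal NumPy illustration, not the original implementation: given a batch of matched image and text embeddings, it normalizes them, computes pairwise cosine-similarity logits, and averages cross-entropy in both directions with the diagonal (matched pairs) as targets. The function name and the temperature value are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss, as in CLIP.

    image_emb, text_emb: (N, D) arrays where row i of each is a matched pair.
    temperature: scaling for the logits (0.07 is a common choice, assumed here).
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity logits between every image and every caption.
    logits = image_emb @ text_emb.T / temperature

    n = logits.shape[0]
    labels = np.arange(n)  # correct caption for image i is caption i

    def cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Because the loss is symmetric, both encoders are pushed to place matched pairs close together and mismatched pairs apart in the shared embedding space.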