CLIP is a two-tower neural network that pairs a Vision Transformer (ViT) image encoder with a text transformer, aligning the two with a symmetric contrastive loss. Like other two-tower models, it produces embeddings in a shared space, so images and text can be compared directly by similarity.
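The contrastive objective can be sketched as follows. This is a minimal NumPy illustration, not the original implementation: given a batch of matched image and text embeddings, it normalizes them, computes pairwise cosine-similarity logits, and averages cross-entropy in both directions with the diagonal (matched pairs) as targets. The function name and the temperature value are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss, as in CLIP.

    image_emb, text_emb: (N, D) arrays where row i of each is a matched pair.
    temperature: scaling for the logits (0.07 is a common choice, assumed here).
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity logits between every image and every caption.
    logits = image_emb @ text_emb.T / temperature

    n = logits.shape[0]
    labels = np.arange(n)  # correct caption for image i is caption i

    def cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Because the loss is symmetric, both encoders are pushed to place matched pairs close together and mismatched pairs apart in the shared embedding space.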