CLIP
CLIP is a multi-modal pre-trained model from OpenAI that embeds text and images into a shared vector space, so semantically related inputs lie close together. This enables cross-modal retrieval, such as finding the most relevant image for a text description, as well as zero-shot classification and image-text matching, making CLIP a foundational model for building vision-language understanding systems. A minimal usage sketch follows below.
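
The sketch below shows image-text matching with CLIP, assuming the Hugging Face `transformers` implementation and the `openai/clip-vit-base-patch32` checkpoint; the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a public CLIP checkpoint (assumption: openai/clip-vit-base-patch32)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image path and candidate text descriptions
image = Image.open("example.jpg")
texts = ["a photo of a cat", "a photo of a dog"]

# Preprocess both modalities into model inputs
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Similarity of the image against each text, as probabilities
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)  # e.g. higher probability for the text that matches the image
```

The same encoders can be used separately (`get_image_features` / `get_text_features` in this implementation) to precompute embeddings for retrieval over a large corpus.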