CLIP: Multimodal, Foundational magic box of computer vision
Understand the state-of-the-art keywords of AI and create real examples using CLIP, a magical model that is both multimodal and foundational!
When I first encountered CLIP and saw its performance, it impressed me so much that whenever I start a new project, I take CLIP’s performance as the baseline and then check other state-of-the-art models to see whether they can beat it before deciding which model to use for that specific project.
Let’s start with the important terms of new-generation AI models:
Multimodal AI Models
A multimodal model refers to any AI model with at least one of the following properties:
- Inputs are of different modalities (the system can process both image and text)
- Outputs are of different modalities (the system can generate both image and text)
As a computer vision engineer, I will focus on what “multimodal” means in this field, and in this blog I will examine models that combine text and image modalities.
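To make the term concrete, here is a minimal sketch of feeding both modalities to CLIP at once, assuming the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint; the image path and prompts are placeholders, not part of the original example:

```python
# Minimal sketch: CLIP scores one image against several text prompts.
# Assumes: pip install torch transformers pillow; "cat.jpg" is a placeholder image path.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")                      # image modality
texts = ["a photo of a cat", "a photo of a dog"]   # text modality

# The processor tokenizes the texts and preprocesses the image in one call.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores, turned into probabilities over the prompts.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```

Because the same model embeds both the image and the texts, no task-specific classifier head is needed; swapping in different prompts changes what the model is “asked” without any retraining.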
Traditional computer vision models take images as their input data type and extract features from them to learn what a cat should look like (assuming a cat is in the class list of our dataset), without having any idea of what a cat means contextually. Is it an animal or an object? The model has no idea! What it learns is to check the features of the input image and decide whether they match class…