CLIP vs DINOv2: Which one is better for image retrieval?
Compare CLIP and DINOv2 for image retrieval tasks, and learn about their specific strengths and weaknesses to decide which one is best for your custom data.
CLIP has been one of my favorite multimodal models since its publication. Thanks to its foundational character, it is usually my go-to option for any computer vision task, and I often use it for my image retrieval projects as well.
Check out my previous blog if you want to know more about CLIP and the terms “multimodal” and “foundational”.
While working on an image retrieval project with 2 different types of classes, I realized that CLIP works great for one type but performs poorly on the other. Let’s say these classes are “0: flags” and “1: tattoos”. The retrieval results are perfect when I want to retrieve flags similar to my input flag image, but CLIP doesn’t show the same accuracy when my input image is a tattoo crop and I want to retrieve similar tattoo images from my database.
Let’s take a look at some CLIP retrieval outputs for a few flag examples:

In the plots above, we see results on my custom flag image retrieval dataset. Images of 3 different flag types are collected; 1 image per flag type is used as the query (the input image for which we want to retrieve similar ones), and all the rest become the gallery (the database we search for similar images). CLIP retrieves the top-10 images from the gallery without any error: if the query image is a Greek flag, all the retrieved images are also Greek flags.
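To make the query-vs-gallery setup concrete, here is a minimal sketch of how such a retrieval step might look. It assumes the Hugging Face transformers library, the openai/clip-vit-base-patch32 checkpoint, and hypothetical local image file names; the article itself doesn’t specify the exact checkpoint or code.

```python
# Minimal sketch of query-vs-gallery retrieval with CLIP image embeddings.
# Assumptions: Hugging Face transformers, the openai/clip-vit-base-patch32
# checkpoint, and hypothetical local image files.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def embed_images(paths):
    """Return L2-normalized CLIP image embeddings, one row per image."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Hypothetical file lists: one query flag image and a gallery of flag images.
query_paths = ["greek_flag_query.jpg"]
gallery_paths = ["flag_001.jpg", "flag_002.jpg", "flag_003.jpg"]

query = embed_images(query_paths)      # shape: (1, d)
gallery = embed_images(gallery_paths)  # shape: (N, d)

# On normalized embeddings, cosine similarity is just a dot product.
scores = query @ gallery.T             # shape: (1, N)
topk = scores.topk(k=min(10, len(gallery_paths)), dim=-1)
for rank, (idx, score) in enumerate(zip(topk.indices[0], topk.values[0]), 1):
    print(f"{rank:2d}. {gallery_paths[idx]}  (similarity={score:.3f})")
```

The same loop works for DINOv2: only the embedding function changes, while the normalization and top-k ranking stay identical.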
Seems perfect! Does it show the same performance for other types of objects, for example, tattoos?