Neural Networks Intuitions: 11. Contrastive Language-Image Pretraining(CLIP)

Raghul Asokan
5 min read · Apr 17, 2021


Hello everyone!

Welcome back to my series Neural Networks Intuitions. This article is about a very recent and interesting advancement in Computer Vision, Learning Transferable Visual Models From Natural Language Supervision: training models on one dataset and transferring them to unseen datasets without any supervision (with the help of natural language).

CLIP also showcases how the input-output formulation is an important aspect of solving any task with neural nets.

Let us first understand a well-known issue with neural nets and the concept of transfer learning, and then look into CLIP in detail.

Problem: Neural Networks cannot handle out-of-distribution(OOD) data

What does out-of-distribution data mean?

OOD can mean:

  • The input data (i.e. images) contains classes different from the ones the NN was trained on.
  • Large variations w.r.t. lighting, shadows, environment, etc.

Let’s take image classification as an example. Neural networks in general learn good, meaningful representations from a given train set and classify unseen images well when they come from the same distribution.

Here, images from the same distribution means images containing classes that were present in the train set, with little or no deviation in class patterns.

Since a classifier has a fixed set of classes, it is not possible to map unseen patterns/classes to that known, fixed class set (this is still possible in a suboptimal way, which we will see later). This is why we say neural networks learn task-specific representations rather than generic ones, and therefore lack the ability to generalize.

So, now we know neural networks do not generalize to unseen tasks. But the representations learnt on one task can still be useful if the new/downstream task bears some similarity to the original task; even if it has no relation to the original task, it still makes sense to use those weights as initialization rather than training from random initialization.

This is what we call Transfer Learning or Pretraining.

Contrastive Language-Image Pretraining(CLIP)

The paper makes use of Contrastive Learning for pretraining, and its input-output formulation then paves the way for effective zero-shot prediction. We will take a look at each of these in detail!

1. Contrastive Pretraining:

The concept used here is Contrastive Learning. In computer vision, a contrastively trained model usually learns representations by comparing two input images, while in CLIP one input is an image and the other is a text.

If you are not already aware of Contrastive Learning, please check my article here: Distance Metric Learning

In short, Distance Metric Learning means learning a distance in a low-dimensional (non-input) space such that similar images in the input space result in similar representations (low distance) and dissimilar images result in different representations (high distance).

*Although the input is not always necessarily an image.
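To make that concrete, here is a minimal sketch (not from the article) of the classic margin-based pairwise contrastive loss in PyTorch: similar pairs are pulled to a low distance, dissimilar pairs are pushed beyond a margin. The embeddings and the margin value below are placeholders.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    """Margin-based contrastive loss over a batch of pairs.

    emb_a, emb_b: (batch, dim) embeddings of the two inputs in each pair.
    is_similar:   (batch,) tensor of 1s (similar pair) and 0s (dissimilar pair).
    """
    # Euclidean distance between the two embeddings of each pair
    dist = F.pairwise_distance(emb_a, emb_b)
    # Similar pairs: penalize large distances.
    loss_similar = is_similar * dist.pow(2)
    # Dissimilar pairs: penalize distances smaller than the margin.
    loss_dissimilar = (1 - is_similar) * F.relu(margin - dist).pow(2)
    return (loss_similar + loss_dissimilar).mean()

# toy usage with random embeddings standing in for encoder outputs
a, b = torch.randn(4, 128), torch.randn(4, 128)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(pairwise_contrastive_loss(a, b, labels))
```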

Now let us see how two different modalities are fed as input to a Pairwise or Siamese Network in the case of CLIP.

2. Input-Output Formulation:

In simple terms,

The pairwise network in CLIP tries to learn a distance in a low-dimensional space such that an image and its corresponding text caption have similar representations, while the same image and a totally unrelated caption have dissimilar representations.

The flow is as follows (similar to conventional Contrastive Learning):

  1. The dataset consists of loads of <image, text caption> pairs.
  2. The image and text caption are passed to respective image and text encoders and the embeddings(or representations) are extracted.
  3. Embeddings of positive pairs are trained to have a low distance, while embeddings of negative pairs (an image paired with every caption in the batch other than its own) are trained to have a much higher distance.
  4. Steps 2–3 are repeated across all pairs in the dataset.

The trained image and text encoders now produce embeddings that can relate any image to any text caption.
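In code, this amounts to a symmetric cross-entropy over the batch's image-text similarity matrix. Below is a rough sketch in that spirit (mirroring the numpy-style pseudocode in the CLIP paper, not its exact implementation); the embeddings and the temperature value are stand-ins.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of <image, caption> pairs.

    image_emb, text_emb: (batch, dim) outputs of the image and text encoders.
    The i-th image and i-th caption form the positive pair; every other
    caption in the batch acts as a negative for that image (and vice versa).
    """
    # L2-normalize so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) matrix of image-text similarities, scaled by temperature
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))          # positives sit on the diagonal
    loss_i = F.cross_entropy(logits, targets)       # image -> matching caption
    loss_t = F.cross_entropy(logits.t(), targets)   # caption -> matching image
    return (loss_i + loss_t) / 2

# toy usage: pretend these came from an image encoder and a text encoder
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_style_loss(img, txt))
```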

Contrastive Pretraining Approach in CLIP [https://arxiv.org/pdf/2103.00020.pdf]

Note: CLIP is not the first paper to propose this text-image pretraining.

This elegant input-output formulation also solves another difficult problem: Zero Shot Prediction.

3. Zero Shot Prediction:

Let us first understand what zero shot prediction means and how CLIP elegantly solves it.

Zero Shot Prediction (w.r.t. neural networks) means making predictions on dataset A using a neural network trained on a different dataset, without any sort of training on dataset A.

For e.g.: using a model trained on a DOG dataset to classify CAR types, as in the walkthrough below.

The question here is how do you make such predictions. The answer is:

  1. Consider a Dog dataset on which a neural net is already trained and a Car dataset on which inference is to be done. Let us assume ground truth is available for both datasets.
  2. We throw away the fully connected layer of the neural network and keep its backbone.
  3. Consider some reference car images for every class (5–10 per class). By doing this, we will have N reference image groups. For each reference group, create a reference embedding (possibly a centroid) by passing its images through the network.
  4. Take an image from the Car dataset and pass it through the network to get its image embedding.
  5. Compare the image embedding (from (4)) with all reference embeddings (from (3)) and label it with the group to which it is closest.

This method is fine, but it requires reference images that are already labelled from the new Car dataset. Otherwise, this method may not be effective.
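For reference, here is a minimal sketch of that reference-embedding approach; the class names and tensor sizes below are made up for illustration, and random tensors stand in for backbone outputs.

```python
import torch

def class_centroids(reference_embeddings):
    """reference_embeddings: dict class_name -> (n_refs, dim) tensor of
    backbone embeddings of a few labelled reference images per class."""
    return {name: embs.mean(dim=0) for name, embs in reference_embeddings.items()}

def nearest_centroid_label(image_emb, centroids):
    """Assign the image embedding to the class of the closest centroid."""
    names = list(centroids)
    dists = torch.stack([torch.norm(image_emb - centroids[n]) for n in names])
    return names[dists.argmin().item()]

# toy usage: two "car" classes with 5 reference embeddings each
refs = {"sedan": torch.randn(5, 64), "suv": torch.randn(5, 64)}
query = torch.randn(64)
print(nearest_centroid_label(query, class_centroids(refs)))
```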

What if we could use the class label name (yes, the literal class name text) as the reference and use its embedding as the reference embedding to predict classes on unseen datasets?

With that input-output formulation, we can now take class names from a new dataset, generate text embeddings out of them, and compare those with the image embeddings from the same dataset.
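Concretely, this is roughly how zero-shot prediction looks with OpenAI's released clip package (assuming it is installed; the image path and class names below are hypothetical):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# class names of the *new* dataset become the references (hypothetical classes)
class_names = ["golden retriever", "pug", "beagle"]
text_tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

# hypothetical image path from the new dataset
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text_tokens)

# cosine similarity between the image embedding and every class-name embedding
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = (image_emb @ text_emb.T).softmax(dim=-1)
print(class_names[similarity.argmax().item()])
```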

Zero Shot Prediction as mentioned in CLIP [https://arxiv.org/pdf/2103.00020.pdf]

Results of Zero Shot Prediction [https://arxiv.org/pdf/2103.00020.pdf]

Thoughts:

There are still drawbacks to this approach: it requires loads of labelled training data, and it still produces poor results on certain unseen datasets. But making use of text for zero-shot prediction is an important breakthrough for training more robust, generalized neural nets. Text is an important ingredient for achieving generalization in vision, because we are now using one more signal in addition to just images, and text can inherently carry context that images alone cannot.

That’s all in this article. I hope you all understood the simple yet elegant approach of Contrastive Language-Image Pretraining and how effective it can be on unseen datasets.

Cheers :)
