Neural Networks Intuitions: 11. Contrastive Language-Image Pretraining(CLIP)

Hello everyone!

Welcome back to my series Neural Networks Intuitions. This article is about a recent and interesting advancement in Computer Vision: Learning Transferable Visual Models From Natural Language Supervision, i.e. training models on one dataset and transferring them to unseen datasets without any additional supervision (with the help of natural language).

CLIP also showcases how input-output formulation is an important aspect in solving any task using neural nets.

Let us first understand a well-known issue with neural nets and the concept of transfer learning, and then look into CLIP in detail.

Problem: Neural Networks cannot handle out-of-distribution(OOD) data

What does out-of-distribution data mean?

OOD can mean images containing classes that were not present in the train set, or images whose patterns deviate significantly from those seen during training.

Let’s take image classification as an example. Neural networks, in general, learn meaningful representations from a given train set and classify well when unseen images from the same distribution are shown.

Here, “same distribution” means these images contain classes that were present in the train set, with little or no deviation in class patterns.

Since a classifier has a fixed set of classes, it is not possible to map unseen patterns/classes onto that fixed class set (it is still possible in a suboptimal way, which we will see later). This is why we say neural networks learn task-specific representations rather than generic ones, and therefore lack the ability to generalize.

So now we know neural networks do not generalize to unseen tasks. But the representations learnt on one task can still be useful if the new/downstream task bears some similarity to the original task. Even when it has no relation to the original task, it often makes sense to use those weights as initialization rather than training from a random initialization.

This is what we call Transfer Learning or Pretraining.

Contrastive Language-Image Pretraining(CLIP)

The paper makes use of Contrastive Learning for pretraining, and its input-output formulation then paves the way for effective zero-shot prediction. We will take a look at each of these in detail!

1. Contrastive Pretraining:

The concept used here is Contrastive Learning. In computer vision, a contrastively trained model usually learns representations by comparing two input images, whereas in CLIP one input is an image and the other is a text.

If you are not already aware of Contrastive Learning, please check my article here: Distance Metric Learning

In short, Distance Metric Learning means learning a distance in a low-dimensional space (a non-input space) such that similar images in the input space result in similar representations (low distance) and dissimilar images result in different representations (high distance).

*Although the input need not always be an image.
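To make this concrete, here is a minimal sketch of a margin-based pairwise contrastive loss of the kind used in distance metric learning. The function and variable names are made up for illustration, and the random tensors merely stand in for embeddings produced by some encoder; this is not the exact loss used in CLIP.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    """Margin-based contrastive loss on a batch of embedding pairs.

    emb_a, emb_b: (batch, dim) embeddings from any encoder (placeholder).
    is_similar:   (batch,) tensor of 1s (similar pair) and 0s (dissimilar pair).
    """
    dist = F.pairwise_distance(emb_a, emb_b)           # Euclidean distance per pair
    loss_similar = is_similar * dist.pow(2)            # pull similar pairs together
    loss_dissimilar = (1 - is_similar) * F.relu(margin - dist).pow(2)  # push dissimilar pairs apart up to the margin
    return (loss_similar + loss_dissimilar).mean()

# Toy usage with random tensors standing in for encoder outputs.
a = torch.randn(8, 128)
b = torch.randn(8, 128)
labels = torch.randint(0, 2, (8,)).float()
print(pairwise_contrastive_loss(a, b, labels))
```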

Now let us see how two different modalities are fed as input to a pairwise or Siamese network in the case of CLIP.

2. Input-Output Formulation:

In simple terms,

The pairwise network in CLIP tries to learn a distance in a low-dimensional space such that an image and its corresponding text caption have similar representations, while the same image and a totally unrelated caption have dissimilar representations.

The flow is as follows (similar to conventional Contrastive Learning): a batch of N (image, caption) pairs is sampled; the image encoder produces N image embeddings and the text encoder produces N text embeddings; cosine similarities are computed between every image and every caption in the batch; the model is then trained to maximize the similarity of the N correct (image, caption) pairs and minimize the similarity of the remaining N² - N incorrect pairs.

The trained Image and Text encoders now produce embeddings which can relate any image and any text caption.

Contrastive Pretraining Approach in CLIP [https://arxiv.org/pdf/2103.00020.pdf]
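For reference, here is a minimal sketch of this symmetric contrastive objective, roughly following the pseudocode in the CLIP paper. The encoders themselves are omitted (in CLIP they are a vision backbone and a text transformer); random tensors stand in for their outputs, and the temperature value here is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N (image, caption) pairs.

    image_emb, text_emb: (N, dim) outputs of the image and text encoders
    (placeholders here for whatever encoders are used).
    """
    # L2-normalize so the dot product becomes a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) similarity matrix: entry (i, j) relates image i to caption j.
    logits = image_emb @ text_emb.t() / temperature

    # The matching pairs lie on the diagonal.
    targets = torch.arange(logits.size(0))

    # Cross-entropy over rows (image -> correct caption) and columns (caption -> correct image).
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Toy usage with random embeddings standing in for encoder outputs.
print(clip_contrastive_loss(torch.randn(16, 512), torch.randn(16, 512)))
```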

Note: CLIP is not the first paper to propose this text-image pretraining.

This elegant input-output formulation solves another difficult problem: Zero Shot Prediction.

3. Zero Shot Prediction:

Let us first understand what zero shot prediction means and how CLIP elegantly solves it.

Zero Shot Prediction (with respect to neural networks) means making predictions on dataset A using a neural network trained on a different dataset, without any sort of training on dataset A.

For example: using a model trained on a CAR dataset to predict DOG breeds.

The question here is: how do you make such predictions? The answer is to take a few labelled reference images from the new dataset, compute their embeddings with the trained model, and classify a test image by comparing its embedding against these reference embeddings (for example, by nearest neighbour in the embedding space).

This method works, but it requires labelled reference images from the new dataset. Without them, it may not be effective.
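A minimal sketch of this reference-embedding approach is shown below. The embeddings are assumed to come from the pretrained model, and the function and label names are hypothetical.

```python
import torch
import torch.nn.functional as F

def nearest_reference_class(test_emb, reference_embs, reference_labels):
    """Classify a test embedding by its closest labelled reference embedding.

    test_emb:         (dim,) embedding of the test image.
    reference_embs:   (K, dim) embeddings of labelled images from the new dataset.
    reference_labels: list of K class names for those reference images.
    """
    sims = F.cosine_similarity(test_emb.unsqueeze(0), reference_embs)  # (K,) similarities
    return reference_labels[sims.argmax().item()]

# Toy usage: random vectors stand in for embeddings from the pretrained model.
refs = torch.randn(3, 256)
labels = ["beagle", "husky", "poodle"]
print(nearest_reference_class(torch.randn(256), refs, labels))
```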

What if we could use the class label name (yes, the literal class name text) as the reference, and use its text embedding to predict classes on unseen datasets?

With CLIP's input-output formulation, we can now take the class texts from a new dataset, generate text embeddings out of them, and compare those with the image embeddings from that dataset.

Zero Shot Prediction as mentioned in CLIP [https://arxiv.org/pdf/2103.00020.pdf]
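Below is a minimal sketch of this zero-shot step. The prompt template “a photo of a {label}” follows the paper, while text_encoder here is a placeholder callable standing in for CLIP's trained text encoder, not a real library API.

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(image_emb, class_names, text_encoder):
    """Predict a class for one image by comparing it with class-name text embeddings.

    image_emb:    (dim,) embedding of the image from the trained image encoder.
    class_names:  list of class label strings from the unseen dataset.
    text_encoder: placeholder callable mapping a list of strings to (C, dim) embeddings.
    """
    prompts = [f"a photo of a {name}" for name in class_names]  # prompt template from the paper
    text_emb = F.normalize(text_encoder(prompts), dim=-1)       # (C, dim)
    image_emb = F.normalize(image_emb, dim=-1)                  # (dim,)
    sims = text_emb @ image_emb                                 # cosine similarity per class
    return class_names[sims.argmax().item()]

# Toy usage: a random "text encoder" stands in for CLIP's trained text encoder.
fake_text_encoder = lambda prompts: torch.randn(len(prompts), 512)
print(zero_shot_predict(torch.randn(512), ["dog", "cat", "car"], fake_text_encoder))
```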

Results of Zero Shot Prediction:

[https://arxiv.org/pdf/2103.00020.pdf]

Thoughts:

There are still drawbacks to this approach: it requires an enormous amount of paired image-text training data, and it still produces poor results on certain unseen datasets. But making use of text for zero-shot prediction is an important step towards training more robust, generalized neural nets. Text is an important aspect of achieving generalization in vision, because we are now using one more source of information in addition to images, and text can inherently carry context that images alone cannot.

That’s all in this article. I hope you all understood the simple yet elegant approach of Contrastive Language-Image Pretraining and how effective it can be on unseen datasets.

Cheers :)
