Hello everyone!

Welcome back to my series Neural Networks Intuitions. This article is about a recent and interesting advancement in Computer Vision: Learning Transferable Visual Models From Natural Language Supervision (CLIP), i.e., training a model on one dataset and transferring it to unseen datasets without any additional supervision, with the help of natural language.

CLIP also showcases how important the input-output formulation is when solving a task with neural nets.

Let us first understand a well-known issue with neural nets and the concept of transfer learning, and then look into CLIP in detail.

Problem: Neural Networks cannot handle out-of-distribution (OOD) data

What does out-of-distribution data mean?

Welcome everyone!

The tenth article in my series “Neural Networks Intuitions” is about an intriguing and efficient technique for learning representations from unlabelled data: Bootstrap Your Own Latent (BYOL): A New Approach to Self-Supervised Learning.

For some background on transfer learning and self-supervised learning, please check my previous article on Self-supervised Learning and SimCLR.

Let us understand the problem first and later see how BYOL solves it in an elegant way :)


Consider the problem of image classification with limited labelled data. Since we all know neural nets are data hungry, training this classifier on this limited dataset…

Distance Metric Learning

Welcome back to my series Neural Networks Intuitions. In this ninth segment, we will be looking into deep distance metric learning, the motivation behind using it, the wide range of methods proposed, and its applications.

Note: All techniques discussed in this article come under Deep Metric Learning (DML), i.e., distance metric learning using neural networks.

Distance Metric Learning:

Distance Metric Learning means learning a distance in a low-dimensional space which is consistent with the notion of semantic similarity (as given in [No Fuss Distance Metric Learning using Proxies]).

What does the above statement mean w.r.t. the image domain?

It means learning a distance in…
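As a rough illustration of the definition above, the sketch below compares distances between embedding vectors. The three embeddings are hypothetical values made up for this example, and plain Euclidean distance is assumed as the metric; in deep metric learning a neural network would be trained to produce embeddings with this property.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 3-dimensional embeddings (made-up values, not model outputs).
cat_1 = [0.9, 0.1, 0.0]
cat_2 = [0.8, 0.2, 0.1]
truck = [0.0, 0.1, 0.9]

# Semantically similar images should end up closer than dissimilar ones.
d_similar = euclidean_distance(cat_1, cat_2)
d_different = euclidean_distance(cat_1, truck)
print(d_similar < d_different)
```

In this toy setup, the two "cat" embeddings are closer to each other than either is to the "truck" embedding, which is what a distance consistent with semantic similarity would look like.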

Hello everyone!

This article is going to be a short one, focusing on a less significant but often overlooked concept in object detectors, especially single shot detectors: Translation Invariance.

Let’s understand what translation invariance is and what makes an image classifier/object detector translation invariant.

*Note: This article assumes you have background knowledge of how single and two-stage detectors work :-)

Translation Invariance:

Translation in computer vision means displacement in space, and invariance means the property of being unchanged.
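To make the definition concrete, here is a minimal sketch, assuming a toy 6x6 "image" stored as nested lists and a circular shift standing in for translation. The helper names `shift_image` and `global_max_pool` are hypothetical and chosen only for this illustration. Global max pooling is used as a simple example of an operation whose output does not change when the input is displaced.

```python
def shift_image(image, dy, dx):
    """Circularly shift a 2D list-of-lists image down by dy and right by dx."""
    h, w = len(image), len(image[0])
    return [[image[(r - dy) % h][(c - dx) % w] for c in range(w)]
            for r in range(h)]

def global_max_pool(image):
    """Reduce the whole image to its single largest value."""
    return max(max(row) for row in image)

# Toy image: all zeros except a small bright "object" near the top-left.
image = [[0.0] * 6 for _ in range(6)]
image[1][1] = 1.0
image[1][2] = 0.5

# Translate the object 2 pixels down and 3 pixels to the right.
shifted = shift_image(image, 2, 3)

print(image != shifted)                                    # the pixels moved
print(global_max_pool(image) == global_max_pool(shifted))  # the output did not
```

The pixel grid changes under the shift, but the pooled value stays the same, which is the sense in which the output is "unchanged" under displacement.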

Therefore when we say an image classifier or an object detector is translation invariant, it means:

Image Classifier can predict a…

Hello folks :-)

Today I will be talking about one of the most important and interesting topics in the deep learning domain, Self-supervised Learning, and a recent paper showing SOTA results in self-supervised learning: A Simple Framework for Contrastive Learning of Visual Representations.

Self-supervised learning is crucial because manual labelling of data at scale is very expensive and tedious. Hence the focus has shifted to self-supervised and unsupervised learning, in order to reduce the need for labelling an enormous amount of data.

Let us first understand what self-supervised learning is and how it helps in solving…

Hey folks! It’s good to be back again :-)

It has been a while since I published my last article. In this sixth installment of my series “Neural Networks Intuitions”, I will be talking about one of the most widely used scene text detectors: EAST (Efficient and Accurate Scene Text Detection). As the name suggests, it is not just accurate but also much more efficient than its text detector counterparts.

Firstly, let us look at the problem of Scene Text Detection in general and then dive deep into the working of EAST :-)

Scene Text Detection:

Problem: The problem, as already mentioned above, is to detect text in natural scene…

Hello everyone! Welcome back to my series on Neural Network Intuitions.

Today I will be talking about an elegant concept introduced in object detectors — Anchors, how they help in detecting objects in an image and how they differ from the traditional two-stage detectors.

As always, let us look at the problem for which these anchors were introduced as a solution :-)

Before starting with anchors, let us see how two-stage object detectors work and how they actually contribute to the evolution of single stage detectors.

Two-stage Object Detectors: A traditional two-stage object detector detects objects in an image in two…


Today I am going to talk about a topic which took me a long time (reaallyy long time) to understand, both the motivation for using it and how it works: Connectionist Temporal Classification (CTC).

Before talking about CTC, let us first understand what Sequence-to-Sequence models are :-)

Sequence-to-Sequence Models: These are neural networks which take in any sequence (characters, words, image pixels) and give an output sequence, which could be in the same domain as the input or a different one.

A few examples of Sequence-to-Sequence models:

  1. Language translation problem — where input is a sequence of words(a sentence) in…

Hey everyone!

Today, in my series Neural Networks Intuitions, I am going to discuss the RetinaNet paper, Focal Loss for Dense Object Detection. The paper introduces RetinaNet, a single shot object detector which is fast compared to two-stage detectors and also addresses a problem which all single shot detectors have in common: they are not as accurate as two-stage object detectors.

Link to the paper: Focal Loss for Dense Object Detection

So how does RetinaNet solve this problem of inaccuracy which all single shot detectors possess in general? …

Hello everyone!

Today we are going to have a look at one of the interesting problems that have been solved using neural networks: “Image Style Transfer”. The problem is to take two images, extract content from one image and style (texture) from the other, and seamlessly merge them into one final image that looks realistic. This blog post is an explanation of the article A Neural Algorithm of Artistic Style by Gatys et al. (https://arxiv.org/abs/1508.06576)

Let’s look at an example to make things clear.

Top left is the content image, bottom left is the style image, result is the one on the right

Interesting, right? Let’s take a look at how to solve it.

Overview: Let me give a…

Raghul Asokan

Deep Learning Engineer at Infilect
