Neural Networks Intuitions: 13. Out-of-Distribution Detection and HOD Loss — Paper Explanation
Good to be back with another article in my series “Neural Networks Intuitions”. This time we will be looking at an often overlooked yet important and difficult problem in neural networks: how do we tell whether the input to a trained network is seen or unseen?
This problem is known as Out-of-Distribution (OOD) detection, where the network has to determine whether the input it is fed is out of distribution (we will look at what this means in the next section).
In this article, I will explain the paper Does Your Dermatology Classifier Know What It Doesn’t Know? Detecting the Long-Tail of Unseen Conditions, where the authors propose an interesting loss formulation to detect OOD inputs, and I will also cover other key approaches to this general problem of OOD detection in neural networks.
Let us first understand what OOD actually is and look at the solutions that have been proposed so far to handle it :)
Out-of-distribution inputs (w.r.t. neural networks) usually mean inputs that were not seen by the network during the training phase, i.e. inputs outside the training data distribution.
The term “unseen” can mean two things here:
- inputs that do not correspond to any of the class patterns that the network was trained on.
- inputs that differ highly in terms of texture, lighting conditions, environment etc. from those present in the training set.
Let’s make this more concrete by taking an example.
Consider a fine-grained image classification task, where the goal is to classify entities at a granular level and the inter-class similarity is usually high. Let’s take the problem of classifying cat breeds, with 73 classes (cat breeds) in total.
This is an interesting problem given the high inter-class similarity, and it can be even more challenging if the dataset is highly imbalanced. All that aside, one can still train a decent/good network which predicts these fine-grained classes accurately.
The network is fed only images of cat breeds during training, with the assumption that at test time only images of the same cat breeds will be fed. Although this is a relatively fair assumption, it won’t hold when building real-world systems, where there is high uncertainty.
Now the problem gets even more interesting when the network is shown an image of a tiger/leopard or even a totally irrelevant image such as a car or a picture of a person — what will be the output of that neural network?
The network simply gives one of the cat breed classes as the output. It can even happen that the network produces that output with high confidence.
So how do we tackle such a problem? Are neural networks really equipped enough to handle this problem of OOD?
In order to answer the above question, we should try to understand the nature of neural networks and see how to formulate the solution to OOD detection.
In very simple terms, neural networks are universal function approximators that try to learn a mapping from the input to the output. I won’t be going into the details, but at a high level this means: given labelled input-output pairs and a loss function, the network parameters are adjusted so that this input-to-output mapping is learnt.
A key thing to note here is that there is no true concept of generalization in neural networks: what we call generalization is essentially interpolation. This means it is necessary for the network to have seen the input (or something close to it) in order to correctly map it to its corresponding output. And this is one major limitation of NNs.
Please take a look at this wonderful thread on NNs interpolation by Francois Chollet — https://twitter.com/fchollet/status/1409941470304374787
Given this nature of NNs, how can one tackle this highly overlooked yet important problem of making a neural network aware of what it has seen vs what it has not?
This problem can be tackled in two ways:
- Introduce network calibration/uncertainty estimation within the network during training, so that it is able to flag such OOD inputs.
- Feed out-of-distribution or outlier data as part of training as much as possible (in an iterative/periodic manner), to make sure the network knows what an outlier looks like.
Based on my experience/intuitions, solution (2) can be the most reliable one despite being a brute-force/cumbersome approach. At present, given this limitation of NNs, it seems to be the best approach, although a mix of (1) and (2) will be a more robust solution.
Now that we know how to tackle the OOD problem in NNs, let us take the above fine-grained cat breed classification task and see how to make the network robust to such OOD inputs.
1. Softmax Confidence
One of the most straightforward ways to estimate how certain a network’s prediction is, is to make use of its softmax confidence.
We could simply threshold the output, say at a probability of 0.5, and flag any prediction below 0.5 as uncertain (or OOD in our case), only considering predictions with a probability of at least 0.5.
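As a sketch, this thresholding idea looks like the following (a minimal NumPy example; the 0.5 cutoff is just the illustrative value from above):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def is_ood(logits, threshold=0.5):
    # Flag a prediction as OOD when its top softmax
    # probability falls below the threshold.
    probs = softmax(np.asarray(logits, dtype=float))
    return probs.max(axis=-1) < threshold
```

A peaked logit vector like [5, 0, 0] passes the check, while a flat one like [1, 1, 1] (max probability 1/3) gets flagged.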
Although this seems like an easy solution, it does not always work in practice and is totally unreliable when it comes to real-world problems.
And the reason being, neural networks often produce highly confident predictions even for out-of-distribution inputs.
There are two reasons why networks output high confidence even for unseen samples:
- First, the very nature of NNs: a network has to be fed samples during training in order to map them to their respective classes; otherwise it can end up mapping unseen inputs to one of the known classes.
- Second, high confidence is a result of using softmax at the output layer, which converts the raw class logits into a probability distribution, pushing the values toward 0s and 1s based on the relative magnitudes of the logits.
There are some early papers on this phenomenon worth checking out.
This is bad, as we are now not able to use the softmax confidence as it is :(
But is there a way to calibrate the confidence of the network so that it outputs not-so-highly-confident predictions, i.e. neither too high nor too low probabilities?
Yes, there is, and Label Smoothing is one such solution.
2. Label Smoothing
Label smoothing is a regularization technique which makes a neural network output less confident predictions in general, thereby preventing overfitting and enabling better generalization.
This is usually achieved by replacing the hard ground-truth labels, 1s and 0s, with soft labels such as 0.9 (instead of 1) and 0.1 (instead of 0), just as an example. In essence, a noise distribution is introduced which prevents the ground-truth labels from being exactly 0s and 1s, keeping them only near 0s and 1s.
And why does this help?
As we know, neural nets usually produce highly confident predictions even for OOD inputs, and hard ground-truth labels, which force the network to output near 1s and 0s, won’t really help the cause. To tackle this, label smoothing introduces noise into the ground-truth labels and encourages the network to output reasonable probabilities for each input.
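A minimal sketch of how such soft targets can be constructed (using the common uniform-smoothing form, where the smoothing mass ε is spread evenly over all classes; ε = 0.1 is just an illustrative value):

```python
import numpy as np

def smooth_labels(class_indices, num_classes, epsilon=0.1):
    # Replace hard one-hot targets with soft targets:
    # the true class gets (1 - eps) + eps/K, every other
    # class gets eps/K, so each row still sums to 1.
    one_hot = np.eye(num_classes)[np.asarray(class_indices)]
    return (1.0 - epsilon) * one_hot + epsilon / num_classes
```

With 10 classes and ε = 0.1, the true class target becomes 0.91 and every other class 0.01, i.e. near 1s and 0s rather than exact ones.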
3. Training with a single reject/unknown/outlier class
All these solutions that tinker with the network’s output confidence are good, but there is still an underlying problem that we need to address:
NNs require inputs to be shown during training in order to map them to a respective class output.
So how do we solve this problem?
Let us take the above fine-grained cat breed classification task, where the inputs fed in will not just be cat breeds but also images of cheetahs, leopards and panthers (which bear similarity to cat patterns).
In such cases, where the unknown or outlier class patterns are very similar to the known class patterns, even the network confidence won’t be reliable: the network can be fooled quite easily, as it has not seen these images during training.
Therefore, it is important for the network to be trained on these panther/leopard/cheetah images in order to learn the differences between an actual cat breed and non-cats.
How do we do that?
Simply create an outlier/unknown/reject class, put all these panther/leopard/cheetah images into it, and train the network with N+1 class outputs instead of just N, where N=73 in our case.
This should make the neural network robust to such unseen patterns that are quite similar to the known ones.
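In code, this is just a relabelling step before training. A minimal sketch (the 73-breed count follows the running example; `breed_to_index` is a hypothetical lookup table):

```python
NUM_BREEDS = 73             # known cat breed classes: indices 0..72
OUTLIER_CLASS = NUM_BREEDS  # the extra (N+1)-th class, index 73

# Hypothetical lookup from breed name to class index (truncated here).
breed_to_index = {"persian": 0, "siamese": 1}

def training_label(label_name):
    # Any label that is not a known breed (panther, leopard,
    # cheetah, ...) collapses into the single outlier class.
    return breed_to_index.get(label_name, OUTLIER_CLASS)
```
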
Let us add another step of complexity to this!
What if the network is fed not just panther/leopard/cheetah images but also totally irrelevant patterns such as cars, bikes and landmark images?
Well, we already know the solution to this, don’t we?
We can just add all these new images to the same single outlier class and train the network to learn both the cat breeds and these new car/bike/landmark patterns.
But there is one problem here. As the space of possible patterns for this outlier class grows, the network is forced to map all these highly diverse patterns, such as panthers, cheetahs, leopards, cars, bikes and landmarks, to a single class.
Even though this particular approach works in practice and the network still learns to map these different patterns to a single class, it has an indirect effect on the learning of the known classes (cat breeds in our case).
Let’s look at how to deal with the above problem in the next section.
4. Training with a set of fine-grained(fanned out) reject/unknown/outlier classes
A better approach would be to fan out the outlier class into multiple outlier classes such as panther, leopard, cheetah, car, bike and landmark (6 in total) and train a network with N+6 classes instead of N+1.
The intuition behind why this might work better than having a single outlier class is that the network is now provided with consistent (unique) class patterns per output, which helps it learn the mapping much more easily.
There is some empirical evidence to support this. Please take a look at the table below:
Great! Now we are clear on how to deal with OOD inputs using outlier classes. But there is still a drawback to this particular approach:
It requires data labelling for such outlier/unknown classes and continuous retraining of these networks to include new unseen patterns.
Yes, this has primarily been the limitation of neural networks, and that’s what makes it difficult (not impossible) to use them in the real world.
Detectors are the best examples of NNs using outlier class
Till now, we have been building up a problem, starting with cat breed classification and introducing new class patterns to increase the complexity at every step. But this problem occurs naturally in one of the real-world Computer Vision tasks, and that is Object Detection.
Let’s see how this problem relates to Object Detection and how a detector is actually solving it.
Object detection is a computer vision task where the task is to accurately predict the class type and location of an object of interest in a given image.
At the outset, the problem seems simpler: we are asked to detect a fixed number of known classes. But given the nature of NNs, there is also an indirect requirement for the network to learn what is known vs what is unknown.
Consider any anchor based object detector — be it the first stage of a two-stage Faster-RCNN or a single stage SSD/YOLO.
In general, each of these will be trained with N known classes and an extra background class, where any anchor that doesn’t intersect with any of the known object locations is considered background.
So an anchor-based detector is fed not only patterns of known classes but also patterns of unknown classes, in the form of the background class (a single outlier class, as discussed above), and that is how the network learns to map known and unknown class patterns respectively.
And yes, this works in practice too!
Although this is beyond the scope of this article, I still feel that this very aspect of detectors (being forced to learn a universal set of patterns as part of the background class, which not only provides inconsistent signals but also creates a huge data imbalance) is a major factor in determining the network’s accuracy, and a lot of optimisations/improvements could be done here.
5. Hierarchical Outlier Detection Loss
We have seen the past methods that can be used to tackle this problem of out-of-distribution inputs in neural networks.
Now let’s look at a recent paper, Does Your Dermatology Classifier Know What It Doesn’t Know? Detecting the Long-Tail of Unseen Conditions, and understand how it solves the OOD problem.
The paper deals with a specific problem of dermatology classification and the dataset used exhibits the following characteristics:
- there are a total of 225 classes, divided into two groups: inliers and outliers, with 26 inlier classes and 199 outlier classes. Outliers are the same as the unknown classes we discussed earlier.
- it follows a long-tail distribution with huge data imbalance: inliers are abundant while outliers are very scarce.
We can take a look at the image below to get a better understanding of the dataset used.
So how do we solve this problem with the help of past solutions that we discussed?
We can simply augment these under-represented outlier samples and train with both inlier and outlier classes like before.
This makes sense, but because these outlier classes are fine-grained (fanned out), how do we deal with future unseen inputs? What if an OOD input doesn’t fall under any of these patterns?
One way to deal with it is to add that pattern as part of training (as seen already), but we can also devise a loss function that enables the network to output low probabilities for OOD inputs.
Okay, fine. But how is this done?
Consider there are N inlier and M outlier classes in this dermatology dataset. In addition to the individual probabilities of these N+M classes, create two more probabilities: one for all inliers and another for all outliers (which are simply the sums of the individual inlier and outlier probabilities, respectively). The ground truth for this new output will be [1, 0] for inliers and [0, 1] for outliers, depending on the input’s actual fine-grained class.
So there are two losses:
- one for the fine-grained classes (N+M): L_fine, over the individual probabilities.
- another for the coarse classes (just 2, inlier vs outlier): L_coarse, over the aggregated probabilities.
And the total loss is a weighted sum of these two.
Take a look at the below diagram to get a clearer picture.
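The two losses above can be sketched in NumPy as follows (the inlier-first class ordering, the `alpha` weighting and its 0.5 default are my assumptions for illustration; the paper’s exact formulation may differ):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def hod_loss(logits, target, num_inliers, alpha=0.5):
    # logits: raw scores for all N + M classes, inliers first.
    # target: index of the ground-truth fine-grained class.
    p = softmax(np.asarray(logits, dtype=float))

    # L_fine: cross-entropy over the individual N + M classes.
    l_fine = -np.log(p[target])

    # Aggregate into two coarse probabilities.
    p_inlier = p[:num_inliers].sum()
    p_outlier = p[num_inliers:].sum()

    # L_coarse: cross-entropy against [1, 0] for inliers
    # and [0, 1] for outliers.
    is_inlier = target < num_inliers
    l_coarse = -np.log(p_inlier if is_inlier else p_outlier)

    # Total loss: weighted sum of the two.
    return alpha * l_fine + (1.0 - alpha) * l_coarse
```

Note that a confidently correct fine-grained prediction drives both terms toward zero, since the coarse probabilities are just sums of the fine-grained ones.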
This is great. But how does this help in detecting ood inputs?
As mentioned in the paper, if there is an unseen test outlier input, the network might not output a high probability for any of the existing training outlier classes. But because of the coarse-grained loss, the aggregated probability over the outlier classes may still be relatively high. We can also look at it from the inlier perspective: the aggregated inlier probability will be high only if the input falls under one of the inlier classes.
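So at test time, a simple OOD score is the aggregated outlier probability itself. A sketch (again assuming the inlier classes come first in the logit vector):

```python
import numpy as np

def ood_score(logits, num_inliers):
    # Probability mass assigned to all outlier classes combined;
    # the higher this is, the more likely the input is OOD.
    z = np.asarray(logits, dtype=float)
    p = np.exp(z - z.max())
    p /= p.sum()
    return p[num_inliers:].sum()
```

An input can then be flagged as OOD when this score crosses a chosen threshold.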
Why the word “hierarchical”?
If we look at this aggregated loss formulation more closely, we can see that, in a way, this auxiliary loss helps the network learn to classify inputs into groups, and not just into one of the individual classes.
To make it more obvious: say there is a need to learn not just individual class patterns (classes could be car1, car2, bus1, bus2) but also to pick up the group patterns (a higher level in the hierarchy, such as car and bus in general); this hierarchical loss function will be helpful!
That’s all in this article on OOD detection in neural networks.
The solutions mentioned might be obvious to experienced DL practitioners, but I haven’t really seen any article online that emphasizes the importance of OOD detection and how it is done in practice.
And I would definitely like to know more about how people deal with this problem (please post your suggestions/solutions in the comments :))