Neural Networks Intuitions: 15. ASH — Paper Explanation (OOD Detection: Part 2)

Raghul Asokan
5 min read · Nov 13, 2022


Hello All!

It’s been quite some time since I last wrote a post in my series “Neural Networks Intuitions”. This time, I am back with a very simple, interesting, yet effective technique in deep learning that has a direct impact on building real-world computer vision systems.

My last couple of articles focused on the problem of “How do neural nets know what they don’t know?”, a.k.a. Out-of-Distribution (OOD) Detection. There are a good number of algorithms for handling this problem, although their effectiveness in real-world scenarios has always been a concern.

In today’s blogpost, I will be covering a recent SOTA paper on detecting OOD inputs in neural nets — Extremely Simple Activation Shaping for Out-of-Distribution Detection.

Let’s dive deep into the paper idea :)

Revisiting the Problem Statement: Out-of-Distribution Detection

Out-of-distribution detection in neural networks is quite a challenging task, as it is well understood in learning theory that generalization (specifically in neural nets) is essentially interpolation within the training distribution.

Therefore, neural networks, when given an OOD input (i.e. something not part of the training distribution), map it to one of the existing known classes. This is a major pain point when networks are deployed in real-world settings, where the chances of observing unknown samples are high, causing them to fail miserably.

But there are a few common ways to tackle this problem, a few being:

  1. Continuously retraining neural nets with broader data coverage.
  2. Uncertainty estimation methods such as Monte-Carlo Dropout, among others.

You can go through more of these techniques from my previous articles :)

ID: In-Distribution, OOD: Out-of-Distribution.

Solution — Extremely Simple Activation Shaping for Out-of-Distribution Detection (ASH):

The algorithm is as follows:

  1. Consider a neural network already trained on a dataset D with N classes (classification task).
  2. At test time/inference, pick a late feature map in the network (usually the penultimate layer’s) and zero out/prune the lowest p% of its activations.
  3. After pruning, there are three options:
    a. use the feature map as is → ASH-P (pruning)
    b. set every non-zero value to a constant (sum of activations before pruning ÷ number of unpruned activations) → ASH-B (binarizing)
    c. multiply every non-zero value by e^(s1/s2), where s1 = sum of activations before pruning and s2 = sum of activations after pruning → ASH-S (scaling)
  4. Continue the forward pass to get two outputs:
    a. the regular softmax probability, giving the ID class label
    b. the energy score from the raw logits
  5. A higher energy score usually represents an ID input, while a lower energy score indicates OOD — depending on the threshold set.
Source: https://arxiv.org/abs/2209.09858
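The three variants above can be sketched in a few lines of NumPy. This is a simplified illustration operating on a flattened activation vector (the paper applies the operation to a feature map at a chosen layer inside the network); the function name and the default pruning percentile are my own choices for the sketch, not from the paper.

```python
import numpy as np

def ash(activations, percentile=90, variant="s"):
    """Apply ASH to a (flattened) activation vector: zero out the lowest
    `percentile` percent of activations, then optionally reshape the rest."""
    a = np.asarray(activations, dtype=float)
    s1 = a.sum()                                # sum of activations before pruning
    threshold = np.percentile(a, percentile)
    pruned = np.where(a >= threshold, a, 0.0)   # zero out low-value activations
    s2 = pruned.sum()                           # sum of activations after pruning
    if variant == "p":                          # ASH-P: pruning only
        return pruned
    if variant == "b":                          # ASH-B: binarize survivors to s1/n
        n = np.count_nonzero(pruned)
        return np.where(pruned > 0, s1 / n, 0.0)
    if variant == "s":                          # ASH-S: scale survivors by e^(s1/s2)
        return pruned * np.exp(s1 / s2)
    raise ValueError(f"unknown variant: {variant}")
```

For example, with `percentile=90` only the top 10% of activations survive; ASH-S then scales them up by e^(s1/s2), compensating for the activation mass that was removed.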

What is energy score?

The energy score is computed from the raw logits as a temperature-scaled log-sum-exp, score(x) = T · log Σᵢ e^(fᵢ(x)/T). Based on experiments, energy scores are shown to outperform softmax scores when it comes to OOD detection, and ASH also defaults to using energy scores.

To put it in simple words: at test time, a feature map (usually the penultimate one) is pruned/shaped and then passed forward to compute the energy and softmax scores.
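As a sketch, the energy score described above can be computed from the logits directly. This follows the sign convention used in this post (higher score = more ID-like); the helper name is illustrative, and in practice the threshold separating ID from OOD is tuned on held-out data.

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Energy score: T * logsumexp(logits / T).
    Higher -> more ID-like; lower -> more OOD-like."""
    z = np.asarray(logits, dtype=float) / T
    m = z.max()  # subtract the max for numerical stability
    return T * (m + np.log(np.exp(z - m).sum()))

# A confident, peaked prediction yields a higher score than a flat one:
id_like  = energy_score([10.0, 0.0, 0.0])  # ~10.0
ood_like = energy_score([1.0, 1.0, 1.0])   # 1 + ln(3) ~ 2.10
```

An input would then be flagged as OOD whenever `energy_score(logits)` falls below the chosen threshold.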

ASH Intuition:

We saw in the previous section what the algorithm is and what happens during inference. But the big questions here are:

How does this help in identifying OOD inputs?

How does this not affect (in a major way) the accuracy of neural nets on ID data?

As one can see, the algorithm is quite simple. It drops the low-value activations that contribute very little to the overall prediction for the incoming sample, and hence preserves the network’s accuracy/performance on the actual in-distribution data.

But by doing this, it also indirectly shapes the energy score, which represents the certainty of the prediction.

But “why does ASH indirectly help in OOD detection?” is still an open question — as mentioned by the authors themselves.

An intuition one can offer is that ASH is in a way similar to Monte-Carlo Dropout, where dropout is applied multiple times (randomly) at test time and the variance in outputs across forward passes is computed, which helps estimate the overall output certainty/uncertainty.

That said, ASH still has its own novelty, wherein it combines the concept of dropout (with additional changes to the feature maps) with the energy score to detect out-of-distribution inputs very effectively.

ASH vs the rest:

One of the major advantages of ASH over its counterparts such as DICE and ReAct is its ability to preserve ID accuracy while detecting OOD data.

This is a crucial aspect in the context of its usage in real world systems and hence makes it stand out from the rest.

Another advantage is that ASH doesn’t require any changes to the network or the training methodology; it can be applied directly during inference.

As one can see below, the top-right region of the graph is the ideal place to be, as both ID accuracy and OOD detection performance are highest there — and that is exactly what ASH achieves.

Source: https://arxiv.org/abs/2209.09858

ASH — Applications:

Besides presenting us with an accurate methodology to flag OOD/unseen samples, ASH also helps in:

  1. Debugging neural networks — an image of a known class getting flagged as OOD, or an image of an unknown class getting flagged as ID, can provide valuable insights into what the model is learning.
  2. Sample selection — data falling in the low energy-score range (OOD or near-OOD) is usually highly informative and can help in training robust classifiers.
  3. An efficient/accurate way of doing online training.

and a lot more!

That’s all in this second segment on OOD detection. Hope this article provided a good understanding of the ASH algorithm, the intuition behind it, and its real-world usage.

ASH can actually be quite an effective tool to deploy in robust, accurate CV systems, while also ensuring optimal monitoring of those systems in real-world scenarios.

Cheers!
