Neural Networks Intuitions: 14. Active Learning
Hello everyone!
Welcome back to my series “Neural Networks Intuitions”. In this fourteenth segment, I will be talking about a key concept in Machine Learning that has a direct impact on building real-world, scalable ML/DL systems: Active Learning. It helps speed up the labelling process as well as build accurate models that can also handle drift.
Before looking into what Active Learning is, let us first understand the problem from a real-world perspective, and see where Active Learning fits in and how it helps in building scalable ML systems.
Problem:
Consider a real-world setting where one has to build an ML system to solve computer vision tasks; at present, one of the best ways to solve CV tasks is using neural networks.
And as we all know, neural networks are data hungry and require labelled data in abundance to train.
Naturally, this poses a challenge in building these computer vision systems, as this labelled data is most of the time obtained through manual labour, and manual labelling is time-consuming and expensive.
To add to the complexity, these networks have to be retrained periodically with more diverse data to handle the data drift that occurs in real-world use cases.
Now let us consider a scenario where there already exists a CV-based solution in production and, due to the ever-changing data in the real world, the system’s accuracy degrades over time. Hence a requirement arises where this system has to be improved, and one way to do so is by retraining these networks on the new data.
Assume there is abundant unlabelled data available and the requirement is to manually label it (though not all of it). How can we speed up this process?
There are multiple ways to solve this problem of speeding up manual labelling:
- The first, straightforward approach is to pseudo-label the unlabelled dataset using the already trained model.
- Another is to make use of Self-Supervised Learning (SSL) techniques to learn representations from this unlabelled set and apply pseudo labels later.
Both are followed by manual correction/tagging (and there are a few other ways to optimize labelling).
But these approaches still do not account for a practical difficulty: one cannot exhaustively tag all of the images in the unlabelled set. Depending on the use case and time constraints (in a real-world setting), only a subset of these images is usually labelled.
Most of the time, this tagged subset is randomly sampled from the total set, and there is a high chance that it does not cover the variations/distribution shift that are exactly needed to improve the network’s accuracy in upcoming iterations and handle the drift.
So what if there existed an optimal (or at least better) sampling strategy that retrieves the most informative images, which, when used to train the network, lead to on-par or even better performance compared to training the same network on all of the data (or a randomly sampled set)?
This is exactly what Active Learning is trying to solve :)
Solution: Active Learning
Active Learning, simply put, aims at maximizing the performance of machine learning models with the fewest annotated samples, or, in other words, at finding the most informative images in an unlabelled pool of data.
Now that we know what Active Learning is, let’s get into the most interesting part: how this sampling of the most informative images/data is done.
Problem: Finding the most informative images in an unlabelled set:
Firstly, what counts as an informative image differs wrt. the task at hand i.e. classification, detection or segmentation.
Why does it differ?
Compare a classification task with a detection task: the informative images for detection depend on more specific local regions in the image, whereas classification uses the whole image.
Therefore, the definition of informative images changes depending on the task at hand.
Now how can we say whether an image is informative (or not) wrt. neural networks?
One way to find out is this: if the neural network is uncertain about its prediction for a given image, then the image can be termed informative. Hence, in essence, the problem boils down to estimating the uncertainty in the network’s predictions :)
But there is a catch here. How do we enable uncertainty estimation in neural networks, given that, as we already know, NNs’ softmax confidence is not reliable? (For more info on this topic, please check out my previous blog on OOD Detection.)
Let’s look at the next section to find out how this is made possible!
Bayesian Neural Networks — Uncertainty Estimation in NNs
Solution:
1. Consider an image classification network M already trained on a labelled set A, and let there exist an unlabelled data pool B. The network was trained on A with dropout before every weight layer.
2. Pick every image from pool B and pass it through the network M with dropout enabled at test time. Repeat the forward pass for the same image N times, with a different random dropout mask sampled on each pass.
3. Repeat step 2 for all images in the pool B.
4. Based on the network’s predictions across passes, a certainty value is calculated for each image. This tells us how informative the image is.
The above method is known as Monte-Carlo Dropout (MC Dropout).
But why does this method help in estimating uncertainty?
The reason MC Dropout works is that dropout is applied randomly to the network on every forward pass (for the same image): the more the outputs differ across forward passes, the more confused the network is in mapping the incoming pattern to its respective class. This confusion (or uncertainty, to be more specific) tends to be higher for unseen patterns or patterns whose (pixel) distribution differs from that of the train set.
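The steps above can be sketched in a few lines of NumPy. Note this is only an illustration: the two-layer "network" (W1, W2), its weights and the input x are random stand-ins for a trained model and a real image, and a standard inverted-dropout mask plays the role of the dropout layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer classifier; W1, W2 and the input x are random stand-ins
# for a trained network and a real image (illustration only).
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(8, 3))
x = rng.normal(size=16)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_with_dropout(x, p=0.5):
    """One stochastic forward pass: dropout stays ON at test time."""
    h = np.maximum(x @ W1, 0.0)          # ReLU hidden layer
    mask = rng.random(h.shape) >= p      # drop each unit with probability p
    h = h * mask / (1.0 - p)             # inverted-dropout scaling
    return softmax(h @ W2)

# N stochastic passes over the SAME image give N (different) predictions
N = 20
preds = np.stack([forward_with_dropout(x) for _ in range(N)])  # shape (N, 3)
```

In a framework like PyTorch, the same effect is achieved by keeping the dropout layers in training mode at inference time while the rest of the network stays in eval mode.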
Once the network’s predictions are obtained, the next step is to compute the certainty value based on the predictions.
Calculating prediction certainty:
Now that we are clear on what MC Dropout is, let us look at how we can calculate prediction certainty.
- For every image, do N forward passes (each with a different random dropout mask) and get the prediction probabilities for every pass.
- Compute the variance of these prediction probabilities across all N passes.
- If the variance is zero or below a threshold, the network’s prediction can be considered certain.
As already mentioned above, a high variance (in simple terms, very different network outputs) for the same image across multiple forward passes indicates that the image is highly informative.
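A minimal sketch of this variance-based score, assuming we already have the stacked softmax outputs from the N MC Dropout passes (the threshold value here is purely illustrative and would be tuned per task):

```python
import numpy as np

def mc_dropout_uncertainty(preds):
    """preds: (N, num_classes) softmax outputs from N stochastic passes.
    Returns the mean per-class variance: low value => certain prediction."""
    return preds.var(axis=0).mean()

# A prediction that is stable across passes -> near-zero variance (certain)
stable = np.tile([0.9, 0.05, 0.05], (10, 1))
# A prediction that flips between classes -> high variance (informative)
flippy = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]] * 5)

threshold = 0.01  # illustrative cut-off, tuned per task in practice
print(mc_dropout_uncertainty(stable) < threshold)   # certain image
print(mc_dropout_uncertainty(flippy) > threshold)   # uncertain, informative
```

Images whose score exceeds the threshold are the ones handed to the annotator first.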
*Note that I have not covered in depth how this certainty value is calculated, nor other AL techniques such as density-based methods (please check the References section).
Certainty estimation, as seen so far, is directly applicable to image classification tasks. When it comes to object detection or instance segmentation, one can consider not just the class outputs but also the localization (spatial) outputs when computing the certainty value.
Deep Active Learning(DAL) in a real-world setting:
We have seen Active Learning from a theoretical standpoint. Let us now apply it in a real-world setting and understand where it can actually make a difference.
Traditionally, DAL is applied as follows:
1. Assume there is a model already trained for a particular task and there is a requirement to tag new unlabelled images to handle drift.
2. Use this model to retrieve a subset of highly uncertain samples (based on a sampling strategy) from the pool, pass them on to a manual labeller (the oracle, as mentioned in the below image) and create a new labelled train set A.
3. Add this train set A to the pre-existing train set (on which the model was previously trained) and retrain the model.
4. Remove the newly labelled set A from the unlabelled pool.
5. Repeat steps 2–4 for N iterations.
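The loop above can be sketched as follows. The `uncertainty` and `oracle_label` functions are stand-ins (in practice the first would be the MC Dropout variance from earlier and the second a human annotator), and retraining is left as a placeholder:

```python
import numpy as np

rng = np.random.default_rng(42)

def uncertainty(model, image):
    """Stand-in scorer: in practice, the MC Dropout variance."""
    return rng.random()

def oracle_label(image):
    """Stand-in for the human annotator (the oracle)."""
    return 0

def active_learning_loop(model, labeled, pool, budget_per_round=8, rounds=3):
    for _ in range(rounds):
        # 1) score every unlabelled image, most uncertain first
        ranked = sorted(pool, key=lambda img: uncertainty(model, img),
                        reverse=True)
        # 2) send only the top-k to the oracle for manual labelling
        batch = ranked[:budget_per_round]
        labeled += [(img, oracle_label(img)) for img in batch]
        # 3) remove the newly labelled images from the pool
        pool = [img for img in pool if img not in batch]
        # 4) retrain on the grown labelled set (placeholder)
        # model = retrain(model, labeled)
    return labeled, pool

labeled, pool = active_learning_loop(model=None, labeled=[],
                                     pool=list(range(100)))
print(len(labeled), len(pool))  # 24 labelled, 76 remaining
```

The key point is step 2: only the budgeted top-k most uncertain samples reach the annotator each round, instead of a random subset.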
An optimized version of the above procedure can be achieved by introducing a pseudo-labelling step in between:
A more practical approach is to first pseudo-label the unlabelled pool, then sort the pseudo-labelled images with the most informative samples first, and only then label them manually (correcting the pseudo labels).
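A tiny sketch of that ordering step, with `pseudo_label` and `uncertainty` again as illustrative stand-ins for the trained model's prediction and the MC Dropout score:

```python
import numpy as np

rng = np.random.default_rng(1)

def pseudo_label(model, image):
    """Stand-in: the trained model's predicted class for the image."""
    return int(rng.integers(0, 3))

def uncertainty(model, image):
    """Stand-in: e.g. the MC Dropout variance for the image."""
    return float(rng.random())

def queue_for_correction(model, pool):
    # Pseudo-label every image, then put the most informative first so
    # annotators spend their time correcting the hardest examples.
    queue = [(img, pseudo_label(model, img), uncertainty(model, img))
             for img in pool]
    queue.sort(key=lambda item: item[2], reverse=True)
    return queue

queue = queue_for_correction(model=None, pool=list(range(10)))
```

Annotators then walk the queue top-down, fixing pseudo labels rather than labelling from scratch.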
Active Learning aiding in Out-of-Distribution(OOD) Detection
I have talked about what out-of-distribution inputs are and how we could detect such OOD inputs wrt. neural networks in my previous article.
There are various ways to tackle OOD detection, right from label smoothing to training with an explicit outlier class. Uncertainty estimation is another robust mechanism for identifying such OOD inputs in neural networks.
But does uncertainty estimation also tell us whether the incoming input pattern is totally unseen, i.e. not part of the training set?
No. In general, OOD can mean either that the input is unseen/not part of the train set, or that there is a shift in the distribution of known patterns.
Uncertainty estimation gives information about an incoming sample based on the network’s prediction. Hence, any uncertain sample can be thought of as either an unknown pattern or a known pattern with distribution shift, but uncertainty alone does not tell us which.
That’s all in this article on Active Learning and its application in real-world scenarios. I hope this gives good intuition about what the technique is, where it can be applied, and how overall labelling time and network robustness can be improved :)
Cheers!