Neural Networks Intuitions: 12. Self-Training and MuST

Raghul Asokan
Oct 3, 2021

Multi-Task Self-Training for Learning General Representations

Hello folks :)

It’s good to be back again after a long time. In this short edition of my series Neural Networks Intuitions, I would like to explain what self-training is and then talk about a very recent and interesting paper, Multi-Task Self-Training for Learning General Representations, which shows SOTA transfer learning results by combining self-training and multi-task learning, beating supervised and self-supervised pretraining methods.

Before getting into the details of this paper, let us first try to understand what self-training is!

Self-Training:

Self-training is the process of using a trained model to label data (pseudo-labels) and then retraining the model on the labelled + pseudo-labelled data. Pseudo-labelling is the process of generating labelled data using a trained model; the labels generated by the model are called pseudo-labels.

Why do self-training?

It is an important step that helps optimize the end-to-end data labelling process.

Let’s consider getting data tagged in multiple iterations (which is the usual case in a Machine Learning project). One can train a model on the data tagged in the first few iterations and then use that model to generate pseudo-labels for the new unlabelled set. This significantly reduces the time it takes to tag the new unlabelled images.

Is self-training only used to speed up data labelling or does it help in any other problems?

In addition to speeding up labelling, self-training in general also helps improve the overall model accuracy.

How does it improve model accuracy?

Assuming there is limited labelled data along with abundant unlabelled data for a given task, one can train a model on the limited labelled data, generate more training data from the unlabelled set through pseudo-labelling, and then retrain the model on the original labelled set + the pseudo-labelled set. Naturally this produces more data for the network to train on, which in turn helps it generalize better (despite the noise introduced by pseudo-labels).
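To make this concrete, here is a minimal self-training sketch in Python using scikit-learn. The classifier, the 0.9 confidence threshold and the synthetic data are my own stand-ins and not from the article; only the overall loop (train on the labelled set, pseudo-label the unlabelled pool, retrain on the union) is the idea described above.

```python
# Minimal self-training sketch (illustrative only, not the paper's code).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a small labelled set + a large unlabelled pool.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_labelled, y_labelled = X[:200], y[:200]
X_unlabelled = X[200:]

# 1. Train a teacher on the limited labelled data.
teacher = LogisticRegression(max_iter=1000).fit(X_labelled, y_labelled)

# 2. Pseudo-label the unlabelled pool, keeping only confident predictions.
probs = teacher.predict_proba(X_unlabelled)
confident = probs.max(axis=1) > 0.9
pseudo_X = X_unlabelled[confident]
pseudo_y = probs[confident].argmax(axis=1)

# 3. Retrain a student on labelled + pseudo-labelled data.
student = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_labelled, pseudo_X]),
    np.concatenate([y_labelled, pseudo_y]),
)
```

In practice this loop can be repeated: the improved student re-generates pseudo-labels, which are then used for another round of training.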

This is all great but what does this have to do with MuST? Let’s look at it in the next section!

Multi-Task Self-Training for Learning General Representations(MuST):

Problem:

The most common approach to training a network (on any task) is to start from ImageNet-pretrained weights and then finetune the network (either the whole network or only the heads) on the downstream task. Later, with advancements in Self-Supervised Learning (SSL), self-supervised methods have been used to learn more general representations. This has also led to replacing supervised ImageNet pretraining with SSL on ImageNet, or even on the downstream task itself, which gives better transfer learning performance.

But there are still open questions with SSL: how such models are trained, and whether they can work well across a multitude of tasks and not just classification or detection.

Solution — MuST:

In order to learn more general representations that transfer well across many different tasks (and not just one), the authors introduce MuST, which trains a single multi-task model (the student) using supervised labels and pseudo-labels from specialized teacher models.

Algorithm:

  1. Consider four different tasks: classification, detection, segmentation and depth estimation. Each task has a corresponding labelled dataset.
  2. Train four networks, one for each dataset. These are called the specialized teacher networks.
  3. Now, for each of the above datasets, generate labels for all four tasks using both the supervised labels and pseudo-labels from the teacher models. For example, take the ImageNet classification dataset: using the teachers trained on the detection, segmentation and depth estimation datasets, generate detection, segmentation and depth pseudo-labels for ImageNet (and use the existing supervised classification labels). Repeat the same for all four datasets.
  4. Train a single multi-task student network that can perform classification, detection, segmentation and depth estimation at once.

Using the student network from (4) as the initialization for downstream tasks improves performance, beating supervised and self-supervised pretraining methods.
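Here is a rough sketch of steps 3 and 4 in PyTorch. This is not the authors' implementation: the tiny backbone, the toy heads, losses and dummy batch are hypothetical stand-ins, and the detection head is omitted to keep it short; only the overall flow (frozen teachers fill in pseudo-labels for the missing tasks, a shared-backbone student trains on all tasks at once) follows the paper.

```python
# Rough sketch of MuST steps 3-4 (illustrative stand-ins, not the paper's code).
import torch
import torch.nn as nn

class MultiTaskStudent(nn.Module):
    def __init__(self, feat_dim=128, num_classes=10, hw=32 * 32):
        super().__init__()
        # Shared backbone: this is where the general representation is learnt.
        self.backbone = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 32 * 32, feat_dim), nn.ReLU())
        # One head per task (simplified to linear layers here).
        self.heads = nn.ModuleDict({
            "cls": nn.Linear(feat_dim, num_classes),  # classification logits
            "seg": nn.Linear(feat_dim, hw),           # toy per-pixel logits
            "depth": nn.Linear(feat_dim, hw),         # toy per-pixel depth
        })

    def forward(self, x):
        feats = self.backbone(x)
        return {task: head(feats) for task, head in self.heads.items()}

def must_targets(images, supervised, teachers):
    """Step 3: keep supervised labels, fill the missing tasks with pseudo-labels."""
    targets = dict(supervised)
    with torch.no_grad():
        for task, teacher in teachers.items():
            if task not in targets:
                targets[task] = teacher(images)  # pseudo-labels from frozen teacher
    return targets

# Toy usage: a "classification" batch plus frozen seg/depth teachers (stand-ins).
images = torch.randn(8, 3, 32, 32)
supervised = {"cls": torch.randint(0, 10, (8,))}
teachers = {
    "seg": lambda x: torch.randint(0, 2, (x.size(0), 32 * 32)).float(),
    "depth": lambda x: torch.rand(x.size(0), 32 * 32),
}

student = MultiTaskStudent()
outputs = student(images)
targets = must_targets(images, supervised, teachers)

# Step 4: one loss per head, summed across the tasks sharing the backbone.
loss = (nn.functional.cross_entropy(outputs["cls"], targets["cls"])
        + nn.functional.binary_cross_entropy_with_logits(outputs["seg"], targets["seg"])
        + nn.functional.mse_loss(outputs["depth"], targets["depth"]))
loss.backward()
```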

MuST Overview (https://arxiv.org/pdf/2108.11353.pdf)

The following table shows how MuST pretraining beats ImageNet SSL and supervised pretraining on a variety of tasks:

https://arxiv.org/pdf/2108.11353.pdf

They also show that including pseudo-labels for different tasks improves transfer learning performance across multiple tasks.

All the above results show that MuST can be a very effective pretraining strategy in comparison to supervised and SSL methods.

With more data and a higher variety of tasks, MuST should be able to learn even more general representations and further improve transfer learning performance.

Why does it work?

A multi-headed network (for multi-task learning) shares a common backbone. When such a network is trained to discriminate among classes, to localize and segment objects in the image, and to estimate depth, the representations learnt by the backbone become more general.
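As a small illustration of why this shared backbone is useful for transfer, here is a hypothetical continuation of the sketch above: once a multi-task student is trained, its backbone can be reused as the initialization for a new downstream task. The 5-class head and feature sizes below are made up for illustration.

```python
# Transfer sketch (hypothetical): reuse a MuST-style pretrained backbone.
import torch.nn as nn

# Pretend this backbone carries MuST-pretrained weights (in the sketch above,
# it would be the student's shared backbone).
pretrained_backbone = nn.Sequential(
    nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())

# Attach a fresh head for the downstream task. Either finetune the whole
# network, or freeze the backbone and train only the new head.
downstream_model = nn.Sequential(pretrained_backbone, nn.Linear(128, 5))
for p in pretrained_backbone.parameters():
    p.requires_grad = False  # example: linear-probe style transfer
```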

In the end, the authors also mention that combining self-training and self-supervised learning can be a promising direction.

That’s all in this short article on MuST, a simple but effective approach to pretraining. I hope you folks got a good idea of what MuST is and how to make use of it to boost the accuracy of neural networks.

Cheers!
