Neural Networks Intuitions: 16. Segment Anything(SAM) — Paper Explanation

Raghul Asokan
7 min read · Apr 8, 2023


Hello everyone!

It’s good to be back again after a long break. I have been wanting to write more since my last article on ASH (for example: GPT, ViT, Flamingo, BLIP-2, or LLMs in general), but given my day-to-day office work, it is getting quite hard even to track the recent SOTA advances in the Vision and NLP communities, let alone write articles ;-). As someone rightly put it on Twitter: “Instead of asking for a 6-month moratorium on training Large Language Models (LLMs), ask for a 6-week pause on releasing (weekly) SOTA advances in Deep Learning :D”

Now, coming back to my series “Neural Networks Intuitions”: this time I will be covering a very recent SOTA object segmentation approach from FAIR, the Segment Anything Model (SAM). It not only excels at generic object segmentation but also highlights key aspects such as “model-in-the-loop data annotation” (and how supervised learning on labelled data beats any other approach on any given day), and it enables interactive applications in AR/VR.

As usual, let’s first understand the core problem we are dealing with and the solution that was proposed :)

Problem Statement:

Consider the problem of segmenting every object present in an image with the help of neural networks.

What are the possible ways of achieving this?

Label every object that is present in the image and train an object segmenter network on that data.

Okay, this seems fine. But will this approach work for unseen or new distributions that were not part of the training?

Mmmm, that’s going to be quite difficult, given what we know neural nets are actually capable of: they can only learn to map the kinds of inputs they have been trained on.

But what if we follow the trend of increasing the scale of data, network and compute — just as it has been done for LLMs(such as GPT-3.5, GPT-4 etc)?

Would the neural network be able to learn the general notion of what an object is(i.e. more precisely its boundaries) in an image?

The only way to truly know whether a network can learn this general notion of an object is to train a large-scale object segmentation network on a huge dataset and observe its zero-shot capabilities.

And this is (almost) exactly the approach SAM follows!

The above approach looks like a promising direction, but it still doesn’t guarantee that the network will generalize to unseen inputs. What SAM does differently is to extend CLIP’s idea of zero-shot transfer (with a text input prompt) and use various types of input prompts to generate output masks.

Let’s dive deep into the details of SAM, how it has been trained and its zero shot capabilities :)

Solution — SAM:

Segment Anything Model (SAM) is a transformer-based encoder-decoder architecture that learns to segment any object in an image given a prompt, which can be a point, a bbox, a mask or even text.

In the paper, the authors address three key questions:

  1. What task will enable zero-shot generalization?
  2. What is the corresponding model architecture?
  3. What data can power this task and model?

Let me talk about each of the above three aspects in detail as they encompass the core of what SAM is and why it works in practice.

1. Task:

In recent times, we have clearly observed that LLMs exhibit zero-shot or few-shot transfer capabilities when trained on large amounts of data with large amounts of compute.

Hence there has been a recent trend of vision-language architectures, such as CLIP and BLIP, being proposed to solve vision tasks with the help of text.

More precisely,

  1. CLIP reformulates the problem of image similarity search (or retrieval) into a “text in, image out” form: it learns not just image features but also the corresponding text features in a shared embedding space (a minimal zero-shot example follows this list).
  2. BLIP solves the problem of image captioning or visual question answering through an LLM decoder (which takes text and image embeddings together) that produces text captions or answers.
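To make CLIP’s zero-shot transfer concrete, here is a minimal sketch using OpenAI’s clip package; the image path “cat.jpg” and the two text prompts are placeholders of my own, not taken from the CLIP or SAM papers.

```python
import torch
import clip
from PIL import Image

# Load a pretrained CLIP model (ViT-B/32) and its preprocessing transform.
model, preprocess = clip.load("ViT-B/32", device="cpu")

# Encode one image and a few candidate text prompts into the shared embedding space.
image = preprocess(Image.open("cat.jpg")).unsqueeze(0)           # placeholder image path
text = clip.tokenize(["a photo of a cat", "a photo of a dog"])   # placeholder prompts

with torch.no_grad():
    logits_per_image, _ = model(image, text)                     # image-text similarity scores
    probs = logits_per_image.softmax(dim=-1)                     # zero-shot "classification"

print(probs)  # higher probability for the text that matches the image
```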

Inspired by the above ideas, SAM reformulates the task of object segmentation (input image, output mask) into: provide an input prompt (point, bbox, mask or text) and output object masks, hence the name promptable segmentation task.
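Put differently, the task defines a fixed input/output contract. The sketch below is my own illustration of that contract (the class and function names are not from the SAM codebase); SAM is simply one model trained to satisfy it for any reasonable prompt.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Prompt:
    """Illustrative container for the prompt types promptable segmentation accepts."""
    points: Optional[np.ndarray] = None   # (N, 2) pixel coordinates of clicks
    labels: Optional[np.ndarray] = None   # (N,) 1 = foreground, 0 = background
    box: Optional[np.ndarray] = None      # (4,) x1, y1, x2, y2
    mask: Optional[np.ndarray] = None     # a (low-resolution) mask from a previous step
    text: Optional[str] = None            # free-form text (exploratory in the paper)

def promptable_segmentation(image: np.ndarray, prompt: Prompt) -> np.ndarray:
    """The task contract: given any prompt, return at least one valid object mask."""
    raise NotImplementedError  # SAM is one (very good) implementation of this contract
```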

2. Model Architecture

SAM consists of an image encoder, a fast prompt encoder and a mask decoder.

Input: Image + Prompt(point, bbox, mask or text)

Output: Object Masks

SAM Model Architecture

a. Image encoder:

The image encoder block is a Masked Autoencoder (MAE) pretrained Vision Transformer (ViT) that takes in an input image and produces an image embedding. Given the nature of SAM’s promptable segmentation, this embedding needs to be computed only once per image and can then be reused across many different input prompts.
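The official segment_anything package exposes exactly this split. A minimal sketch, assuming the package is installed, the ViT-H checkpoint has been downloaded, and “example.jpg” is a placeholder image path:

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Build SAM from the downloaded ViT-H checkpoint and wrap it in a predictor.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.array(Image.open("example.jpg").convert("RGB"))   # placeholder image
predictor.set_image(image)        # the heavy ViT image encoder runs ONCE here

# Every subsequent prompt reuses the cached image embedding, so it is cheap and interactive.
for click in [(100, 150), (300, 200)]:                        # (x, y) pixel coordinates
    masks, scores, _ = predictor.predict(
        point_coords=np.array([click]),
        point_labels=np.array([1]),                           # 1 = foreground click
        multimask_output=True,
    )
```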

b. Prompt encoder:

The prompt encoder block considers two types of input prompts — sparse(point, text, bbox) and dense(mask).

Points and bboxes are represented by positional encodings, and for text the authors make use of CLIP’s text encoder. Dense prompts (masks) are embedded using convolutions and summed element-wise with the image embedding.
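A heavily simplified PyTorch sketch of this sparse/dense split is shown below; it only illustrates the idea (coordinates turned into positional encodings, masks downsampled by convolutions) and is not SAM’s actual prompt encoder.

```python
import math
import torch
import torch.nn as nn

class ToyPromptEncoder(nn.Module):
    """Simplified illustration of sparse vs. dense prompt embedding (not SAM's real code)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        # Random Fourier features: project (x, y) coordinates into a dim-sized embedding.
        self.register_buffer("pe_matrix", torch.randn(2, dim // 2))
        self.point_label = nn.Embedding(2, dim)        # 0 = background, 1 = foreground click
        # Small conv stack that maps a low-res mask onto the image-embedding grid.
        self.mask_convs = nn.Sequential(
            nn.Conv2d(1, dim // 4, kernel_size=2, stride=2), nn.GELU(),
            nn.Conv2d(dim // 4, dim, kernel_size=2, stride=2),
        )

    def encode_points(self, coords: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # coords: (B, N, 2) normalized to [0, 1]; labels: (B, N) long tensor
        proj = 2 * math.pi * coords @ self.pe_matrix   # (B, N, dim/2)
        pe = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return pe + self.point_label(labels)           # sparse embeddings, one per point

    def encode_mask(self, mask: torch.Tensor) -> torch.Tensor:
        # mask: (B, 1, 4H, 4W) -> dense embedding (B, dim, H, W),
        # which is then summed element-wise with the image embedding.
        return self.mask_convs(mask)
```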

c. Mask decoder:

The mask decoder block is a (modified) Transformer decoder that takes in the image embedding, the prompt embeddings and an output token, and produces the output mask.
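As a rough sketch of the idea only (the real decoder uses two-way attention, several output tokens and an IoU prediction head, none of which are reproduced here), the core mechanism is: tokens attend to the image embedding, and the output token is turned into weights that score an upscaled feature map.

```python
import torch
import torch.nn as nn

class ToyMaskDecoder(nn.Module):
    """Very simplified stand-in for SAM's mask decoder, for illustration only."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mask_token = nn.Parameter(torch.randn(1, 1, dim))      # learned output token
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.upscale = nn.Sequential(                               # 4x upsample the image features
            nn.ConvTranspose2d(dim, dim // 4, kernel_size=2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(dim // 4, dim // 8, kernel_size=2, stride=2),
        )
        self.hyper_mlp = nn.Linear(dim, dim // 8)                   # token -> per-pixel weights

    def forward(self, image_emb: torch.Tensor, prompt_emb: torch.Tensor) -> torch.Tensor:
        # image_emb: (B, dim, H, W); prompt_emb: (B, N, dim)
        B = image_emb.shape[0]
        tokens = torch.cat([self.mask_token.expand(B, -1, -1), prompt_emb], dim=1)
        img_seq = image_emb.flatten(2).permute(0, 2, 1)             # (B, H*W, dim)
        tokens, _ = self.cross_attn(tokens, img_seq, img_seq)       # tokens attend to the image
        up = self.upscale(image_emb)                                # (B, dim//8, 4H, 4W)
        weights = self.hyper_mlp(tokens[:, 0])                      # take the output token
        return torch.einsum("bc,bchw->bhw", weights, up)            # mask logits at 4x resolution
```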

3. Data

Data is the key to SAM’s ability to perform exceedingly well on unseen input distributions. As stated in the paper, the newly open-sourced SA-1B dataset consists of 11M diverse, high-resolution, licensed, and privacy-protecting images and a total of 1.1B high-quality segmentation masks (collected with their data engine). Note that there are no class labels as such; every mask is treated as an instance of a single common “object” class.

The very first step is data collection, where the authors work with a photography company to collect images from across the globe (spanning numerous countries). The image below shows the diversity of the images present in the SA-1B dataset.

In addition to the diversity, it is the sheer scale of the labelled data, both the number of images and the number of object masks, that drives SAM towards being a foundation model.

Data Engine:

We looked at the dataset statistics and its scale in the above section, but what does it really take to get such a large dataset annotated with such high quality?

The data engine, which is the major differentiator, has three stages:

  1. model-assisted manual annotation stage: in this first, manual stage, pretty much all of the annotation is done by trained annotators with assistance from SAM (pretrained on existing public segmentation datasets).
  2. semi-automatic stage, with a mix of automatically predicted masks and model-assisted annotation: as more and more images are labelled, SAM is retrained on them. This stage also focuses on collecting more diverse masks, so mask confidence is used to detect images containing less prominent objects, and annotators are asked to annotate those.
  3. fully automatic stage: in this stage, all of the annotations are generated automatically by SAM.

This data engine is the real core piece that enables SAM to be a foundation model. The whole process, from collecting diverse, large-scale imagery in the first place to using a model-in-the-loop annotation strategy to build a large-scale dataset, is something all AI startups and large companies can take inspiration from.
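At a very high level, the loop looks something like the pseudocode below; every function name is illustrative (none of this comes from the SAM codebase), and the “refine” step is of course performed by human annotators, not code.

```python
from typing import Any, Callable, Dict, List

def data_engine(
    images: List[Any],
    predict: Callable[[Any], List[Dict]],             # model proposes masks for an image
    refine: Callable[[Any, List[Dict]], List[Dict]],  # humans correct / add missing masks
    retrain: Callable[[List[Dict]], Callable[[Any], List[Dict]]],  # returns a better model
    rounds: int = 3,
) -> List[Dict]:
    """Hypothetical model-in-the-loop annotation loop (stages 1 and 2), then auto-labeling."""
    dataset: List[Dict] = []
    for _ in range(rounds):
        for image in images:
            proposals = predict(image)                 # model-assisted pre-annotation
            dataset.extend(refine(image, proposals))   # annotators fix what the model missed
        predict = retrain(dataset)                     # the model improves every round
    # Final, fully automatic stage: label everything with the strongest model.
    for image in images:
        dataset.extend(predict(image))
    return dataset
```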

Zero Shot Capabilities of SAM:

Once SAM is trained on the SA-1B dataset, inference can be run in four ways (a short usage sketch follows the list below):

  1. single point based: given a single input point as the prompt, SAM combines that point’s positional encoding with the image embedding and generates the best, unambiguous mask for the point (returning multiple candidate masks when the point is ambiguous).
  2. bbox based: instead of a single point, the user can input one or more bboxes, and similar to the point-based mode, SAM produces the best mask for each given bbox.
  3. automatic (a grid of points over the whole image): because SAM has not been trained to produce masks from an image alone (it always expects points, bboxes etc. as input), there has to be a way to generate masks for the entire image in the absence of a user prompt. To achieve this, a 32x32 grid of points distributed over the image is used as the prompt, masks are generated for every point, and duplicates are removed with NMS.
  4. text based: although this prompt type is not fully robust and still largely unexplored, SAM allows a text prompt and generates a mask for the object the text refers to (e.g. the prompt “cat” produces a mask around the cat). This is made possible by using CLIP’s image and text embeddings.
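Here is what the first three modes look like with the official segment_anything package; the checkpoint filename, image path and coordinates are placeholders, and the text mode is exploratory in the paper and (to my knowledge) not part of the released code.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor, SamAutomaticMaskGenerator

# Assumes a downloaded ViT-B checkpoint; "example.jpg" is a placeholder image.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
image = np.array(Image.open("example.jpg").convert("RGB"))

predictor = SamPredictor(sam)
predictor.set_image(image)

# 1. single point prompt (x, y): ask for multiple masks to resolve ambiguity
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)

# 2. bbox prompt (x1, y1, x2, y2)
masks, _, _ = predictor.predict(
    box=np.array([425, 600, 700, 875]),
    multimask_output=False,
)

# 3. fully automatic: a 32x32 grid of point prompts + NMS to remove duplicates
generator = SamAutomaticMaskGenerator(sam, points_per_side=32)
all_masks = generator.generate(image)   # list of dicts: 'segmentation', 'predicted_iou', ...
```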

Overall, the authors show that SAM does perform well on unseen datasets and is comparable to its supervised counterparts.

Applications of Promptable Segmentation task:

Given SAM’s promptable nature, it can be applied in real time in AR applications, and also in data-annotation tools, where instant segments/masks can make labelling considerably quicker.

Limitations of SAM and what’s next:

SAM definitely performs well on new, novel datasets and shows strong generalization, but a major part of its success can be attributed to its large-scale supervised training. This is yet another example that clearly shows supervised training (on clean, large-scale labelled data rather than just noisy web data) wins any day.

One of the key next steps would be to see how well SAM can be extended to text prompts. Given the recent success of LLMs and how well they are generalizing to novel tasks, it would be quite interesting to reformulate detection and segmentation tasks with the help of text.

That’s all in this article on SAM and on why large-scale supervised training shouldn’t be overlooked. As the authors state, whether or not SAM should be considered a foundation model is left to the community and to how it gets applied/extended to various downstream tasks (similar to CLIP).

I hope I have given a clear gist of what SAM is and its applications. Will soon be back with a Vision-Text based SOTA paper in my next article. See you folks until then. Cheers :)
