Neural Networks Intuitions: 17. BLIP series — BLIP, BLIP-2 and InstructBLIP — Papers Explanation

Raghul Asokan
13 min read · Aug 2, 2023

Hello Everyone!

In this article in my series "Neural Networks Intuitions", I will be talking about one of the interesting yet challenging areas of building general-purpose vision algorithms that can generalize to diverse, unseen tasks. The last few months have been great for the NLP community given the recent success of the GPT-4 models, which show the true power of Large Language Models (LLMs) and their strong emergent capabilities to solve novel tasks that are not part of the training set.

The recent version of ChatGPT (with a GPT-4 backend) is multi-modal, allowing image inputs to be fed in and described, but there is not much visibility into the data it has been trained on or its important architectural aspects (I haven't read the LLaMA-2 paper yet :)), thereby making it a black box. On the other side, there is a stream of great papers published by Salesforce that excel at various vision-language tasks such as image captioning, visual question answering, image-to-text retrieval, etc.

This set of papers, under the common umbrella of BLIP (Bootstrapping Language-Image Pre-training), lays out novel ways of unifying image and text features, instruction-tuning LLMs with image inputs, and bootstrapping datasets. The BLIP series of algorithms exhibits strong SOTA performance on a wide range of vision-language tasks.

Let’s dive deep into all of these ideas and understand the core concepts better!

Overview:

Essentially, the BLIP series started off with the paper (1) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, which introduced a novel way of incorporating image features into text, enabling better unified vision-and-text representation learning.

Next, (2) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models introduced an efficient strategy to make use of off-the-shelf image encoders and LLMs for vision-language pretraining without affecting these models' generalization ability.

Finally, the most recent paper, (3) InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, focused on the instruction-tuning aspect and showed that instruction-tuned LLMs coupled with image inputs exhibit much better zero-shot capabilities than multi-task networks or unimodal/multi-modal networks trained without instructions.

But before jumping directly into these papers and their architectural details, let us first take a step back and understand what the problem is and why BLIP can serve as an effective solution for it :)

Problem Statement:

Building general-purpose computer vision models for various vision tasks such as image classification, detection and segmentation has been a long-standing problem, and quite a challenging one to say the least.

In the quest to solve this problem, numerous approaches/algorithms came up, some focusing on the data aspect and others on the algorithm aspect. A few years ago, a very cool piece of research around tying image and text features together with a contrastive learning objective, CLIP, made significant advancements in image recognition (especially image classification/retrieval to start with).

CLIP was one of the first papers (if not the first) to successfully incorporate text features for vision tasks (in a unimodal fashion), perform large-scale pretraining and show strong zero-shot performance on a variety of unseen tasks.

While various attempts were made to unify image and text features (unlike CLIP, which is unimodal in the sense that it pretrains and uses image and text features independently), there were also major breakthroughs in NLP where Large Language Models (LLMs) exhibited excellent text generation capabilities, the standout being their emergent ability to generalize to novel/unseen inputs or tasks.

Solution:

Naturally, there arose a need for a unified vision-and-language framework and, on top of that, a way to couple it with LLMs, enabling vision solutions to exploit LLMs' strong generalization ability.

The BLIP family is one among many recent approaches that propose a framework for robust and accurate image-text understanding, leading us towards general-purpose vision models that don't require much finetuning.

BLIP

Motivation:

There have been recent approaches that work well separately on image-text understanding (multi-modal representation learning for retrieval) using encoder-based models (like CLIP), or on text generation given images (like image captioning) using encoder-decoder models. However, applying encoder-based models to text generation tasks, or encoder-decoder models to image-text retrieval tasks, hasn't been very successful.

What if there were one unified framework that could solve both image-text retrieval tasks and text generation tasks conditioned on an image?

BLIP solves precisely this problem by defining parallel, weight-sharing branches and multi-task pretraining objectives that excel at both vision-language understanding and vision-conditioned language generation tasks.

In addition, most vision-language pretraining makes use of large-scale noisy image-text pairs from the internet, which is sub-optimal and hurts accuracy.

Therefore, BLIP also proposes a Captioner-Filter (CapFilt) framework that generates highly accurate synthetic captions with the help of a self-distillation-like procedure.

Now that we are clear on what BLIP is solving, let's look in depth at the architecture proposed to achieve it.

*Unimodal: features learned by making use of only one modality, e.g. the image embeddings or text embeddings that come out of CLIP separately.

*Multimodal: features learned by making use of multiple modalities, e.g. image + text, text + audio, etc. A tiny sketch contrasting the two is shown below.
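A tiny PyTorch sketch to make the distinction concrete (purely illustrative; the fusion MLP below is a stand-in, not CLIP's or BLIP's actual architecture):

```python
import torch
import torch.nn as nn

# Toy embeddings standing in for the outputs of separate image and text encoders.
image_emb = torch.randn(1, 256)  # unimodal image embedding
text_emb = torch.randn(1, 256)   # unimodal text embedding

# Unimodal usage (CLIP-style): the two embeddings never mix inside the network,
# they are only compared afterwards, e.g. via cosine similarity.
unimodal_score = torch.cosine_similarity(image_emb, text_emb, dim=-1)

# Multimodal usage: both modalities are fused into one joint representation
# (a simple MLP over the concatenation here; BLIP instead fuses via cross-attention).
fusion = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 256))
multimodal_emb = fusion(torch.cat([image_emb, text_emb], dim=-1))

print(unimodal_score.shape, multimodal_emb.shape)  # torch.Size([1]) torch.Size([1, 256])
```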

Architecture:

The proposed architecture is as follows:

  1. a unimodal image encoder and a unimodal text encoder that take in an image and text as inputs (separately) and generate image and text embeddings in isolation.
  2. an image-grounded text encoder that takes in text as input (along with visual features injected via cross-attention from the unimodal image encoder) and generates multi-modal representations (combining image and text).
  3. an image-grounded text decoder that takes in a [Decode] token (along with visual features injected via cross-attention from the unimodal image encoder) and generates a textual description for the given input image. A minimal sketch of these three branches follows the figure below.
Architecture as in BLIP paper
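Below is a minimal PyTorch sketch of these three branches. It only mirrors the overall wiring: the real model uses BERT-style transformer blocks, causal masking and a [Decode] token in the decoder, none of which are shown here, and every layer below is a placeholder.

```python
import torch
import torch.nn as nn

class BlipLikeSketch(nn.Module):
    """Schematic of BLIP's three text branches around one image encoder.
    Layer choices and shapes are illustrative, not the paper's implementation."""
    def __init__(self, dim=256, vocab_size=30522):
        super().__init__()
        self.image_encoder = nn.Linear(768, dim)    # stand-in for a ViT
        self.text_encoder = nn.Linear(768, dim)     # stand-in for a BERT-style text encoder
        self.cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.itm_head = nn.Linear(dim, 2)           # match / no-match classifier
        self.lm_head = nn.Linear(dim, vocab_size)   # language-modeling head

    def forward(self, image_patches, text_feats):
        img = self.image_encoder(image_patches)     # (B, N, dim) visual features
        txt = self.text_encoder(text_feats)         # (B, T, dim) text features

        # 1. Unimodal branch: pooled image/text embeddings for the contrastive (ITC) loss.
        itc_img, itc_txt = img.mean(dim=1), txt.mean(dim=1)

        # 2. Image-grounded text encoder: visual features injected via cross-attention,
        #    the fused first token feeds the image-text matching (ITM) head.
        fused, _ = self.cross_attn(query=txt, key=img, value=img)
        itm_logits = self.itm_head(fused[:, 0])

        # 3. Image-grounded text decoder: in the real model this is a separate branch with
        #    causal self-attention and a [Decode] token; here we simply reuse the fused
        #    features and predict caption tokens with a language-modeling (LM) head.
        lm_logits = self.lm_head(fused)
        return itc_img, itc_txt, itm_logits, lm_logits

model = BlipLikeSketch()
outputs = model(torch.randn(2, 49, 768), torch.randn(2, 16, 768))
```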

Let's take a look at the pretraining objectives associated with each of these modules:

  1. Image-Text Contrastive Loss (ITC Loss): similar to CLIP, the unimodal encoders are trained to generate similar representations for matched image-text pairs and dissimilar representations for negative pairs.
  2. Image-Text Matching Loss (ITM Loss): this helps the text encoder learn a multi-modal representation that captures the fine-grained alignment between vision and language. It is a binary classification task where the model predicts 1 for a positive (matched) image-text pair and 0 for a negative pair.
  3. Language Modeling Loss (LM Loss): this loss trains the text decoder to generate a text description for the corresponding input image.

All three modules are trained jointly with these pretraining objectives; a rough sketch of the combined training step is shown below.
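Here is that sketch, assuming a model shaped like the one above (the temperature, the equal loss weighting and the argument names are illustrative, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def blip_pretraining_step(model, image_patches, text_feats, caption_ids, match_labels):
    """Hypothetical joint training step: `caption_ids` are the target token ids and
    `match_labels` marks which pairs are real matches (hard-negative mining not shown)."""
    itc_img, itc_txt, itm_logits, lm_logits = model(image_patches, text_feats)

    # ITC: in-batch contrastive loss; matched pairs sit on the diagonal.
    sim = itc_img @ itc_txt.t() / 0.07                       # temperature is illustrative
    targets = torch.arange(sim.size(0))
    itc_loss = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2

    # ITM: binary match / no-match classification on the fused representation.
    itm_loss = F.cross_entropy(itm_logits, match_labels)

    # LM: predict the caption tokens conditioned on the image.
    lm_loss = F.cross_entropy(lm_logits.flatten(0, 1), caption_ids.flatten())

    # The three objectives are simply summed and optimized jointly.
    return itc_loss + itm_loss + lm_loss
```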

One thing to note here: all the parameters are shared between the text encoder and the text decoder except for the self-attention layers, since the representations best suited for encoding and decoding are captured mainly within the respective SA layers.

A question I had when reading the paper was: "why do we need the ITM objective when the ITC and LM losses alone would seemingly suffice to jointly learn effective vision-language representations and a text generation model?"

One way to look at it is that the ITM loss complements both the ITC and LM objectives: it models fine-grained image-text alignment, and because the parameters are shared across the text encoder/decoder, the ITM objective forces more informative visual features to be extracted for the language generation task as well.

A drawback of ITM, though, is its computational cost: since the image and text representations are fused jointly, matching a set of images and texts requires one forward pass per pair, and the scores cannot be precomputed as in the ITC setup. A sketch of the typical workaround is shown below.
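The sketch ranks the whole corpus cheaply with precomputed ITC embeddings, then reranks only a shortlist with the costlier ITM head; the callables itc_image_encoder and itm_score are hypothetical stand-ins, not BLIP's actual API.

```python
import torch

def retrieve_caption(image, texts, text_embs, itc_image_encoder, itm_score, k=16):
    """Rank every candidate text cheaply with ITC, then rerank only the top-k with ITM.
    `text_embs` holds the texts' ITC embeddings, computed once offline."""
    # ITC: one image embedding plus a single matmul scores the entire corpus.
    img_emb = itc_image_encoder(image)                # (1, dim)
    itc_scores = img_emb @ text_embs.t()              # (1, num_texts)
    topk = itc_scores.topk(k, dim=-1).indices[0].tolist()

    # ITM: one joint image+text forward pass per candidate, so it is only
    # affordable on the shortlist, never on the full corpus.
    itm_scores = torch.stack([itm_score(image, texts[i]) for i in topk])
    return texts[topk[itm_scores.argmax().item()]]
```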

Captioning and Filtering(CapFilt):

One more component that helps improve BLIP's accuracy is CapFilt: the model's text decoder block (captioner) and image-text matching block (filter) work in tandem to relabel and filter noisy web data, and the network (ITC, ITM and LM) is then retrained on this now cleaner web data.

In simple terms (a self-distillation-like procedure):

  1. pretrain with noisy web data + a small but accurate human-labeled dataset.
  2. finetune the captioner (LM) and the filter (ITC and ITM) on the human-labeled data.
  3. generate new captions for the web images using the LM decoder, filter both the web and synthetic captions using ITM, and use the cleaned pairs to train the end-to-end network (ITC, ITM and LM) again. A rough sketch of this loop follows the figure below.
Bootstrapping of Captioning dataset
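A high-level sketch of this bootstrapping loop; every method name here (pretrain, finetune_decoder, finetune_itm, generate, matches) is a hypothetical placeholder rather than BLIP's real interface.

```python
def capfilt_bootstrap(model, web_pairs, human_pairs):
    # 1. Pretrain on noisy web pairs plus the small human-labeled set.
    model.pretrain(web_pairs + human_pairs)

    # 2. Finetune the captioner (LM decoder) and the filter (ITM) on human-labeled data only.
    captioner = model.finetune_decoder(human_pairs)
    filt = model.finetune_itm(human_pairs)

    # 3. Re-caption the web images and keep only the pairs the filter accepts,
    #    then pretrain again on the cleaned dataset.
    cleaned = []
    for image, web_text in web_pairs:
        synthetic = captioner.generate(image)
        for text in (web_text, synthetic):
            if filt.matches(image, text):
                cleaned.append((image, text))
    model.pretrain(cleaned + human_pairs)
    return model
```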

As much as the performance can be attributed to BLIP's architecture, the CapFilt framework gives a significant boost in accuracy and shows that it is not just large-scale data that matters, but clean and accurate large-scale data.

Results:

BLIP vs CLIP on image-text retrieval task

BLIP-2

Motivation:

In the previous section, we read how BLIP proposes an efficient architectural scheme and training objectives to effectively train multi-modal networks, thereby solving visual understanding and text generation (given an image) tasks.

But there is a challenge when it comes to training such models on large-scale datasets: end-to-end pretraining of such Vision-Language Models (VLMs) is particularly expensive.

Can we take off-the-shelf pretrained frozen image encoders and frozen Large Language Models (LLMs) and use them for vision-language pretraining while still preserving their learned representations?

One thing to note here is that LLMs have not seen images during their unimodal pretraining and freezing them makes vision-language alignment challenging.

BLIP-2 solves this problem by introducing a Querying Transformer (Q-Former) that extracts, from the frozen image encoder, the visual representation most informative with respect to a text caption; this representation is then fed to a frozen LLM to decode accurate text descriptions.

Architecture:

Naturally, BLIP-2 bootstraps Vision-Language Representation Learning from a Frozen Image Encoder (in stage 1) and Vision-to-Language Generative Learning from a Frozen LLM (in stage 2) with the help of the Querying Transformer (Q-Former).

The Q-Former is the trainable module that bridges the modality gap between the frozen image encoder and the frozen LLM.

Let’s take a look at both the pretraining stages in detail :)

Stage 1: Bootstrap Vision-Language Representation Learning from a Frozen Image Encoder:

BLIP-2 — Stage 1 Vision-Language Representation Learning

The first pretraining stage in BLIP-2 learns rich vision-language representations from the frozen image encoder, which aids better generative learning in stage 2.

Essentially, the flow is as follows:

  1. An input image is passed through a frozen image encoder (any pretrained vision transformer).
  2. The Q-Former interacts with the frozen image encoder and generates the visual representations most relevant to the text. The Q-Former consists of an image transformer and a text transformer, which are trained using the same set of objectives (ITC, ITM and LM losses) as in BLIP.

One intuitive way to look at the Q-Former is by asking the question: "How can one extract, from an already trained vision encoder, the visual features that relate most to a given text caption, in a generic and compute-efficient way?"

A lightweight image+text transformer (i.e. the Q-Former), which interacts with the frozen image encoder's features through cross-attention layers and shares self-attention layers across its image and text transformers, can filter out the rich visual features most relevant to a given text. A minimal sketch of the query mechanism is shown below.
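The real Q-Former is a BERT-style transformer whose image and text branches share self-attention layers; the sketch only shows the learnable-queries-plus-cross-attention idea, with illustrative sizes.

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    def __init__(self, dim=768, num_queries=32, num_heads=12):
        super().__init__()
        # A small, fixed set of learnable query tokens (32 in BLIP-2).
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frozen_image_feats):  # (B, num_patches, dim) from the frozen ViT
        q = self.queries.expand(frozen_image_feats.size(0), -1, -1)
        q = self.self_attn(q, q, q)[0]
        # Cross-attention is where the queries pull in visual information,
        # compressing hundreds of patch features into just num_queries vectors.
        z, _ = self.cross_attn(q, frozen_image_feats, frozen_image_feats)
        return z  # (B, num_queries, dim)

frozen_feats = torch.randn(2, 257, 768)  # stand-in for a frozen ViT's patch features
z = QFormerSketch()(frozen_feats)
print(z.shape)  # torch.Size([2, 32, 768])
```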

Stage 2: Bootstrap Vision-to-Language Generative Learning from a Frozen LLM:

Now that we have an efficient way (in terms of compute and accuracy) to learn vision-language representations from a frozen image encoder, let's look at the second and more important stage: generative learning while keeping the (frozen) LLM's capabilities intact.

BLIP-2 stage 2 Generative pretraining

In the second stage, the Q-Former's visual representations are passed to an FC layer that projects them to the same dimension as the LLM's input text embeddings.

The flow is as follows:

  1. input: image; output: a textual description of the image.
  2. the input image is passed through the frozen image encoder, followed by the Q-Former's image transformer, which outputs the visual representation Z.
  3. Z is projected to the same dimension as the LLM's input text embeddings.
  4. the projected query embeddings are then prepended to the input text embeddings. They function as soft visual prompts that condition the LLM on the visual representation extracted by the Q-Former.
  5. two kinds of LLMs are experimented with:

a. decoder-based LLMs, which generate text conditioned on Z (trained with a language modeling loss).

b. encoder-decoder-based LLMs, where the visual embedding is prepended to a prefix of the text and the model generates the remaining text (trained with a prefix language modeling loss).

In stage 2, the Q-Former along with the projection layer is trained end-to-end (keeping the image encoder and the LLM frozen); a minimal sketch of the projection and prompt-prepending step is shown below.
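Dimensions here are illustrative, the frozen LLM itself is not shown, and in the paper the text embeddings come from the LLM's own embedding table.

```python
import torch
import torch.nn as nn

llm_dim = 4096                                            # hidden size of the frozen LLM (illustrative)
proj = nn.Linear(768, llm_dim)                            # trainable, along with the Q-Former

z = torch.randn(2, 32, 768)                               # Q-Former output query embeddings
text_emb = torch.randn(2, 20, llm_dim)                    # embeddings of the caption tokens

visual_prompts = proj(z)                                  # (2, 32, llm_dim) soft visual prompts
llm_inputs = torch.cat([visual_prompts, text_emb], dim=1) # (2, 52, llm_dim)
# `llm_inputs` is fed to the frozen LLM; the LM loss is computed on the caption
# tokens, so gradients flow back through the projection into the Q-Former only.
print(llm_inputs.shape)
```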

Now a key question arises: "this whole Q-Former module has been introduced to exploit the generalization capabilities of pretrained image encoders and LLMs, so are they helpful at all?"

As a matter of fact, the authors make a promising observation that a stronger image encoder or a stronger LLM both lead to better performance, and BLIP-2 actually beats the previous BLIP architecture on several tasks.

This is quite cool! It enables others to reuse open-source models (be it vision or language models) trained at huge scale and adapt them to domain-specific data while retaining the generalization capabilities of the original models.

So, if stage 2 is what does the generative pretraining (which is what we want), then why have stage 1 in the first place?

As mentioned in the paper, the first-stage representation learning pre-trains the Q-Former to learn visual features relevant to the text, which reduces the burden on the LLM to learn vision-language alignment. Without the representation learning stage, the Q-Former would rely solely on the vision-to-language generative learning to bridge the modality gap, similar to the Perceiver Resampler in Flamingo.

InstructBLIP

Let us take a step back from the BLIP/BLIP-2 methods and look at recent trends in chatbots powered by LLMs.

Motivation:

We have seen LLMs show zero-shot generalization when it comes to text-only data (input and output), and instruction tuning is what substantially improves this zero-shot capability, e.g. GPT-3.5, GPT-4, LLaMA-2, etc.

When it comes to vision tasks, and especially vision-language tasks, there are two main approaches we have seen so far:

  1. one that uses joint image-text representation learning.
  2. another that couples LLMs with vision encoders, such as BLIP-2 and Flamingo.

Therefore, there arises a need for an instruction-tuning framework that deals with multiple modalities: text as well as images. This is a key step in building general-purpose visual recognition models.

What is instruction tuning, though? How different is it from image captioning or visual question answering data containing image-text pairs?

Instruction tuning (in vision-language pretraining) is about constructing image-text data (be it image captioning or VQA) in a conversation-like manner. InstructBLIP lists a set of instruction templates formulated for every task, framing the question and the answer in a predefined format; an illustrative sketch is shown below.
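The wording below is made up for illustration; the actual templates are listed in the InstructBLIP paper and differ per task.

```python
# Hypothetical instruction templates in the spirit of InstructBLIP's task templates.
vqa_templates = [
    "<Image> Question: {question} Short answer:",
    "<Image> Given the image, answer the following question. {question}",
]
caption_templates = [
    "<Image> A short image description:",
    "<Image> Write a brief caption describing the image.",
]

def build_instruction(template, **fields):
    """Turn a raw (image, question/answer) sample into an instruction-style prompt."""
    return template.format(**fields)

print(build_instruction(vqa_templates[0], question="What color is the car?"))
# -> "<Image> Question: What color is the car? Short answer:"
```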

InstructBLIP uses the same architecture as BLIP-2, in that it consists of a frozen image encoder, a frozen LLM and a Q-Former to extract visual features that are informative with respect to the text, where the text here is in instruction format and not just plain captions.

If one takes a closer look at how BLIP-2 works, it already behaves somewhat like a chatbot that takes in an image and produces captions; but what does it take to make it behave like an actual chatbot that can read images?

The improvement over BLIP-2 is twofold: constructing the data in the form of instructions, and enabling the Q-Former to perform instruction-aware visual feature extraction, which helps it generalize and adapt to the input instructions.

Architecture:

InstructBLIP Architecture

As we can see, InstructBLIP is very similar to BLIP-2, but the dataset has been constructed to be instruction-based and the same instructions are also fed to the Q-Former to produce instruction-aware visual representations. Specifically, the textual instruction is given not only to the frozen LLM, but also to the Q-Former, so that it can extract instruction-aware visual features from the frozen image encoder. A small sketch of this change is shown below.
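The sketch shows where the instruction enters, relative to the Q-Former sketch shown earlier; tensor sizes are illustrative.

```python
import torch

# The key change vs BLIP-2: embedded instruction tokens join the learnable queries
# inside the Q-Former's self-attention, so the extracted visual tokens depend on
# the instruction.
queries = torch.randn(1, 32, 768)        # learnable query tokens (as in BLIP-2)
instr_tokens = torch.randn(1, 12, 768)   # embedded instruction, e.g. a VQA question

qformer_input = torch.cat([queries, instr_tokens], dim=1)   # (1, 44, 768)
# Self-attention runs over this joint sequence, cross-attention still looks at the
# frozen image features, and only the 32 query outputs are projected and prepended
# to the LLM input exactly as in BLIP-2 (the instruction text also goes to the LLM).
print(qformer_input.shape)
```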

Datasets used:

Comparison with BLIP-2 and Flamingo:

As we can see in the above table, instruction tuning certainly helps improve zero-shot generalization.

Other instruction-tuned vision-language models:

The paper also cites several instruction-tuned vision-language models such as MiniGPT-4, LLaVA, mPLUG-Owl and MultiInstruct, which produce results comparable to InstructBLIP.

Summary:

Based on the series of papers under BLIP, the two things I would like to highlight are:

  1. making use of language alongside images to push towards general-purpose vision algorithms.
  2. it is not just the scale of the data that matters, but also how clean the data is (CapFilt) and whether it is fed in the form of instructions (instruction tuning in InstructBLIP).

And as we can see, it is not just VLMs that help us move towards generic vision algorithms; there is also a stream of interesting research around image inpainting-based approaches (e.g. Painter, SegGPT, PersonalizeSAM) that is picking up.

Applications:

BLIP/BLIP-2/InstructBLIP have shown strong results on image-text retrieval and other vision-language tasks, especially under zero-shot settings; a minimal usage sketch is shown below.
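As a quick illustration, here is a minimal zero-shot VQA/captioning sketch using the Hugging Face transformers port of BLIP-2, assuming that integration and the checkpoint named below are available in your environment (InstructBLIP has analogous classes and checkpoints):

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and model (half precision on a GPU keeps memory manageable).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg").convert("RGB")
prompt = "Question: what is shown in the image? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

generated = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())
```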

That's all for this article on the BLIP series. The architecture, pretraining objectives and instruction tuning proposed in these papers are of great significance and present a working recipe for building general-purpose vision models. I hope I was able to provide a good gist and understanding of these papers from an intuitive standpoint.

See you soon in my next article :)

Cheers!
