Importance of Attention Modules in Computer Vision

Prativa Das
6 min read · Dec 25, 2020

“The difference between something good and something great is attention to detail”— Charles R. Swindoll

This article is mainly about the usage and advantages of attention mechanisms in various computer vision tasks.

We will cover —

  • What is Attention?
  • Why Combine Channel & Spatial Attention?
  • Basic Idea of SE-Net
  • CBAM in Detail
  • SOTA

Focusing, or paying attention, is something we practice in our day-to-day lives. Right now, as you read this article, your eyes are focused only on this particular line while the rest of the page is partially blurred; you are paying attention to what is written here. Now look around your room, where you see many things: a chair, a table, a clock, and so on. In this case, you see and pay attention to many things at the same time.

Attention helps to focus on a particular object at a specific location and captures rich, detailed features of that object. These features are both distinct and discriminative in nature. In simple words, attention is like the bokeh effect in photography, where the target object is in focus and the background is blurred.

Example of Bokeh

The attention mechanism first gained popularity in Natural Language Processing (NLP), and was brought to the forefront by the paper “Attention Is All You Need”, presented by Google Brain in 2017. Attention has been widely used in sequential models built on recurrent neural networks (RNNs) and long short-term memory (LSTM) networks. Gradually, deep learning researchers started incorporating the mechanism into famous CNN baseline architectures (VGGNet, InceptionNet, ResNet, Xception, DenseNet, MobileNetV1, SE-Net) to further improve computer vision tasks such as object detection and segmentation.

Channel Attention & Spatial Attention

Channel attention focuses mainly on ‘what’ is informative in an input image, whereas spatial attention emphasizes ‘where’ the meaningful features are located. From a spatial viewpoint, channel attention is applied globally, while spatial attention works locally. Combined, the two improve both the precise localization and the classification ability of the model.

Squeeze and Excitation Network (SE-Net)

We briefly discuss SE-Net, as it was the winning model of the ILSVRC 2017 classification task. The model introduces squeeze-and-excitation blocks (SE blocks) that act as a channel attention module. These blocks can be integrated into existing models with very little additional computational cost. A global average pooling operation is applied to produce channel-wise statistics, and then a simple gating mechanism using the sigmoid function captures the channel-wise interdependencies. The SE blocks excite only the important feature maps and suppress the unnecessary ones.
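The squeeze-excite-rescale pipeline can be sketched in a few lines of NumPy. The random weights here are placeholders for the learned bottleneck MLP, and the reduction ratio r follows the design described in the SE-Net paper; this is a minimal illustration, not the official implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(F, W0, W1):
    """SE block sketch. F: feature map of shape (C, H, W)."""
    # Squeeze: global average pooling -> one descriptor per channel.
    z = F.mean(axis=(1, 2))                  # shape (C,)
    # Excitation: bottleneck MLP (C -> C/r -> C) + sigmoid gating.
    s = sigmoid(W1 @ np.maximum(W0 @ z, 0))  # shape (C,), values in (0, 1)
    # Rescale: excite informative channels, damp the rest.
    return F * s[:, None, None]

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
F = rng.standard_normal((C, H, W))
W0 = rng.standard_normal((C // r, C))   # squeeze to C/r
W1 = rng.standard_normal((C, C // r))   # expand back to C
out = se_block(F, W0, W1)
print(out.shape)  # (8, 4, 4)
```

Note that each channel is scaled by a single gate value in (0, 1), which is why the block reweights ‘what’ without touching ‘where’.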

Basic Operations in SE-block [Source]
SE-blocks plugged into CNN baseline architectures [Source]

SE-blocks emphasize only channel-wise attention and overlook the spatial features, which decide ‘where’ to focus to obtain rich information.

CBAM: Convolutional Block Attention Module

This module simultaneously highlights meaningful features, both channel-wise and spatially, by suppressing unnecessary noise. CBAM enhances the representational power of baseline CNNs by learning ‘what’ and ‘where’ to emphasize or suppress. CBAM is a light-weight module, so it can easily be plugged into any network for further performance improvement. The basic concept behind CBAM was presented earlier in the SCA-CNN paper (on image captioning).

The channel attention module (CAM) and spatial attention module (SAM) are the two distinct sub-modules in CBAM for feature refinement. The official PyTorch code for CBAM is publicly available, so you can also try this amazing module in your network and see how it works.

Structure of CBAM [Source]

Here the intermediate feature map is represented as F ∈ ℝ^(C×H×W) (C = channel, H = height, W = width), and F′′ is the final refined output. The symbol ⊗ denotes element-wise multiplication. The entire attention process can be summarised as:

F′ = Mc(F) ⊗ F
F′′ = Ms(F′) ⊗ F′

CAM and SAM can be arranged either in parallel or sequentially, but the latter arrangement (channel attention first, then spatial attention) produces better results.
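The sequential refinement F′ = Mc(F) ⊗ F, F′′ = Ms(F′) ⊗ F′ can be sketched in NumPy with placeholder attention maps. The simple pooled-statistics-through-a-sigmoid maps below stand in for CBAM’s learned MLP and convolution; the point is only to show the broadcasting shapes, Mc being (C, 1, 1) and Ms being (1, H, W).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam_refine(F, channel_map, spatial_map):
    Fp = channel_map(F) * F     # F'  = Mc(F) ⊗ F   (reweights channels)
    Fpp = spatial_map(Fp) * Fp  # F'' = Ms(F') ⊗ F' (reweights locations)
    return Fpp

# Placeholder attention maps, NOT CBAM's learned sub-modules.
Mc = lambda F: sigmoid(F.mean(axis=(1, 2)))[:, None, None]  # (C, 1, 1)
Ms = lambda F: sigmoid(F.mean(axis=0))[None, :, :]          # (1, H, W)

F = np.random.default_rng(1).standard_normal((8, 4, 4))
F2 = cbam_refine(F, Mc, Ms)
print(F2.shape)  # (8, 4, 4)
```

Because both gates lie in (0, 1), every element of F′′ has a smaller magnitude than the corresponding element of F; attention redistributes emphasis rather than amplifying features.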

Structure of CAM. [Source]

In CAM, the max-pooled and average-pooled features (F^c_max and F^c_avg) are forwarded to a shared network, a multi-layer perceptron (MLP), and the two output feature vectors are merged using element-wise summation to generate the channel attention map Mc. Channel attention is computed as:

Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W₁(W₀(F^c_avg)) + W₁(W₀(F^c_max)))

where σ denotes the sigmoid function and W₀, W₁ are the shared MLP weights. Mc(F) is element-wise multiplied with F to form F′. So channel attention focuses mainly on ‘what’ is informative, given an input image.
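A minimal NumPy sketch of this computation follows, with random placeholder weights standing in for the learned shared MLP (W₀, W₁). Both pooled descriptors pass through the same MLP, the outputs are summed, and a sigmoid produces the per-channel gate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W0, W1):
    """CAM sketch. F: feature map of shape (C, H, W)."""
    f_avg = F.mean(axis=(1, 2))  # F^c_avg: (C,)
    f_max = F.max(axis=(1, 2))   # F^c_max: (C,)
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0)  # shared MLP, C -> C/r -> C
    Mc = sigmoid(mlp(f_avg) + mlp(f_max))       # (C,), element-wise sum then gate
    return Mc[:, None, None] * F                # F' = Mc(F) ⊗ F

rng = np.random.default_rng(2)
C, r = 8, 2
F = rng.standard_normal((C, 6, 6))
W0 = rng.standard_normal((C // r, C))
W1 = rng.standard_normal((C, C // r))
Fp = channel_attention(F, W0, W1)
print(Fp.shape)  # (8, 6, 6)
```

Sharing the MLP between the two pooled descriptors is what keeps CAM’s parameter count low.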

Structure of SAM [Source]

The refined feature F′ (the CAM output multiplied with the input feature map) is fed to SAM to improve the model’s localization capability. Average-pooling and max-pooling are applied along the channel axis, the two resulting 2-D maps are concatenated and convolved with a filter, and a sigmoid produces the spatial attention map Ms, which further highlights the regions with detailed information:

Ms(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)]))

where f^{7×7} denotes a convolution with a 7×7 filter. Applying pooling operations along the channel axis is shown to be effective in highlighting informative regions. So SAM focuses mainly on ‘where’ the contextual information is localized.
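The same computation can be sketched in NumPy. The naive same-padded loop below stands in for a framework convolution (it actually computes cross-correlation, as deep learning frameworks do), and the random 7×7 kernel is a placeholder for the learned filter.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, k):
    """Naive 'same'-padded 2-D cross-correlation; x: (2, H, W), k: (2, kh, kw)."""
    _, H, W = x.shape
    kh, kw = k.shape[1:]
    xp = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[:, i:i + kh, j:j + kw] * k)
    return out

def spatial_attention(F, kernel):
    """SAM sketch. F: feature map of shape (C, H, W)."""
    f_avg = F.mean(axis=0, keepdims=True)      # (1, H, W), channel-wise avg-pool
    f_max = F.max(axis=0, keepdims=True)       # (1, H, W), channel-wise max-pool
    stacked = np.concatenate([f_avg, f_max])   # (2, H, W)
    Ms = sigmoid(conv2d_same(stacked, kernel)) # (H, W) spatial attention map
    return Ms[None, :, :] * F                  # F'' = Ms(F') ⊗ F'

rng = np.random.default_rng(3)
F = rng.standard_normal((8, 10, 10))
kernel = rng.standard_normal((2, 7, 7)) * 0.1  # placeholder 7×7 filter
Fpp = spatial_attention(F, kernel)
print(Fpp.shape)  # (8, 10, 10)
```

Every channel is scaled by the same (H, W) map, the mirror image of CAM, where every spatial location shares the same per-channel gate.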

CBAM integrated with a ResBlock in ResNet [Source]
CBAM plugged in between Residual blocks in ResNet

Even though the SE block looks very similar to CAM, CBAM outperforms the powerful SE-Net model. The major drawbacks of SE-Net are:

  • SE block ignores the spatial features, which degrades the precise localization capability of the model.
  • SE-Net uses global average pooling (GAP) alone to extract channel-wise features, with no max-pooling. The study finds that max-pooling preserves richer, more sophisticated features such as edges and textures, improving model performance.

The experiments show that GAP produces more refined and smooth features, as it includes generalized contextual information from the intermediate features. Max-pooling, on the other hand, captures distinctive features but fails to preserve this generalized context. So the authors of CBAM utilize both GAP and max-pooling to produce refined features, and with this combination CBAM outperforms most SOTA models (ResNet, WideResNet, ResNeXt, SE-Net, MobileNet).

SOTA

Classification results on ImageNet-1K [Source]
  • With minimal overhead in terms of both parameters and computation, CBAM can be efficiently used with the light-weight network MobileNet.
Classification results on ImageNet-1K using the light-weight network, MobileNet [Source]
  • The above results show that CBAM is a powerful method whose pooling strategy (combining average- and max-pooling) outperforms the baselines.
  • CBAM produces richer descriptors, and its spatial attention complements the channel attention effectively.
  • Finally, CBAM boosts the accuracy of baselines considerably and also noticeably improves on SE-Net. This shows the significant potential of CBAM for applications on low-end devices.

These are some great papers for you to explore based on attention mechanisms: Residual Attention Network, Bottleneck Attention Module (BAM), Squeeze-and-Excitation Networks, Attention to Scale.

Thank you for reading. If you found this post helpful, share it, and a clap or two would hurt no one.

Don’t forget to 👏 !

Feel free to share your thoughts and provide feedback in the comment section below.


Prativa Das

PhD Scholar | Deep Learning | Computer Vision | Remote Sensing