Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Tags: Swin Transformer
Published: 2021
Rating: ★★★★★
Model abbreviation: Swin Transformer
Summary: a hierarchical Vision Transformer built on window-based (shifted-window) multi-head self-attention; each pair of consecutive blocks applies W-MSA first and then SW-MSA.
In-depth read: Yes
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV 2021 Best Paper)
Combines the strengths of Transformers and CNNs.
Motivation: a unification story for AI (NLP and CV), pursuing a unified architecture (a goal plain ViT is arguably better suited to)
Unification: graph neural networks, self-attention
Unified modeling of NLP and CV
Transformer: graph-based modeling
Combining general representations with domain knowledge
ViT: sheer scale works wonders
On top of ViT, incorporate CV characteristics (good priors for visual signals):
Hierarchy, locality, translation invariance
Introduction
A general-purpose vision backbone with multi-scale features; multi-scale features are crucial for dense prediction tasks.
Linear computational complexity: The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.
Locality: shifted windows, local self-attention.
Hierarchy: Patch Merging.
shifted windows:
cross-window connection: as shifted windows are stacked across layers, local self-attention effectively gains a global receptive field
Model Architecture
4 stages
Patch Partition: split the image into patches, 224×224×3 → 56×56×48, downsampling rate 4.
Linear Embedding: project the token dimension to a size the Transformer can take, 56×56×48 → 56×56×96 → 3136×96, C = 96. Patch Partition + Linear Embedding ≈ Patch Projection in ViT.
Patch Merging: 2× downsampling that trades spatial resolution for channels, H×W×C → H/2×W/2×4C → H/2×W/2×2C; the spatial size halves and the channel count doubles, mirroring convolutional networks. 56×56×96 → 28×28×192. Self-attention + Patch Merging ≈ CNN + Pooling. (Both steps are sketched below.)
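A minimal PyTorch sketch of these two pieces, assuming a 224×224 input and C = 96 as in the text; class and variable names here are illustrative, not the official implementation:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patch Partition + Linear Embedding: 224x224x3 -> 56x56x96 (C=96)."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A stride-4, 4x4 conv splits the image into 4x4 patches and projects
        # each 4*4*3 = 48-dim patch to embed_dim in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 96, 56, 56)
        return x.flatten(2).transpose(1, 2)    # (B, 3136, 96) tokens

class PatchMerging(nn.Module):
    """HxWxC -> H/2 x W/2 x 4C -> H/2 x W/2 x 2C (halve resolution, double channels)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                # x: (B, H*W, C)
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        # Gather each 2x2 neighborhood into the channel dimension.
        x0, x1 = x[:, 0::2, 0::2, :], x[:, 1::2, 0::2, :]
        x2, x3 = x[:, 0::2, 1::2, :], x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)            # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)
        return self.reduction(self.norm(x))                # (B, H/2*W/2, 2C)

# 56x56x96 -> 28x28x192, as in the text.
tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
merged = PatchMerging(96)(tokens, 56, 56)
print(tokens.shape, merged.shape)  # (1, 3136, 96) (1, 784, 192)
```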
Architecture Variants (summarized in the config sketch below):
Swin-T: C=96, layer numbers = {2, 2, 6, 2}; Tiny, complexity comparable to ResNet-50
Swin-S: C=96, layer numbers = {2, 2, 18, 2}; Small, complexity comparable to ResNet-101
Swin-B: C=128, layer numbers = {2, 2, 18, 2}
Swin-L: C=192, layer numbers = {2, 2, 18, 2}
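Collected as a small config table (channel widths and per-stage depths from the paper; the `num_heads` values are the paper's defaults as well):

```python
SWIN_VARIANTS = {
    "Swin-T": dict(embed_dim=96,  depths=(2, 2, 6, 2),  num_heads=(3, 6, 12, 24)),
    "Swin-S": dict(embed_dim=96,  depths=(2, 2, 18, 2), num_heads=(3, 6, 12, 24)),
    "Swin-B": dict(embed_dim=128, depths=(2, 2, 18, 2), num_heads=(4, 8, 16, 32)),
    "Swin-L": dict(embed_dim=192, depths=(2, 2, 18, 2), num_heads=(6, 12, 24, 48)),
}
```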
Shifted Window based Self-Attention
Global computation of self-attention leads to quadratic complexity with respect to the number of tokens.
Compute self-attention within local windows.
$\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C$
$\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC$
Window-based self-attention has far lower computational complexity than global self-attention (linear rather than quadratic in the number of tokens), but it gives up global modeling ability.
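A quick plug-in of the two formulas above, taking the stage-1 feature map (h = w = 56, C = 96) and window size M = 7 purely as an illustrative setting:

```python
h = w = 56; C = 96; M = 7

msa   = 4 * h * w * C**2 + 2 * (h * w) ** 2 * C   # global self-attention: quadratic in h*w
w_msa = 4 * h * w * C**2 + 2 * M**2 * h * w * C   # window attention: linear in h*w

print(f"MSA  : {msa:,}")    # 2,003,828,736  (~2.0 GFLOPs)
print(f"W-MSA: {w_msa:,}")  # 145,108,992    (~0.15 GFLOPs, roughly 14x cheaper)
```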
Shifted window partitioning in successive blocks
$\hat{z}^l = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}$
$z^l = \text{MLP}(\text{LN}(\hat{z}^l)) + \hat{z}^l$
$\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^l)) + z^l$
$z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}$
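A minimal sketch (not the official code) of how two consecutive blocks alternate W-MSA and SW-MSA: the second block cyclically shifts the feature map by ⌊M/2⌋ with `torch.roll` before windowing and shifts it back afterwards. The attention mask needed for the shifted case is omitted here and sketched in the next subsection.

```python
import torch
import torch.nn as nn

def window_partition(x, M):
    """(B, H, W, C) -> (num_windows*B, M*M, C) non-overlapping MxM windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

def window_reverse(win, M, H, W):
    """Inverse of window_partition."""
    B = win.shape[0] // ((H // M) * (W // M))
    x = win.view(B, H // M, W // M, M, M, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class SwinBlockSketch(nn.Module):
    def __init__(self, dim=96, num_heads=3, M=7, shift=0):
        super().__init__()
        self.M, self.shift = M, shift
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, H, W, C)
        B, H, W, C = x.shape
        shortcut = x
        x = self.norm1(x)
        if self.shift:                         # SW-MSA: cyclic shift by M//2
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        win = window_partition(x, self.M)      # attention only inside each window
        win, _ = self.attn(win, win, win)      # (masking of stitched windows omitted)
        x = window_reverse(win, self.M, H, W)
        if self.shift:                         # shift back
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        x = shortcut + x                       # z_hat^l
        return x + self.mlp(self.norm2(x))     # z^l

blocks = nn.Sequential(SwinBlockSketch(shift=0), SwinBlockSketch(shift=3))  # W-MSA, then SW-MSA
print(blocks(torch.randn(1, 56, 56, 96)).shape)  # torch.Size([1, 56, 56, 96])
```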
Efficient batch computation approach for self-attention in shifted window partitioning
Mask in shifted window attention (Masked MSA, masked multi-head self-attention); the shifted layout of window regions resembles a tangram puzzle.
Mask visualization: the mask is added to the self-attention scores (see the sketch below).
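A sketch of how that additive mask can be built, following the cyclic-shift idea described above; the region-labelling scheme and the -100 fill value are a common convention, and the names here are illustrative:

```python
import torch

def build_sw_msa_mask(H=14, W=14, M=7, shift=3):
    # Label each pixel by which original window-region it belongs to after the
    # cyclic shift; tokens from different regions must not attend to each other.
    img = torch.zeros(1, H, W, 1)
    cnt = 0
    for h in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
        for w in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
            img[:, h, w, :] = cnt
            cnt += 1
    # Partition the label map into MxM windows, then compare labels pairwise.
    img = img.view(1, H // M, M, W // M, M, 1).permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M)
    diff = img.unsqueeze(1) - img.unsqueeze(2)          # (num_windows, M*M, M*M)
    # 0 where two tokens share a region, -100 otherwise; added to the attention
    # logits so cross-region pairs get ~zero weight after softmax.
    return diff.masked_fill(diff != 0, -100.0).masked_fill(diff == 0, 0.0)

mask = build_sw_msa_mask()
print(mask.shape)  # torch.Size([4, 49, 49])
```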
Experiments
Pretrained on:
ImageNet-1K: 1.28M images, 1k classes
ImageNet-22K: 14.2M images, 22k classes
3 datasets to cover various recognition tasks of different granularities
Image-level: ImageNet-1K classification (1.28 million images; 1000 classes)
Region-level: COCO object detection (115K images, 80 classes)
Pixel-level: ADE20K semantic segmentation (20K images; 150 classes)
3 levels of comparison
System-level comparisons (not aiming at strictly fair comparisons; squeeze out maximum performance, MMA-style: anything goes)
Backbone-level comparison
Verify the effectiveness of crucial designs
The experiments section is top-tier: the model outperforms previous models across the board.
Unification taken to the extreme could even mean two modalities sharing model weights, not just the same architecture; in practice it usually does not have to go that far, e.g. when handling raw signals the first few layers are typically not shared. In most applications, adopting Transformer-style modules the way Swin does is already enough to unify the training recipes of the two modalities and let them borrow from each other's experience. Conversely, some of Swin's design features could be carried back into NLP, which is another route to the unification mentioned above.
Swin Transformer V2
Swin Transformer V2: Scaling Up Capacity and Resolution
- Under pre-norm, activation magnitudes grow with depth, opening a large gap to shallow-layer features and making training unstable;
- Replacing dot-product similarity with cosine similarity mitigates a few oversized features dominating attention, since the cosine is inherently normalized;
- Log-spaced CPB helps extend to larger window sizes.
Three improvements on top of Swin Transformer:
- Post-normalization: apply layer normalization after the self-attention layer and the MLP block (see the sketch after this list)
- Scaled cosine attention in place of dot-product attention, computing the relation between token pairs with cosine similarity
- Log-spaced continuous position bias
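A minimal sketch of the res-post-norm block structure: the LayerNorm moves to the output of each residual branch, which keeps activation magnitudes from blowing up in deep layers. The `attn` and `mlp` submodules are placeholders, not the actual Swin V2 layers.

```python
import torch
import torch.nn as nn

class ResPostNormBlock(nn.Module):
    def __init__(self, dim, attn, mlp):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        # V2 post-norm: normalize each branch's output before the residual add.
        # (V1 pre-norm would instead be: x = x + self.attn(self.norm1(x)))
        x = x + self.norm1(self.attn(x))
        x = x + self.norm2(self.mlp(x))
        return x

blk = ResPostNormBlock(96, attn=nn.Identity(), mlp=nn.Linear(96, 96))
print(blk(torch.randn(1, 49, 96)).shape)  # torch.Size([1, 49, 96])
```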
Scaled cosine attention:
$\text{Sim}(q_i, k_j) = \dfrac{\cos(q_i, k_j)}{\tau} + B_{ij}$, where $\tau$ is a learnable scalar.
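A sketch of scaled cosine attention under the formula above: L2-normalize q and k so their dot product becomes a cosine similarity, divide by a learnable τ (clamped to stay above 0.01, as in the paper), and add the position bias. Tensor shapes and the placeholder bias here are illustrative:

```python
import torch
import torch.nn.functional as F

def scaled_cosine_attention(q, k, v, tau, rel_pos_bias):
    # q, k, v: (num_heads, N, head_dim); rel_pos_bias: (num_heads, N, N)
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = (q @ k.transpose(-2, -1)) / tau.clamp(min=0.01) + rel_pos_bias
    return F.softmax(logits, dim=-1) @ v

heads, N, d = 3, 49, 32
q, k, v = (torch.randn(heads, N, d) for _ in range(3))
tau = torch.full((heads, 1, 1), 0.1)                 # learnable per head in practice
out = scaled_cosine_attention(q, k, v, tau, torch.zeros(heads, N, N))
print(out.shape)  # torch.Size([3, 49, 32])
```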
Log-spaced CPB
Motivation: performance degrades when transferring models across window resolutions; on larger images with a larger window size, directly bi-cubic-interpolating the position bias causes a significant accuracy drop.
Continuous relative position bias: the bias is produced by a small meta network $\varphi$ (a 2-layer MLP) on the relative coordinates:
$B(\Delta x, \Delta y) = \varphi(\Delta x, \Delta y)$
Log-spaced coordinates compress large offsets, making it easier to extrapolate to larger windows:
$\widehat{\Delta x} = \operatorname{sign}(x) \cdot \log(1 + |\Delta x|)$
$\widehat{\Delta y} = \operatorname{sign}(y) \cdot \log(1 + |\Delta y|)$
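A sketch of how log-spaced continuous position bias can be computed for a single M×M window; the tiny 2-layer MLP stands in for the meta network $\varphi$, and the exact rescaling of the log coordinates is an assumption for illustration:

```python
import torch
import torch.nn as nn

M = 7  # window size
# All relative offsets within an MxM window: values in [-(M-1), M-1].
coords = torch.arange(M)
rel = torch.stack(torch.meshgrid(coords, coords, indexing="ij"), dim=-1).reshape(-1, 2)
delta = (rel[:, None, :] - rel[None, :, :]).float()          # (M*M, M*M, 2)

# Log-spaced coordinates: sign(d) * log(1 + |d|), rescaled to roughly [-1, 1].
log_delta = torch.sign(delta) * torch.log1p(delta.abs())
log_delta = log_delta / torch.log1p(torch.tensor(float(M - 1)))

num_heads = 3
phi = nn.Sequential(nn.Linear(2, 512), nn.ReLU(), nn.Linear(512, num_heads))
bias = phi(log_delta).permute(2, 0, 1)                        # (num_heads, M*M, M*M)
print(bias.shape)  # torch.Size([3, 49, 49])
```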