EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications

本文主要是介绍EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

摘要

众所周知，模型的准确性很大程度上可以评判一个模型的优劣。为了提高模型的准确性，人们往往会设计复杂的神经网络来提高准确率。然而，模型越复杂，需要的计算资源就越大，这就导致了大模型无法部署在例如手机等边缘设备上。本文一方面是为了结合CNN和Transformer的优势，一方面是为了使得模型能在边缘设备上部署，提出了EdgeNext。

介绍

CNN和Transformer的对比

CNN具备局部感受野，无法对全局信息进行建模。
CNN学习到的权重在推理过程中是静态的，无法灵活的适应输入的内容。
Transformer可以缓解这个问题，但是计算的复杂度太高，对边缘设备不友好。

贡献

提出了SDTA（Split depth-wise transpose attention），可以有效的增加全局和局部表示。最主要的是不增加参数和乘法操作。

模型	介绍
MobileNet	使用深度可分离卷积构建轻量型深度神经网络。（深度卷积（过滤特征）和点卷积（组合特征））
ShuffleNet	使用通道混洗操作和低成本组卷积。（逐点组卷积（降低计算成本）和通道混洗（不同组之间交换信息）
MobileFormer	MobileNet+Transformer的并行设计，从而实现局部特征和全局特征的融合。（mobileNet+Vision Transformer）
MobileViT	使用transformer作为卷积来学习全局表示，传统卷积可以分为展开、矩阵乘法（局部表示）、折叠三部分。
ConvNeXt	对ResNet进行改造，达到了Swin Transformer的效果

总体架构

在这里插入图片描述

Convolution Encoder

类似于深度可分离卷积，使用每个阶段大小可变的深度卷积来丰富局部表示。
然后使用1*1的卷积进行特征组合，进行不同通道上特征信息的交互。
空间混合（深度卷积）和通道混合（点卷积）

卷积编码器

SDTA Encoder

split通过编码输入图像中的各种空间级别来学习自适应多尺度特征表示。
自注意力的计算在通道维度上进行，大大减少了计算的复杂度。
SDTA

代码解析

EdgeNeXt有两种实现。一种是LayerNorm和GELU的，一种为Hard-Swish和BatchNorm。主要以介绍LayerNorm和GELU的ConvEncoder和SDTA。

ConvEncoder

class ConvEncoder(nn.Module):def __init__(self, dim, drop_path=0., layer_scale_init_value=1e-6, expan_ratio=4, kernel_size=7):super().__init__()# 空间混合self.dwconv = nn.Conv2d(dim, dim, kernel_size=kernel_size, padding=kernel_size // 2, groups=dim)self.norm = LayerNorm(dim, eps=1e-6)# 通道混合self.pwconv1 = nn.Linear(dim, expan_ratio * dim)self.act = nn.GELU()self.pwconv2 = nn.Linear(expan_ratio * dim, dim)self.gamma = nn.Parameter(layer_scale_init_value * torch.ones(dim),requires_grad=True) if layer_scale_init_value > 0 else Noneself.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()def forward(self, x):input = xx = self.dwconv(x)x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C)x = self.norm(x)x = self.pwconv1(x)x = self.act(x)x = self.pwconv2(x)if self.gamma is not None:x = self.gamma * xx = x.permute(0, 3, 1, 2)  # (N, H, W, C) -> (N, C, H, W)x = input + self.drop_path(x)return x

SDTA(Split Depth-wise Transpose Attention)

class SDTAEncoder(nn.Module):def __init__(self, dim, drop_path=0., layer_scale_init_value=1e-6, expan_ratio=4,use_pos_emb=True, num_heads=8, qkv_bias=True, attn_drop=0., drop=0., scales=1):super().__init__()width = max(int(math.ceil(dim / scales)), int(math.floor(dim // scales)))self.width = widthif scales == 1:self.nums = 1else:self.nums = scales - 1convs = []for i in range(self.nums):convs.append(nn.Conv2d(width, width, kernel_size=3, padding=1, groups=width))self.convs = nn.ModuleList(convs)self.pos_embd = Noneif use_pos_emb:self.pos_embd = PositionalEncodingFourier(dim=dim)self.norm_xca = LayerNorm(dim, eps=1e-6)self.gamma_xca = nn.Parameter(layer_scale_init_value * torch.ones(dim),requires_grad=True) if layer_scale_init_value > 0 else Noneself.xca = XCA(dim, num_heads=num_heads, qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop)self.norm = LayerNorm(dim, eps=1e-6)self.pwconv1 = nn.Linear(dim, expan_ratio * dim)  # pointwise/1x1 convs, implemented with linear layersself.act = nn.GELU()  # TODO: MobileViT is using 'swish'self.pwconv2 = nn.Linear(expan_ratio * dim, dim)self.gamma = nn.Parameter(layer_scale_init_value * torch.ones((dim)),requires_grad=True) if layer_scale_init_value > 0 else Noneself.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()def forward(self, x):input = x# 切分通道，加入深度卷积来增加多尺度表示spx = torch.split(x, self.width, 1)for i in range(self.nums):if i == 0:sp = spx[i]else:sp = sp + spx[i]sp = self.convs[i](sp)if i == 0:out = spelse:out = torch.cat((out, sp), 1)x = torch.cat((out, spx[self.nums]), 1)# XCA，在通道上计算注意力B, C, H, W = x.shapex = x.reshape(B, C, H * W).permute(0, 2, 1)if self.pos_embd:# 加入位置坐标pos_encoding = self.pos_embd(B, H, W).reshape(B, -1, x.shape[1]).permute(0, 2, 1)x = x + pos_encodingx = x + self.drop_path(self.gamma_xca * self.xca(self.norm_xca(x)))x = x.reshape(B, H, W, C)# Inverted Bottleneck，倒残差结构。# 倒残差结构的特点是：先对输入特征通道扩张，再提取特征，最后输出相应的特征通道，对于通道数来说有点中间大两头小类似于梭子的形状，所以称这样的结构为 Inverted residuals（倒残差结构）。x = self.norm(x)x = self.pwconv1(x)x = self.act(x)x = self.pwconv2(x)if self.gamma is not None:x = self.gamma * xx = x.permute(0, 3, 1, 2)  # (N, H, W, C) -> (N, C, H, W)x = input + self.drop_path(x)return x