Co-Scale Conv-Attentional Image Transformer (CoaT) Code

2024-02-25 06:59


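This post walks through the code of CoaT-Lite Small. The repository also contains the full CoaT model (the variant with parallel blocks), which is left for a later post.

Code repository: mlpc-ucsd/CoaT: (ICCV 2021 Oral) CoaT: Co-Scale Conv-Attentional Image Transformers (github.com)

The tensor shapes observed while debugging each step are annotated as comments after the corresponding line of code.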

""" 
CoaT architecture. Modified from timm/models/vision_transformer.py
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchsummary import summary  # Only needed for the commented-out summary() call in main().
from timm.data import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
from timm.models.layers import DropPath, to_2tuple, trunc_normal_
from timm.models.registry import register_model
from einops import rearrange
from functools import partial
from torch import nn, einsum

__all__ = [
    "coat_tiny",
    "coat_mini",
    "coat_small",
    "coat_lite_tiny",
    "coat_lite_mini",
    "coat_lite_small"
]


def _cfg_coat(url='', **kwargs):
    return {
        'url': url,
        'num_classes': 1000, 'input_size': (3, 224, 224), 'pool_size': None,
        'crop_pct': .9, 'interpolation': 'bicubic',
        'mean': IMAGENET_DEFAULT_MEAN, 'std': IMAGENET_DEFAULT_STD,
        'first_conv': 'patch_embed.proj', 'classifier': 'head',
        **kwargs
    }


class Mlp(nn.Module):
    """ Feed-forward network (FFN, a.k.a. MLP) class. """
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x


class ConvRelPosEnc(nn.Module):
    """ Convolutional relative position encoding. """
    def __init__(self, Ch, h, window):  # (8, 8, window=crpe_window={3:2, 5:3, 7:3})
        """
        Initialization.
            Ch: Channels per head.
            h: Number of heads.
            window: Window size(s) in convolutional relative positional encoding. It can have two forms:
                    1. An integer of window size, which assigns all attention heads with the same window size in ConvRelPosEnc.
                    2. A dict mapping window size to #attention head splits (e.g. {window size 1: #attention head split 1, window size 2: #attention head split 2})
                       It will apply different window size to the attention head splits.
        """
        # CoaT-Lite Small: embed_dims=[64, 128, 320, 512], serial_depths=[3, 4, 6, 3], parallel_depth=0, num_heads=8, mlp_ratios=[8, 8, 4, 4]
        super().__init__()

        if isinstance(window, int):
            window = {window: h}  # Set the same window size for all attention heads.
            self.window = window
        elif isinstance(window, dict):
            self.window = window  # {3:2, 5:3, 7:3}
        else:
            raise ValueError()

        self.conv_list = nn.ModuleList()
        self.head_splits = []
        for cur_window, cur_head_split in window.items():  # (3,2)/(5,3)/(7,3)
            dilation = 1  # Use dilation=1 at default.
            # Determine padding size. (3+(2)*(0))//2 = 1 for the 3x3 window.
            # Ref: https://discuss.pytorch.org/t/how-to-keep-the-shape-of-input-and-output-same-when-dilation-conv/14338
            padding_size = (cur_window + (cur_window - 1) * (dilation - 1)) // 2
            # (in, out, k, p, dilation, groups): (16,16,k=3,p=1,1,16)/(24,24,k=5,p=2,1,24)/(24,24,k=7,p=3,1,24)
            cur_conv = nn.Conv2d(cur_head_split*Ch, cur_head_split*Ch,
                kernel_size=(cur_window, cur_window),
                padding=(padding_size, padding_size),
                dilation=(dilation, dilation),
                groups=cur_head_split*Ch,
            )
            self.conv_list.append(cur_conv)
            self.head_splits.append(cur_head_split)  # [2,3,3]
        self.channel_splits = [x*Ch for x in self.head_splits]  # Ch=8, head_splits=[2,3,3] -> [16,24,24]

    def forward(self, q, v, size):  # q.shape == v.shape == (1,8,19201,8)
        B, h, N, Ch = q.shape  # B:1 h:8 N:19201 Ch:8
        H, W = size  # (120,160)
        assert N == 1 + H * W

        # Convolutional relative position encoding.
        q_img = q[:,:,1:,:]  # (1,8,19200,8)  Shape: [B, h, H*W, Ch].
        v_img = v[:,:,1:,:]  # (1,8,19200,8)  Shape: [B, h, H*W, Ch].

        v_img = rearrange(v_img, 'B h (H W) Ch -> B (h Ch) H W', H=H, W=W)  # (1,64,120,160)  Shape: [B, h, H*W, Ch] -> [B, h*Ch, H, W].
        v_img_list = torch.split(v_img, self.channel_splits, dim=1)  # channel_splits=[16,24,24]  Split according to channels.
        conv_v_img_list = [conv(x) for conv, x in zip(self.conv_list, v_img_list)]  # [(1,16,120,160),(1,24,120,160),(1,24,120,160)]
        conv_v_img = torch.cat(conv_v_img_list, dim=1)  # (1,64,120,160)
        conv_v_img = rearrange(conv_v_img, 'B (h Ch) H W -> B h (H W) Ch', h=h)  # (1,8,19200,8)  Shape: [B, h*Ch, H, W] -> [B, h, H*W, Ch].

        EV_hat_img = q_img * conv_v_img  # (1,8,19200,8)
        zero = torch.zeros((B, h, 1, Ch), dtype=q.dtype, layout=q.layout, device=q.device)  # (1,8,1,8)
        EV_hat = torch.cat((zero, EV_hat_img), dim=2)  # (1,8,19201,8)  Shape: [B, h, N, Ch].

        return EV_hat


class FactorAtt_ConvRelPosEnc(nn.Module):
    """ Factorized attention with convolutional relative position encoding class. """
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., shared_crpe=None):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)  # Note: attn_drop is actually not used.
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

        # Shared convolutional relative position encoding.
        self.crpe = shared_crpe

    def forward(self, x, size):
        B, N, C = x.shape  # (1,19201,64)

        # Generate Q, K, V.
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)  # (3,1,8,19201,8)  Shape: [3, B, h, N, Ch].
        q, k, v = qkv[0], qkv[1], qkv[2]  # (1,8,19201,8)  Shape: [B, h, N, Ch].

        # Factorized attention.
        k_softmax = k.softmax(dim=2)  # Softmax on dim N.
        k_softmax_T_dot_v = einsum('b h n k, b h n v -> b h k v', k_softmax, v)  # (1,8,8,8)  Shape: [B, h, Ch, Ch].
        factor_att = einsum('b h n k, b h k v -> b h n v', q, k_softmax_T_dot_v)  # (1,8,19201,8)  Shape: [B, h, N, Ch].

        # Convolutional relative position encoding.
        crpe = self.crpe(q, v, size=size)  # (1,8,19201,8)  Shape: [B, h, N, Ch].

        # Merge and reshape.
        x = self.scale * factor_att + crpe  # (1,8,19201,8)
        x = x.transpose(1, 2).reshape(B, N, C)  # (1,19201,64)  Shape: [B, h, N, Ch] -> [B, N, h, Ch] -> [B, N, C].

        # Output projection.
        x = self.proj(x)  # (1,19201,64)
        x = self.proj_drop(x)

        return x  # Shape: [B, N, C].


class ConvPosEnc(nn.Module):
    """ Convolutional Position Encoding.
        Note: This module is similar to the conditional position encoding in CPVT.
    """
    def __init__(self, dim, k=3):
        super(ConvPosEnc, self).__init__()
        self.proj = nn.Conv2d(dim, dim, k, 1, k//2, groups=dim)

    def forward(self, x, size):
        B, N, C = x.shape  # (1,19201,64)
        H, W = size  # (120,160)
        assert N == 1 + H * W

        # Extract CLS token and image tokens.
        cls_token, img_tokens = x[:, :1], x[:, 1:]  # (1,1,64),(1,19200,64)  Shape: [B, 1, C], [B, H*W, C].

        # Depthwise convolution.
        feat = img_tokens.transpose(1, 2).view(B, C, H, W)  # (1,64,120,160)
        x = self.proj(feat) + feat  # (1,64,120,160)
        x = x.flatten(2).transpose(1, 2)  # (1,19200,64)

        # Combine with CLS token.
        x = torch.cat((cls_token, x), dim=1)  # (1,19201,64)

        return x


class SerialBlock(nn.Module):
    """ Serial block class.
        Note: In this implementation, each serial block only contains a conv-attention and a FFN (MLP) module. """
    def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
                 drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm,
                 shared_cpe=None, shared_crpe=None):  # Stage 1: shared_cpe=self.cpe1, shared_crpe=self.crpe1
        super().__init__()

        # Conv-Attention.
        self.cpe = shared_cpe

        self.norm1 = norm_layer(dim)
        self.factoratt_crpe = FactorAtt_ConvRelPosEnc(
            dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop,
            shared_crpe=shared_crpe)
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

        # MLP.
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)

    def forward(self, x, size):
        # Conv-Attention.
        x = self.cpe(x, size)  # [(1,19201,64),(120,160)]/[(1,4801,128),(60,80)]  Apply convolutional position encoding.
        cur = self.norm1(x)
        cur = self.factoratt_crpe(cur, size)  # (1,19201,64)/(1,4801,128)  Apply factorized attention and convolutional relative position encoding.
        x = x + self.drop_path(cur)  # (1,19201,64)/(1,4801,128)

        # MLP.
        cur = self.norm2(x)
        cur = self.mlp(cur)
        x = x + self.drop_path(cur)

        return x


class ParallelBlock(nn.Module):
    """ Parallel block class. """
    def __init__(self, dims, num_heads, mlp_ratios=[], qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
                 drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm,
                 shared_cpes=None, shared_crpes=None):
        super().__init__()

        # Conv-Attention.
        self.cpes = shared_cpes

        self.norm12 = norm_layer(dims[1])
        self.norm13 = norm_layer(dims[2])
        self.norm14 = norm_layer(dims[3])
        self.factoratt_crpe2 = FactorAtt_ConvRelPosEnc(
            dims[1], num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop,
            shared_crpe=shared_crpes[1]
        )
        self.factoratt_crpe3 = FactorAtt_ConvRelPosEnc(
            dims[2], num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop,
            shared_crpe=shared_crpes[2]
        )
        self.factoratt_crpe4 = FactorAtt_ConvRelPosEnc(
            dims[3], num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop,
            shared_crpe=shared_crpes[3]
        )
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

        # MLP.
        self.norm22 = norm_layer(dims[1])
        self.norm23 = norm_layer(dims[2])
        self.norm24 = norm_layer(dims[3])
        assert dims[1] == dims[2] == dims[3]  # In parallel block, we assume dimensions are the same and share the linear transformation.
        assert mlp_ratios[1] == mlp_ratios[2] == mlp_ratios[3]
        mlp_hidden_dim = int(dims[1] * mlp_ratios[1])
        self.mlp2 = self.mlp3 = self.mlp4 = Mlp(in_features=dims[1], hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)

    def upsample(self, x, output_size, size):
        """ Feature map up-sampling. """
        return self.interpolate(x, output_size=output_size, size=size)

    def downsample(self, x, output_size, size):
        """ Feature map down-sampling. """
        return self.interpolate(x, output_size=output_size, size=size)

    def interpolate(self, x, output_size, size):
        """ Feature map interpolation. """
        B, N, C = x.shape
        H, W = size
        assert N == 1 + H * W

        cls_token  = x[:, :1, :]
        img_tokens = x[:, 1:, :]

        img_tokens = img_tokens.transpose(1, 2).reshape(B, C, H, W)
        img_tokens = F.interpolate(img_tokens, size=output_size, mode='bilinear')  # FIXME: May have alignment issue.
        img_tokens = img_tokens.reshape(B, C, -1).transpose(1, 2)

        out = torch.cat((cls_token, img_tokens), dim=1)

        return out

    def forward(self, x1, x2, x3, x4, sizes):
        _, (H2, W2), (H3, W3), (H4, W4) = sizes

        # Conv-Attention.
        x2 = self.cpes[1](x2, size=(H2, W2))  # Note: x1 is ignored.
        x3 = self.cpes[2](x3, size=(H3, W3))
        x4 = self.cpes[3](x4, size=(H4, W4))

        cur2 = self.norm12(x2)
        cur3 = self.norm13(x3)
        cur4 = self.norm14(x4)
        cur2 = self.factoratt_crpe2(cur2, size=(H2,W2))
        cur3 = self.factoratt_crpe3(cur3, size=(H3,W3))
        cur4 = self.factoratt_crpe4(cur4, size=(H4,W4))
        upsample3_2 = self.upsample(cur3, output_size=(H2,W2), size=(H3,W3))
        upsample4_3 = self.upsample(cur4, output_size=(H3,W3), size=(H4,W4))
        upsample4_2 = self.upsample(cur4, output_size=(H2,W2), size=(H4,W4))
        downsample2_3 = self.downsample(cur2, output_size=(H3,W3), size=(H2,W2))
        downsample3_4 = self.downsample(cur3, output_size=(H4,W4), size=(H3,W3))
        downsample2_4 = self.downsample(cur2, output_size=(H4,W4), size=(H2,W2))
        cur2 = cur2  + upsample3_2   + upsample4_2
        cur3 = cur3  + upsample4_3   + downsample2_3
        cur4 = cur4  + downsample3_4 + downsample2_4
        x2 = x2 + self.drop_path(cur2)
        x3 = x3 + self.drop_path(cur3)
        x4 = x4 + self.drop_path(cur4)

        # MLP.
        cur2 = self.norm22(x2)
        cur3 = self.norm23(x3)
        cur4 = self.norm24(x4)
        cur2 = self.mlp2(cur2)
        cur3 = self.mlp3(cur3)
        cur4 = self.mlp4(cur4)
        x2 = x2 + self.drop_path(cur2)
        x3 = x3 + self.drop_path(cur3)
        x4 = x4 + self.drop_path(cur4)

        return x1, x2, x3, x4


class PatchEmbed(nn.Module):
    """ Image to Patch Embedding """
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        patch_size = to_2tuple(patch_size)

        self.patch_size = patch_size  # (4, 4) in stage 1
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)  # Conv2d(3, 64, k=4, s=4)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        _, _, H, W = x.shape
        out_H, out_W = H // self.patch_size[0], W // self.patch_size[1]  # (120,160)/(60,80)

        x = self.proj(x).flatten(2).transpose(1, 2)  # (1,19200,64)/(1,4800,128)
        out = self.norm(x)  # (1,19200,64)

        return out, (out_H, out_W)


class CoaT(nn.Module):
    """ CoaT class. """
    def __init__(self, patch_size=16, in_chans=3, num_classes=1000, embed_dims=[0, 0, 0, 0],
                 serial_depths=[3,4,6,3], parallel_depth=0,
                 num_heads=0, mlp_ratios=[0, 0, 0, 0], qkv_bias=True, qk_scale=None, drop_rate=0., attn_drop_rate=0.,
                 drop_path_rate=0., norm_layer=partial(nn.LayerNorm, eps=1e-6),
                 return_interm_layers=False, out_features=None, crpe_window={3:2, 5:3, 7:3},
                 **kwargs):
        super().__init__()
        self.return_interm_layers = return_interm_layers
        self.out_features = out_features
        self.num_classes = num_classes  # 1000

        # Patch embeddings.
        self.patch_embed1 = PatchEmbed(patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dims[0])
        self.patch_embed2 = PatchEmbed(patch_size=2, in_chans=embed_dims[0], embed_dim=embed_dims[1])
        self.patch_embed3 = PatchEmbed(patch_size=2, in_chans=embed_dims[1], embed_dim=embed_dims[2])
        self.patch_embed4 = PatchEmbed(patch_size=2, in_chans=embed_dims[2], embed_dim=embed_dims[3])

        # Class tokens.
        self.cls_token1 = nn.Parameter(torch.zeros(1, 1, embed_dims[0]))  # (1,1,64)
        self.cls_token2 = nn.Parameter(torch.zeros(1, 1, embed_dims[1]))  # (1,1,128)
        self.cls_token3 = nn.Parameter(torch.zeros(1, 1, embed_dims[2]))
        self.cls_token4 = nn.Parameter(torch.zeros(1, 1, embed_dims[3]))

        # Convolutional position encodings.
        self.cpe1 = ConvPosEnc(dim=embed_dims[0], k=3)  # (64,k=3)
        self.cpe2 = ConvPosEnc(dim=embed_dims[1], k=3)  # (128,k=3)
        self.cpe3 = ConvPosEnc(dim=embed_dims[2], k=3)  # (320,k=3)
        self.cpe4 = ConvPosEnc(dim=embed_dims[3], k=3)  # (512,k=3)

        # Convolutional relative position encodings.
        self.crpe1 = ConvRelPosEnc(Ch=embed_dims[0] // num_heads, h=num_heads, window=crpe_window)
        self.crpe2 = ConvRelPosEnc(Ch=embed_dims[1] // num_heads, h=num_heads, window=crpe_window)
        self.crpe3 = ConvRelPosEnc(Ch=embed_dims[2] // num_heads, h=num_heads, window=crpe_window)
        self.crpe4 = ConvRelPosEnc(Ch=embed_dims[3] // num_heads, h=num_heads, window=crpe_window)

        # Enable stochastic depth.
        dpr = drop_path_rate

        # Serial blocks 1.
        self.serial_blocks1 = nn.ModuleList([
            SerialBlock(
                dim=embed_dims[0], num_heads=num_heads, mlp_ratio=mlp_ratios[0], qkv_bias=qkv_bias, qk_scale=qk_scale,
                drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr, norm_layer=norm_layer,
                shared_cpe=self.cpe1, shared_crpe=self.crpe1
            )
            for _ in range(serial_depths[0])]
        )

        # Serial blocks 2.
        self.serial_blocks2 = nn.ModuleList([
            SerialBlock(
                dim=embed_dims[1], num_heads=num_heads, mlp_ratio=mlp_ratios[1], qkv_bias=qkv_bias, qk_scale=qk_scale,
                drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr, norm_layer=norm_layer,
                shared_cpe=self.cpe2, shared_crpe=self.crpe2
            )
            for _ in range(serial_depths[1])]
        )

        # Serial blocks 3.
        self.serial_blocks3 = nn.ModuleList([
            SerialBlock(
                dim=embed_dims[2], num_heads=num_heads, mlp_ratio=mlp_ratios[2], qkv_bias=qkv_bias, qk_scale=qk_scale,
                drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr, norm_layer=norm_layer,
                shared_cpe=self.cpe3, shared_crpe=self.crpe3
            )
            for _ in range(serial_depths[2])]
        )

        # Serial blocks 4.
        self.serial_blocks4 = nn.ModuleList([
            SerialBlock(
                dim=embed_dims[3], num_heads=num_heads, mlp_ratio=mlp_ratios[3], qkv_bias=qkv_bias, qk_scale=qk_scale,
                drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr, norm_layer=norm_layer,
                shared_cpe=self.cpe4, shared_crpe=self.crpe4
            )
            for _ in range(serial_depths[3])]
        )

        # Parallel blocks.
        self.parallel_depth = parallel_depth
        if self.parallel_depth > 0:
            self.parallel_blocks = nn.ModuleList([
                ParallelBlock(
                    dims=embed_dims, num_heads=num_heads, mlp_ratios=mlp_ratios, qkv_bias=qkv_bias, qk_scale=qk_scale,
                    drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr, norm_layer=norm_layer,
                    shared_cpes=[self.cpe1, self.cpe2, self.cpe3, self.cpe4],
                    shared_crpes=[self.crpe1, self.crpe2, self.crpe3, self.crpe4]
                )
                for _ in range(parallel_depth)]
            )

        # Classification head(s).
        if not self.return_interm_layers:
            self.norm1 = norm_layer(embed_dims[0])
            self.norm2 = norm_layer(embed_dims[1])
            self.norm3 = norm_layer(embed_dims[2])
            self.norm4 = norm_layer(embed_dims[3])

            if self.parallel_depth > 0:  # CoaT series: Aggregate features of last three scales for classification.
                assert embed_dims[1] == embed_dims[2] == embed_dims[3]
                self.aggregate = torch.nn.Conv1d(in_channels=3, out_channels=1, kernel_size=1)
                self.head = nn.Linear(embed_dims[3], num_classes)
            else:
                self.head = nn.Linear(embed_dims[3], num_classes)  # CoaT-Lite series: Use feature of last scale for classification.

        # Initialize weights.
        trunc_normal_(self.cls_token1, std=.02)
        trunc_normal_(self.cls_token2, std=.02)
        trunc_normal_(self.cls_token3, std=.02)
        trunc_normal_(self.cls_token4, std=.02)
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    @torch.jit.ignore
    def no_weight_decay(self):
        return {'cls_token1', 'cls_token2', 'cls_token3', 'cls_token4'}

    def get_classifier(self):
        return self.head

    def reset_classifier(self, num_classes, global_pool=''):
        # Note: self.embed_dim is never assigned in __init__, so this method raises AttributeError as written.
        self.num_classes = num_classes
        self.head = nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()

    def insert_cls(self, x, cls_token):
        """ Insert CLS token. """
        cls_tokens = cls_token.expand(x.shape[0], -1, -1)  # (1,1,64)->(B,1,64), here (1,1,64)
        x = torch.cat((cls_tokens, x), dim=1)  # (1,19201,64)
        return x

    def remove_cls(self, x):
        """ Remove CLS token. """
        return x[:, 1:, :]

    def forward_features(self, x0):  # (1,3,480,640)
        B = x0.shape[0]  # 1

        # Serial blocks 1.
        x1, (H1, W1) = self.patch_embed1(x0)  # (1,19200,64),(120,160)
        x1 = self.insert_cls(x1, self.cls_token1)  # (1,19201,64)
        for blk in self.serial_blocks1:
            x1 = blk(x1, size=(H1, W1))  # Runs serial_depths[0]=3 times; (1,19201,64)
        x1_nocls = self.remove_cls(x1)  # (1,19200,64)
        x1_nocls = x1_nocls.reshape(B, H1, W1, -1).permute(0, 3, 1, 2).contiguous()  # (1,64,120,160)

        # Serial blocks 2.
        x2, (H2, W2) = self.patch_embed2(x1_nocls)  # (1,4800,128),(60,80)
        x2 = self.insert_cls(x2, self.cls_token2)  # (1,4801,128)
        for blk in self.serial_blocks2:
            x2 = blk(x2, size=(H2, W2))  # (1,4801,128)
        x2_nocls = self.remove_cls(x2)  # (1,4800,128)
        x2_nocls = x2_nocls.reshape(B, H2, W2, -1).permute(0, 3, 1, 2).contiguous()  # (1,128,60,80)

        # Serial blocks 3.
        x3, (H3, W3) = self.patch_embed3(x2_nocls)  # [(1,1200,320),(30,40)]
        x3 = self.insert_cls(x3, self.cls_token3)  # (1,1201,320)
        for blk in self.serial_blocks3:
            x3 = blk(x3, size=(H3, W3))  # (1,1201,320)
        x3_nocls = self.remove_cls(x3)  # (1,1200,320)
        x3_nocls = x3_nocls.reshape(B, H3, W3, -1).permute(0, 3, 1, 2).contiguous()  # (1,320,30,40)

        # Serial blocks 4.
        x4, (H4, W4) = self.patch_embed4(x3_nocls)  # [(1,300,512),(15,20)]
        x4 = self.insert_cls(x4, self.cls_token4)  # (1,301,512)
        for blk in self.serial_blocks4:
            x4 = blk(x4, size=(H4, W4))  # (1,301,512)
        x4_nocls = self.remove_cls(x4)  # (1,300,512)
        x4_nocls = x4_nocls.reshape(B, H4, W4, -1).permute(0, 3, 1, 2).contiguous()  # (1,512,15,20)

        # Only serial blocks: Early return.
        if self.parallel_depth == 0:
            if self.return_interm_layers:  # Return intermediate features for down-stream tasks (e.g. Deformable DETR and Detectron2).
                feat_out = {}
                if 'x1_nocls' in self.out_features:
                    feat_out['x1_nocls'] = x1_nocls
                if 'x2_nocls' in self.out_features:
                    feat_out['x2_nocls'] = x2_nocls
                if 'x3_nocls' in self.out_features:
                    feat_out['x3_nocls'] = x3_nocls
                if 'x4_nocls' in self.out_features:
                    feat_out['x4_nocls'] = x4_nocls
                return feat_out
            else:  # Return features for classification.
                x4 = self.norm4(x4)  # (1,301,512)
                x4_cls = x4[:, 0]  # (1,512); token index 0 is the CLS token.
                return x4_cls

        # Parallel blocks.
        for blk in self.parallel_blocks:
            x1, x2, x3, x4 = blk(x1, x2, x3, x4, sizes=[(H1, W1), (H2, W2), (H3, W3), (H4, W4)])

        if self.return_interm_layers:  # Return intermediate features for down-stream tasks (e.g. Deformable DETR and Detectron2).
            feat_out = {}
            if 'x1_nocls' in self.out_features:
                x1_nocls = self.remove_cls(x1)
                x1_nocls = x1_nocls.reshape(B, H1, W1, -1).permute(0, 3, 1, 2).contiguous()
                feat_out['x1_nocls'] = x1_nocls
            if 'x2_nocls' in self.out_features:
                x2_nocls = self.remove_cls(x2)
                x2_nocls = x2_nocls.reshape(B, H2, W2, -1).permute(0, 3, 1, 2).contiguous()
                feat_out['x2_nocls'] = x2_nocls
            if 'x3_nocls' in self.out_features:
                x3_nocls = self.remove_cls(x3)
                x3_nocls = x3_nocls.reshape(B, H3, W3, -1).permute(0, 3, 1, 2).contiguous()
                feat_out['x3_nocls'] = x3_nocls
            if 'x4_nocls' in self.out_features:
                x4_nocls = self.remove_cls(x4)
                x4_nocls = x4_nocls.reshape(B, H4, W4, -1).permute(0, 3, 1, 2).contiguous()
                feat_out['x4_nocls'] = x4_nocls
            return feat_out
        else:
            x2 = self.norm2(x2)
            x3 = self.norm3(x3)
            x4 = self.norm4(x4)
            x2_cls = x2[:, :1]  # Shape: [B, 1, C].
            x3_cls = x3[:, :1]
            x4_cls = x4[:, :1]
            merged_cls = torch.cat((x2_cls, x3_cls, x4_cls), dim=1)  # Shape: [B, 3, C].
            merged_cls = self.aggregate(merged_cls).squeeze(dim=1)  # Shape: [B, C].
            return merged_cls

    def forward(self, x):
        if self.return_interm_layers:  # Return intermediate features (for down-stream tasks).
            return self.forward_features(x)
        else:  # Return features for classification.
            x = self.forward_features(x)  # (1,512)
            x = self.head(x)  # (1,1000)
            return x


# CoaT.
@register_model
def coat_tiny(**kwargs):
    model = CoaT(patch_size=4, embed_dims=[152, 152, 152, 152], serial_depths=[2, 2, 2, 2], parallel_depth=6, num_heads=8, mlp_ratios=[4, 4, 4, 4], **kwargs)
    model.default_cfg = _cfg_coat()
    return model


@register_model
def coat_mini(**kwargs):
    model = CoaT(patch_size=4, embed_dims=[152, 216, 216, 216], serial_depths=[2, 2, 2, 2], parallel_depth=6, num_heads=8, mlp_ratios=[4, 4, 4, 4], **kwargs)
    model.default_cfg = _cfg_coat()
    return model


@register_model
def coat_small(**kwargs):
    model = CoaT(patch_size=4, embed_dims=[152, 320, 320, 320], serial_depths=[2, 2, 2, 2], parallel_depth=6, num_heads=8, mlp_ratios=[4, 4, 4, 4], **kwargs)
    model.default_cfg = _cfg_coat()
    return model


# CoaT-Lite.
@register_model
def coat_lite_tiny(**kwargs):
    model = CoaT(patch_size=4, embed_dims=[64, 128, 256, 320], serial_depths=[2, 2, 2, 2], parallel_depth=0, num_heads=8, mlp_ratios=[8, 8, 4, 4], **kwargs)
    model.default_cfg = _cfg_coat()
    return model


@register_model
def coat_lite_mini(**kwargs):
    model = CoaT(patch_size=4, embed_dims=[64, 128, 320, 512], serial_depths=[2, 2, 2, 2], parallel_depth=0, num_heads=8, mlp_ratios=[8, 8, 4, 4], **kwargs)
    model.default_cfg = _cfg_coat()
    return model


@register_model
def coat_lite_small(**kwargs):
    model = CoaT(patch_size=4, embed_dims=[64, 128, 320, 512], serial_depths=[3, 4, 6, 3], parallel_depth=0, num_heads=8, mlp_ratios=[8, 8, 4, 4], **kwargs)
    model.default_cfg = _cfg_coat()
    return model


@register_model
def coat_lite_medium(**kwargs):
    model = CoaT(patch_size=4, embed_dims=[128, 256, 320, 512], serial_depths=[3, 6, 10, 8], parallel_depth=0, num_heads=8, mlp_ratios=[4, 4, 4, 4], **kwargs)
    model.default_cfg = _cfg_coat()
    return model


def main():
    model = coat_lite_small()  # (pass constructor arguments here)
    # summary(model,input_size=(3,480,640),device='cpu')
    model.eval()
    rgb_image = torch.randn(1,3, 480, 640)
    with torch.no_grad():
        output = model(rgb_image)
    print(output.shape)


if __name__ == '__main__':
    main()

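First, as usual, the architecture diagram (figure not preserved here):

The individual parts of the diagram (figures not preserved here):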

 

 

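1: The input first enters the serial blocks. Within a stage, the image first goes through patch embedding. This corresponds to the forward_features method of the CoaT class. The model is built with the arguments below, which replace the placeholder defaults in CoaT.__init__.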

def coat_lite_small(**kwargs):
    model = CoaT(patch_size=4, embed_dims=[64, 128, 320, 512], serial_depths=[3, 4, 6, 3], parallel_depth=0, num_heads=8, mlp_ratios=[8, 8, 4, 4], **kwargs)
    model.default_cfg = _cfg_coat()
    return model

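In the first stage, patch_size=4, in_chans=3 and embed_dims[0]=64. Stepping into PatchEmbed.forward: it first computes the output H and W. The original input is (1,3,480,640), so (out_H, out_W) = (480 // 4, 640 // 4) = (120,160). A 4×4 convolution with stride 4 then projects the 3 input channels to 64, giving (1,64,120,160); flatten(2) turns this into (1,64,19200), and transpose(1, 2) swaps dimensions 1 and 2. After LayerNorm, the output shape is (1,19200,64).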

class PatchEmbed(nn.Module):
    """ Image to Patch Embedding """
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        patch_size = to_2tuple(patch_size)

        self.patch_size = patch_size  # (4, 4) in stage 1
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)  # Conv2d(3, 64, k=4, s=4)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        _, _, H, W = x.shape
        out_H, out_W = H // self.patch_size[0], W // self.patch_size[1]  # (120,160)/(60,80)

        x = self.proj(x).flatten(2).transpose(1, 2)  # (1,19200,64)/(1,4800,128)
        out = self.norm(x)  # (1,19200,64)

        return out, (out_H, out_W)
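The shape arithmetic of the four patch-embedding stages can be checked without running the model. The following is a minimal sketch in plain Python; the helper name stage_token_shapes is hypothetical, and the patch sizes and embedding dimensions are the CoaT-Lite Small values quoted above.

```python
def stage_token_shapes(height, width, patch_sizes=(4, 2, 2, 2), embed_dims=(64, 128, 320, 512)):
    """Return (H, W, tokens_including_cls, dim) for each serial stage."""
    shapes = []
    h, w = height, width
    for patch, dim in zip(patch_sizes, embed_dims):
        # kernel_size == stride == patch and no padding, so the grid shrinks by floor division.
        h, w = h // patch, w // patch
        shapes.append((h, w, h * w + 1, dim))  # +1 accounts for the CLS token
    return shapes


for stage, shape in enumerate(stage_token_shapes(480, 640), start=1):
    print(stage, shape)
```

For a 480×640 input this reproduces the annotated sizes: 19201, 4801, 1201 and 301 tokens in stages 1 to 4.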

Next, the class token is inserted. The class token has shape (1,1,64), so the sequence becomes (1,19201,64). It then enters the conv-attention block.
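As a quick illustration of the class-token step, the sketch below mirrors insert_cls and remove_cls as standalone functions on a small tensor. The sizes (batch 2, 12 tokens) are made up to keep it fast; only the logic follows the code above.

```python
import torch


def insert_cls(x, cls_token):
    # cls_token: (1, 1, C) is broadcast to (B, 1, C), then prepended along the token axis.
    return torch.cat((cls_token.expand(x.shape[0], -1, -1), x), dim=1)


def remove_cls(x):
    # Drop token index 0, which is the CLS token.
    return x[:, 1:, :]


tokens = torch.randn(2, 12, 64)      # small stand-in for (1, 19200, 64)
cls_token = torch.zeros(1, 1, 64)
with_cls = insert_cls(tokens, cls_token)
print(with_cls.shape)
```

Because the token is always placed at index 0, every later module can separate it from the image tokens with a simple slice.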

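2: The first stage has three serial blocks (serial_depths[0]=3). They are constructed with the arguments below. There are 8 heads. Note the two important modules shared_cpe=self.cpe1 and shared_crpe=self.crpe1: the same instances are passed to every block of the stage, so their weights are shared.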

        self.serial_blocks1 = nn.ModuleList([
            SerialBlock(
                dim=embed_dims[0], num_heads=num_heads, mlp_ratio=mlp_ratios[0], qkv_bias=qkv_bias, qk_scale=qk_scale,
                drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr, norm_layer=norm_layer,
                shared_cpe=self.cpe1, shared_crpe=self.crpe1
            )
            for _ in range(serial_depths[0])]
        )
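Passing one module instance to several blocks means its parameters exist only once. The sketch below demonstrates this with a hypothetical TinyBlock class standing in for SerialBlock (it is not part of the CoaT code): each block holds the same depthwise convolution plus its own LayerNorm.

```python
import torch.nn as nn


class TinyBlock(nn.Module):
    """Hypothetical stand-in for SerialBlock: one shared module, one private module."""
    def __init__(self, dim, shared_cpe):
        super().__init__()
        self.cpe = shared_cpe           # the same object in every block
        self.norm = nn.LayerNorm(dim)   # private to this block


dim, depth = 64, 3
shared_cpe = nn.Conv2d(dim, dim, 3, 1, 1, groups=dim)
blocks = nn.ModuleList([TinyBlock(dim, shared_cpe) for _ in range(depth)])

same_object = all(blk.cpe is shared_cpe for blk in blocks)
conv_params = sum(p.numel() for p in shared_cpe.parameters())
# parameters() yields each shared tensor once, so the convolution is not counted three times.
n_params = sum(p.numel() for p in blocks.parameters())
print(same_object, conv_params, n_params)
```

The same reasoning applies to self.cpe1 and self.crpe1: three serial blocks in stage 1 reuse one set of position-encoding weights.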

Stepping into the conv-attention block: the input x first receives the convolutional position encoding.

class SerialBlock(nn.Module):
    """ Serial block class.
        Note: In this implementation, each serial block only contains a conv-attention and a FFN (MLP) module. """
    def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
                 drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm,
                 shared_cpe=None, shared_crpe=None):  # Stage 1: shared_cpe=self.cpe1, shared_crpe=self.crpe1
        super().__init__()

        # Conv-Attention.
        self.cpe = shared_cpe

        self.norm1 = norm_layer(dim)
        self.factoratt_crpe = FactorAtt_ConvRelPosEnc(
            dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop,
            shared_crpe=shared_crpe)
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

        # MLP.
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)

    def forward(self, x, size):
        # Conv-Attention.
        x = self.cpe(x, size)  # [(1,19201,64),(120,160)]/[(1,4801,128),(60,80)]  Apply convolutional position encoding.
        cur = self.norm1(x)
        cur = self.factoratt_crpe(cur, size)  # (1,19201,64)/(1,4801,128)  Apply factorized attention and convolutional relative position encoding.
        x = x + self.drop_path(cur)  # (1,19201,64)/(1,4801,128)

        # MLP.
        cur = self.norm2(x)
        cur = self.mlp(cur)
        x = x + self.drop_path(cur)

        return x
self.cpe1 = ConvPosEnc(dim=embed_dims[0], k=3) #(64,k=3)

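        The convolutional position encoding is the ConvPosEnc module. It first reads the shape of x and then separates the class token from the image tokens (the class token was prepended by insert_cls right after patch embedding). Their shapes are (1,1,64) and (1,19200,64). The image tokens are reshaped back into a 2D feature map (1,64,120,160), passed through a 3×3 depthwise convolution, and added to the input feature map as a residual. The result is flattened back into tokens and concatenated with the untouched class token, giving (1,19201,64).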

class ConvPosEnc(nn.Module):
    """ Convolutional Position Encoding.
        Note: This module is similar to the conditional position encoding in CPVT.
    """
    def __init__(self, dim, k=3):
        super(ConvPosEnc, self).__init__()
        self.proj = nn.Conv2d(dim, dim, k, 1, k//2, groups=dim)

    def forward(self, x, size):
        B, N, C = x.shape  # (1,19201,64)
        H, W = size  # (120,160)
        assert N == 1 + H * W

        # Extract CLS token and image tokens.
        cls_token, img_tokens = x[:, :1], x[:, 1:]  # (1,1,64),(1,19200,64)  Shape: [B, 1, C], [B, H*W, C].

        # Depthwise convolution.
        feat = img_tokens.transpose(1, 2).view(B, C, H, W)  # (1,64,120,160)
        x = self.proj(feat) + feat  # (1,64,120,160)
        x = x.flatten(2).transpose(1, 2)  # (1,19200,64)

        # Combine with CLS token.
        x = torch.cat((cls_token, x), dim=1)  # (1,19201,64)

        return x
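The token-to-image round trip in ConvPosEnc can be reproduced on a small grid. This is a sketch under assumed toy sizes (a 6×8 grid with 16 channels instead of 120×160 with 64); it uses reshape rather than view, which is a small deviation chosen so the snippet does not depend on memory layout.

```python
import torch
import torch.nn as nn

B, H, W, C = 1, 6, 8, 16                     # small stand-ins for (1, 120, 160, 64)
x = torch.randn(B, 1 + H * W, C)
proj = nn.Conv2d(C, C, kernel_size=3, stride=1, padding=1, groups=C)  # depthwise: one filter per channel

cls_token, img_tokens = x[:, :1], x[:, 1:]
feat = img_tokens.transpose(1, 2).reshape(B, C, H, W)   # tokens -> 2D feature map
feat = proj(feat) + feat                                # position encoding added as a residual
out = torch.cat((cls_token, feat.flatten(2).transpose(1, 2)), dim=1)
print(out.shape)
```

The output has the same shape as the input, and the class token passes through unchanged because it never enters the convolution.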

3: Back in SerialBlock, the output is normalized by norm1 and then passed to the factorized attention mechanism.

        self.factoratt_crpe = FactorAtt_ConvRelPosEnc(dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop, shared_crpe=shared_crpe)

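        First we compute qkv; indexing its first dimension gives q, k and v, each of shape (1,8,19201,8). According to the formula, we need the transpose of softmax(K) multiplied by V, where the softmax is taken over the token dimension N. The code computes this directly with einsum, and the result has shape (1,8,8,8). Q is then multiplied by k_softmax_T_dot_v to give factor_att with shape (1,8,19201,8). The scalar self.scale = head_dim ** -0.5 is applied later, when factor_att is merged with the position encoding.

        Next, q and v are passed to crpe, the convolutional relative position encoding.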
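To see why the factorized form matters at this resolution, it helps to compare the size of the intermediate matrix per head in the two orderings. The sketch below is plain arithmetic; the function name attention_entries is hypothetical, and the numbers are the stage-1 values from the annotations (N=19201 tokens, Ch=8 channels per head).

```python
def attention_entries(n_tokens, ch_per_head):
    """Entries per head in the intermediate matrix of each attention form."""
    standard = n_tokens * n_tokens           # softmax(Q K^T): an N x N map
    factorized = ch_per_head * ch_per_head   # softmax(K)^T V: a Ch x Ch matrix
    return standard, factorized


standard, factorized = attention_entries(n_tokens=19201, ch_per_head=8)
print(standard, factorized)
```

Under these assumptions a full attention map would need roughly 3.7e8 entries per head, while the factorized product only stores 64.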

class FactorAtt_ConvRelPosEnc(nn.Module):
    """ Factorized attention with convolutional relative position encoding class. """
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., shared_crpe=None):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)  # Note: attn_drop is actually not used.
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

        # Shared convolutional relative position encoding.
        self.crpe = shared_crpe

    def forward(self, x, size):
        B, N, C = x.shape  # (1,19201,64)

        # Generate Q, K, V.
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)  # (3,1,8,19201,8)  Shape: [3, B, h, N, Ch].
        q, k, v = qkv[0], qkv[1], qkv[2]  # (1,8,19201,8)  Shape: [B, h, N, Ch].

        # Factorized attention.
        k_softmax = k.softmax(dim=2)  # Softmax on dim N.
        k_softmax_T_dot_v = einsum('b h n k, b h n v -> b h k v', k_softmax, v)  # (1,8,8,8)  Shape: [B, h, Ch, Ch].
        factor_att = einsum('b h n k, b h k v -> b h n v', q, k_softmax_T_dot_v)  # (1,8,19201,8)  Shape: [B, h, N, Ch].

        # Convolutional relative position encoding.
        crpe = self.crpe(q, v, size=size)  # (1,8,19201,8)  Shape: [B, h, N, Ch].

        # Merge and reshape.
        x = self.scale * factor_att + crpe  # (1,8,19201,8)
        x = x.transpose(1, 2).reshape(B, N, C)  # (1,19201,64)  Shape: [B, h, N, Ch] -> [B, N, h, Ch] -> [B, N, C].

        # Output projection.
        x = self.proj(x)  # (1,19201,64)
        x = self.proj_drop(x)

        return x  # Shape: [B, N, C].
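The two einsum calls in the factorized attention are ordinary batched matrix products. The sketch below, on small random tensors with assumed toy sizes, checks that they match an explicit matmul formulation and that, by associativity, the same result could be obtained through an N×N intermediate.

```python
import torch

torch.manual_seed(0)
B, h, N, Ch = 1, 2, 10, 4
q, k, v = (torch.randn(B, h, N, Ch) for _ in range(3))

k_softmax = k.softmax(dim=2)                                        # normalize over the N tokens
kv = torch.einsum('b h n k, b h n v -> b h k v', k_softmax, v)      # (B, h, Ch, Ch)
out_einsum = torch.einsum('b h n k, b h k v -> b h n v', q, kv)     # (B, h, N, Ch)

# Same computation with explicit matrix products: Q (softmax(K)^T V).
out_matmul = q @ (k_softmax.transpose(-2, -1) @ v)
# Associativity: (Q softmax(K)^T) V is equal, but builds an N x N matrix on the way.
out_quadratic = (q @ k_softmax.transpose(-2, -1)) @ v
print(out_einsum.shape)
```

The factorized ordering is therefore a reordering of the same product that avoids the N×N intermediate; note that it is not identical to standard attention, where the softmax is applied to Q Kᵀ.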
self.crpe1 = ConvRelPosEnc(Ch=embed_dims[0] // num_heads, h=num_heads, window=crpe_window)

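        ConvRelPosEnc is constructed with (Ch, h, window) = (8, 8, crpe_window={3:2, 5:3, 7:3}). Here window is a dict; note this sentence from the docstring:

A dict mapping window size to #attention head splits (e.g. {window size 1: #attention head split 1, window size 2: #attention head split 2}). It will apply different window size to the attention head splits.

In other words, the dict maps each window size to the number of attention heads that use it. With {3:2, 5:3, 7:3}, 2 heads use a 3×3 window, 3 heads use a 5×5 window and 3 heads use a 7×7 window (2+3+3 = 8 heads in total), so different groups of heads see different window sizes.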
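The mapping from the window dict to channel groups can be sketched in a few lines of plain Python. The helper channel_splits is hypothetical, and the assertion that the head counts add up to num_heads is an added sanity check that the original constructor does not perform.

```python
def channel_splits(window, num_heads, ch_per_head):
    """Map a {window_size: heads_using_it} dict to {window_size: channels}."""
    if isinstance(window, int):
        window = {window: num_heads}   # one window size shared by all heads
    # Added check (not in the original code): the splits must cover every head.
    assert sum(window.values()) == num_heads
    return {k: heads * ch_per_head for k, heads in window.items()}


print(channel_splits({3: 2, 5: 3, 7: 3}, num_heads=8, ch_per_head=8))
print(channel_splits({3: 2, 5: 3, 7: 3}, num_heads=8, ch_per_head=64))
```

For stage 1 (Ch=8) this gives 16, 24 and 24 channels; for stage 4 (Ch=512 // 8 = 64) it gives 128, 192 and 192.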

        

 

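        Iterating over the dict yields a window size and a head split each time. In the first iteration, (cur_window, cur_head_split) = (3,2): dilation=1, padding=1, and cur_conv is a convolution with 16 input channels, 16 output channels, kernel=3 and groups=16. Second iteration: a convolution with (24,24,k=5,p=2,dilation=1,groups=24). Third iteration: (24,24,k=7,p=3,dilation=1,groups=24). The three convolutions are appended in order to the nn.ModuleList conv_list, and each cur_head_split is appended to the initially empty list head_splits, giving head_splits=[2,3,3] and channel_splits=[16,24,24].

        Back in forward: first take q and v without the class token. v is rearranged into a 2D feature map (1,64,120,160) and then split along the channel dimension. v_img_list contains three tensors with shapes [(1,16,120,160),(1,24,120,160),(1,24,120,160)]. Each tensor is passed through the matching convolution in conv_list; the shapes do not change. The results are concatenated along the channel dimension and rearranged back into the per-head token layout (1,8,19200,8).

        q is then multiplied element-wise with the depthwise 2D convolution result. The product is concatenated with a zero tensor of shape (1,8,1,8) in the class-token position, which produces EV_hat with shape (1,8,19201,8).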
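The padding expression used in the constructor keeps the spatial size unchanged for odd kernels. The sketch below checks this with the standard Conv2d output-size formula; both helper names are hypothetical.

```python
def same_padding(kernel, dilation=1):
    # Same expression as in ConvRelPosEnc.__init__.
    return (kernel + (kernel - 1) * (dilation - 1)) // 2


def conv_out_size(size, kernel, padding, dilation=1, stride=1):
    # Standard Conv2d output-size formula along one spatial axis.
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


for k in (3, 5, 7):
    p = same_padding(k)
    print(k, p, conv_out_size(120, k, p), conv_out_size(160, k, p))
```

For windows 3, 5 and 7 the padding is 1, 2 and 3, and a 120×160 map stays 120×160, which is why the three convolution outputs can be concatenated again.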

class ConvRelPosEnc(nn.Module):
    """ Convolutional relative position encoding. """
    def __init__(self, Ch, h, window):  # (8, 8, window=crpe_window={3:2, 5:3, 7:3})
        """
        Initialization.
            Ch: Channels per head.
            h: Number of heads.
            window: Window size(s) in convolutional relative positional encoding. It can have two forms:
                    1. An integer of window size, which assigns all attention heads with the same window size in ConvRelPosEnc.
                    2. A dict mapping window size to #attention head splits (e.g. {window size 1: #attention head split 1, window size 2: #attention head split 2})
                       It will apply different window size to the attention head splits.
        """
        # CoaT-Lite Small: embed_dims=[64, 128, 320, 512], serial_depths=[3, 4, 6, 3], parallel_depth=0, num_heads=8, mlp_ratios=[8, 8, 4, 4]
        super().__init__()

        if isinstance(window, int):
            window = {window: h}  # Set the same window size for all attention heads.
            self.window = window
        elif isinstance(window, dict):
            self.window = window  # {3:2, 5:3, 7:3}
        else:
            raise ValueError()

        self.conv_list = nn.ModuleList()
        self.head_splits = []
        for cur_window, cur_head_split in window.items():  # (3,2)/(5,3)/(7,3)
            dilation = 1  # Use dilation=1 at default.
            # Determine padding size. (3+(2)*(0))//2 = 1 for the 3x3 window.
            # Ref: https://discuss.pytorch.org/t/how-to-keep-the-shape-of-input-and-output-same-when-dilation-conv/14338
            padding_size = (cur_window + (cur_window - 1) * (dilation - 1)) // 2
            # (in, out, k, p, dilation, groups): (16,16,k=3,p=1,1,16)/(24,24,k=5,p=2,1,24)/(24,24,k=7,p=3,1,24)
            cur_conv = nn.Conv2d(cur_head_split*Ch, cur_head_split*Ch,
                kernel_size=(cur_window, cur_window),
                padding=(padding_size, padding_size),
                dilation=(dilation, dilation),
                groups=cur_head_split*Ch,
            )
            self.conv_list.append(cur_conv)
            self.head_splits.append(cur_head_split)  # [2,3,3]
        self.channel_splits = [x*Ch for x in self.head_splits]  # Ch=8, head_splits=[2,3,3] -> [16,24,24]

    def forward(self, q, v, size):  # q.shape == v.shape == (1,8,19201,8)
        B, h, N, Ch = q.shape  # B:1 h:8 N:19201 Ch:8
        H, W = size  # (120,160)
        assert N == 1 + H * W

        # Convolutional relative position encoding.
        q_img = q[:,:,1:,:]  # (1,8,19200,8)  Shape: [B, h, H*W, Ch].
        v_img = v[:,:,1:,:]  # (1,8,19200,8)  Shape: [B, h, H*W, Ch].

        v_img = rearrange(v_img, 'B h (H W) Ch -> B (h Ch) H W', H=H, W=W)  # (1,64,120,160)  Shape: [B, h, H*W, Ch] -> [B, h*Ch, H, W].
        v_img_list = torch.split(v_img, self.channel_splits, dim=1)  # channel_splits=[16,24,24]  Split according to channels.
        conv_v_img_list = [conv(x) for conv, x in zip(self.conv_list, v_img_list)]  # [(1,16,120,160),(1,24,120,160),(1,24,120,160)]
        conv_v_img = torch.cat(conv_v_img_list, dim=1)  # (1,64,120,160)
        conv_v_img = rearrange(conv_v_img, 'B (h Ch) H W -> B h (H W) Ch', h=h)  # (1,8,19200,8)  Shape: [B, h*Ch, H, W] -> [B, h, H*W, Ch].

        EV_hat_img = q_img * conv_v_img  # (1,8,19200,8)
        zero = torch.zeros((B, h, 1, Ch), dtype=q.dtype, layout=q.layout, device=q.device)  # (1,8,1,8)
        EV_hat = torch.cat((zero, EV_hat_img), dim=2)  # (1,8,19201,8)  Shape: [B, h, N, Ch].

        return EV_hat

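Back in FactorAtt_ConvRelPosEnc, the factorized attention result and the convolutional relative position encoding are added together, which completes the conv-attention computation. The (1,8,19201,8) tensor is reshaped to (1,19201,64) and passed through proj. This completes FactorAtt_ConvRelPosEnc.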
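Written as a formula, this step looks roughly as follows. This is a sketch based on my reading of the conv-attention formulation in the CoaT paper, not a line-by-line transcription of the code. Here Q, K, V are the per-head query, key and value, C_h is the number of channels per head (8 in this trace), the softmax is taken over the token dimension, and the circle denotes elementwise multiplication.

```latex
\mathrm{ConvAtt}(X)
  = \underbrace{\frac{Q}{\sqrt{C_h}} \left( \mathrm{softmax}(K)^{\top} V \right)}_{\text{factorized attention}}
  + \underbrace{Q \circ \mathrm{DepthwiseConv2D}(V)}_{\text{EV\_hat from ConvRelPosEnc}}
```

The second term is applied to image tokens only; the class token row of EV_hat is the zero tensor prepended in ConvRelPosEnc.forward.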

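Back in SerialBlock, the result is then passed to the feed-forward module. In the MLP layer the dimension goes 64 -> 512 -> 64, and the final output x has shape (1,19201,64). This completes SerialBlock.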

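In the overall CoaT class, self.serial_blocks1 contains three SerialBlocks (serial_depths[0]=3), so the blk loop runs three times and the final output has shape (1,19201,64). After the class token is removed the shape becomes (1,19200,64), which is then reshaped into a 2D feature map of shape (1,64,120,160). In the same way, the output of block1 becomes the input of block2, and the processing is the same as in block1. The final shape after block2 is (1,128,60,80), after block3 it is (1,320,30,40), and after block4 it is (1,512,15,20).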

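For classification, we take x4 before its class token is removed, with shape (1,301,512), and select the first token (the class token), which gives (1,512). This goes through a linear layer that outputs the final 1000 classes. This completes CoaT-Lite.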
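The per-stage shapes quoted in this walkthrough can be reproduced with a few lines of arithmetic. The sketch below rests on two assumptions that are not stated in the text: the debug input is 480x640 (inferred from the 120x160 grid), and the four patch embeddings downsample by 4, 2, 2, 2. The helper stage_shapes is hypothetical and only tracks shapes.

```python
def stage_shapes(img_hw, patch_sizes, embed_dims, mlp_ratios, batch=1):
    """Trace feature-map and token shapes through the serial stages.

    Hypothetical helper for shape bookkeeping only; it does not run the model.
    """
    h, w = img_hw
    stages = []
    for patch, dim, ratio in zip(patch_sizes, embed_dims, mlp_ratios):
        h, w = h // patch, w // patch  # each patch embedding downsamples
        stages.append({
            "feature_map": (batch, dim, h, w),
            "tokens": (batch, 1 + h * w, dim),  # +1 for the class token
            "mlp_hidden": dim * ratio,
        })
    return stages


stages = stage_shapes(
    (480, 640),                      # assumed input size
    patch_sizes=[4, 2, 2, 2],        # assumed downsampling per stage
    embed_dims=[64, 128, 320, 512],
    mlp_ratios=[8, 8, 4, 4],
)
for i, s in enumerate(stages, start=1):
    print(f"stage {i}: {s}")
```

The first stage gives 1 + 120 * 160 = 19201 tokens and an MLP hidden width of 64 * 8 = 512, and the last stage gives 1 + 15 * 20 = 301 tokens, matching the shapes in the text.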
