5. About Deformable DETR


Model architecture

In the source code example used here, the multi-scale setup has four feature levels.

[Figure: Deformable DETR model architecture]

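Drawbacks of DETR

In self-attention every query is compared against every key/value, so the cost grows very quickly with sequence length; for large input images the computation takes too long.
DETR performs poorly on small objects.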
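To make the first drawback concrete, here is a rough sketch that counts query-key pairs for full self-attention over a single feature map. The function name `dense_attention_pairs`, the stride of 32 and the floor division are assumptions made for illustration, not values taken from the source code.

```python
def dense_attention_pairs(height, width, stride=32):
    """Number of query-key pairs in full self-attention over one feature map."""
    # Assumption: one token per stride x stride patch of the input image.
    tokens = (height // stride) * (width // stride)
    return tokens * tokens


# Doubling each image side quadruples the tokens and multiplies the pairs by 16.
small = dense_attention_pairs(512, 512)    # 16 * 16 = 256 tokens
large = dense_attention_pairs(1024, 1024)  # 32 * 32 = 1024 tokens
print(small, large, large // small)
```

The quadratic growth is also why plain DETR cannot easily afford the high-resolution feature maps that small objects need.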

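Deformable DETR

Unlike the self-attention of a standard transformer, where every query is computed against every value, the attention in Deformable DETR lets each query attend only to a few sampled points around a reference point. Queries are no longer compared with all values, which greatly reduces the computation.
It also uses a multi-scale (multi-level) mechanism to extract features at several resolutions. Position information is used to line up the same point across the feature levels, and features are gathered from points near it on each level (in practice 4 nearby points). A location computed on a given level is usually not an integer, so its value is obtained by bilinear interpolation over the four surrounding pixels.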
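The interpolation step can be sketched in a few lines of plain Python. This is only an illustration of bilinear interpolation in pixel coordinates; the helper names `bilinear_neighbors` and `bilinear_sample` are made up here, and the sketch assumes that all four neighbors lie inside the feature map.

```python
import math


def bilinear_neighbors(x, y):
    """Return the four integer neighbors of (x, y) and their bilinear weights."""
    x0, y0 = math.floor(x), math.floor(y)
    dx, dy = x - x0, y - y0
    return [
        ((x0, y0), (1 - dx) * (1 - dy)),
        ((x0 + 1, y0), dx * (1 - dy)),
        ((x0, y0 + 1), (1 - dx) * dy),
        ((x0 + 1, y0 + 1), dx * dy),
    ]


def bilinear_sample(feature, x, y):
    """Interpolate a 2D list feature[row][col] at a fractional (x, y).

    Assumption: every neighbor index is inside the map (no padding handling).
    """
    return sum(w * feature[j][i] for (i, j), w in bilinear_neighbors(x, y))


feature = [[0.0, 1.0],
           [2.0, 3.0]]
# The point halfway between all four cells gets their plain average.
print(bilinear_sample(feature, 0.5, 0.5))
```

The closer the fractional location is to one of the four pixels, the larger that pixel's weight, and the four weights always sum to 1.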

BlockOne

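The build_backbone function builds the model's backbone network, i.e. the feature extractor. Commonly used backbones include ResNet and EfficientNet.

Here a ResNet is used to extract features at several levels.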

build_backbone -> Backbone  # call chain

def build_backbone(args):
    position_embedding = build_position_encoding(args)  # build the positional encoding module
    train_backbone = args.lr_backbone > 0
    return_interm_layers = args.masks or (args.num_feature_levels > 1)
    backbone = Backbone(args.backbone, train_backbone, return_interm_layers, args.dilation)
    # Joiner returns the multi-level image features together with their positional encodings
    model = Joiner(backbone, position_embedding)
    return model


class Backbone(BackboneBase):
    """ResNet backbone with frozen BatchNorm."""
    def __init__(self, name: str,
                 train_backbone: bool,
                 return_interm_layers: bool,
                 dilation: bool):
        norm_layer = FrozenBatchNorm2d
        backbone = getattr(torchvision.models, name)(
            replace_stride_with_dilation=[False, False, dilation],
            pretrained=is_main_process(), norm_layer=norm_layer)
        assert name not in ('resnet18', 'resnet34'), "number of channels are hard coded"
        super().__init__(backbone, train_backbone, return_interm_layers)
        if dilation:
            # dilation in the last stage halves its stride
            self.strides[-1] = self.strides[-1] // 2
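To get a feel for what the four levels look like, the sketch below estimates the spatial size of each level for a given input size. The strides (8, 16, 32, 64) are an assumption about the common four-level configuration (three ResNet stages plus one extra downsampled level), and rounding up is an approximation of the real convolution arithmetic; `feature_level_shapes` is a hypothetical helper, not a function from the repository.

```python
import math


def feature_level_shapes(height, width, strides=(8, 16, 32, 64)):
    """Approximate (H, W) of each feature level for a given input size.

    Assumption: each level is the input downsampled by its stride, rounded up.
    """
    return [(math.ceil(height / s), math.ceil(width / s)) for s in strides]


shapes = feature_level_shapes(512, 768)
print(shapes)
```

Each level has a quarter of the cells of the previous one, so the finest level dominates the total token count.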

DeformableTransformerEncoder

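Formula

[Figure: deformable attention formula]

M is the number of attention heads, A_mqk is the attention weight of sampling point k for query q in head m, and Δp_mqk is the corresponding sampling position offset.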
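Since the formula image is not available, here is the deformable attention definition as I recall it from the Deformable DETR paper, written with the same symbols. Treat it as a reconstruction: z_q is the query feature, p_q its reference point, x the input feature map, K the number of sampling points, and W_m, W'_m the per-head projection matrices.

```latex
\mathrm{DeformAttn}(z_q, p_q, x)
  = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m \, x(p_q + \Delta p_{mqk}) \Big],
  \qquad \sum_{k=1}^{K} A_{mqk} = 1

% Multi-scale version over L feature levels; \phi_l rescales the normalized
% reference point \hat{p}_q to the coordinates of level l.
\mathrm{MSDeformAttn}(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L})
  = \sum_{m=1}^{M} W_m \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W'_m \, x^l\big(\phi_l(\hat{p}_q) + \Delta p_{mlqk}\big) \Big],
  \qquad \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} = 1
```

In the multi-scale form the attention weights are normalized jointly over all levels and sampling points, which matches the softmax over n_levels * n_points in the MSDeformAttn code.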

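The Deformable DETR encoder is the key part of the model; this is where feature extraction is completed. The computation in the source code differs from what the figure above shows: the source flattens the four feature levels into sequences, records the start index of each level, and concatenates them into one very long sequence.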
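The bookkeeping behind that long sequence is just a prefix sum over the level sizes. A minimal sketch, assuming four example level shapes; `level_start_indices` is a hypothetical helper name:

```python
from itertools import accumulate


def level_start_indices(spatial_shapes):
    """Start offset of each level inside the flattened token sequence."""
    lengths = [h * w for h, w in spatial_shapes]
    # Exclusive prefix sum: level 0 starts at 0, level 1 at H_0 * W_0, and so on.
    starts = [0] + list(accumulate(lengths))[:-1]
    return starts, sum(lengths)


# Example shapes only; the real ones depend on the input image size.
spatial_shapes = [(64, 96), (32, 48), (16, 24), (8, 12)]
starts, total = level_start_indices(spatial_shapes)
print(starts, total)
```

With these offsets, the tokens of level l occupy positions starts[l] up to starts[l] + H_l * W_l in the flattened sequence, which is how a single tensor can still be split back into per-level maps.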


class DeformableTransformerEncoder(nn.Module):
    ...
    def forward(self, src, spatial_shapes, level_start_index, valid_ratios, pos=None, padding_mask=None):
        output = src
        # Reference points: every feature point gets one normalized (x, y) location
        # per feature level (4 here); they are computed from the grid, not learned
        reference_points = self.get_reference_points(spatial_shapes, valid_ratios, device=src.device)
        for _, layer in enumerate(self.layers):  # pass through the stacked encoder layers
            output = layer(output, pos, reference_points, spatial_shapes, level_start_index, padding_mask)
        return output

get_reference_points

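Initializes the reference points for each level. In the source code every feature point gets four reference points, one per feature level. In the encoder their coordinates are computed from the normalized grid-cell centers and valid_ratios rather than learned; what is learned are the sampling offsets applied around them.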

def get_reference_points(spatial_shapes, valid_ratios, device):
    reference_points_list = []  # reference points of each level
    # Iterate over the spatial shape of each level
    for lvl, (H_, W_) in enumerate(spatial_shapes):
        # Build the grid of cell-center coordinates
        ref_y, ref_x = torch.meshgrid(
            torch.linspace(0.5, H_ - 0.5, H_, dtype=torch.float32, device=device),  # y coordinates
            torch.linspace(0.5, W_ - 0.5, W_, dtype=torch.float32, device=device)   # x coordinates
        )
        # Flatten y and x and normalize them by the valid (non-padded) extent of the level
        ref_y = ref_y.reshape(-1)[None] / (valid_ratios[:, None, lvl, 1] * H_)
        ref_x = ref_x.reshape(-1)[None] / (valid_ratios[:, None, lvl, 0] * W_)
        # Combine into (x, y) pairs
        ref = torch.stack((ref_x, ref_y), -1)
        reference_points_list.append(ref)
    # Concatenate the reference points of all levels into one tensor
    reference_points = torch.cat(reference_points_list, 1)
    # Rescale by the valid ratio of every level: one reference point per level for each token
    reference_points = reference_points[:, :, None] * valid_ratios[:, None]
    return reference_points
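A tiny pure-Python version makes the grid easier to picture. This sketch assumes there is no padding (all valid_ratios equal to 1), in which case the reference points are simply the normalized centers of the grid cells; `reference_points_for_level` is a hypothetical name.

```python
def reference_points_for_level(height, width):
    """Normalized (x, y) centers of every cell in a height x width grid.

    Assumption: no padding, i.e. valid_ratios are all 1.
    """
    points = []
    for row in range(height):
        for col in range(width):
            # Cell centers sit at +0.5; dividing by the size maps them into (0, 1).
            points.append(((col + 0.5) / width, (row + 0.5) / height))
    return points


points = reference_points_for_level(2, 2)
print(points)
```

The points come out in row-major order, the same order in which the feature map is flattened, so token i and reference point i refer to the same cell.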

MSDeformAttn

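The most important part of the encoder is how the attention is computed. Compared with the self-attention of a standard transformer, the attention here is neither quite like standard self-attention nor like convolution.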


class MSDeformAttn(nn.Module):
    ...
    def forward(self, query, reference_points, input_flatten, input_spatial_shapes, input_level_start_index, input_padding_mask=None):
        r"""
        Forward pass.
        :param query:                    query tensor, shape (N, Length_{query}, C)
        :param reference_points:         reference points, shape (N, Length_{query}, n_levels, 2), range [0, 1],
                                         top-left (0, 0), bottom-right (1, 1), including the padding area;
                                         or (N, Length_{query}, n_levels, 4), with an extra (w, h) to form reference boxes
        :param input_flatten:            flattened input feature maps, shape (N, \sum_{l=0}^{L-1} H_l \cdot W_l, C)
        :param input_spatial_shapes:     spatial shapes, shape (n_levels, 2), e.g. [(H_0, W_0), (H_1, W_1), ..., (H_{L-1}, W_{L-1})]
        :param input_level_start_index:  start index of each level, shape (n_levels, ), e.g. [0, H_0*W_0, H_0*W_0+H_1*W_1, ...]
        :param input_padding_mask:       padding mask, shape (N, \sum_{l=0}^{L-1} H_l \cdot W_l), True for padded elements, False otherwise
        :return output:                  output features, shape (N, Length_{query}, C)
        """
        N, Len_q, _ = query.shape           # batch size and query length
        N, Len_in, _ = input_flatten.shape  # length of the flattened input
        # The flattened length must match the sum of the per-level sizes
        assert (input_spatial_shapes[:, 0] * input_spatial_shapes[:, 1]).sum() == Len_in

        # value comes from a linear projection of the flattened features
        value = self.value_proj(input_flatten)
        # Zero out the padded positions
        if input_padding_mask is not None:
            value = value.masked_fill(input_padding_mask[..., None], float(0))
        # Split the channels across heads
        value = value.view(N, Len_in, self.n_heads, self.d_model // self.n_heads)
        # Sampling offsets are predicted from the query by a linear layer
        sampling_offsets = self.sampling_offsets(query).view(N, Len_q, self.n_heads, self.n_levels, self.n_points, 2)
        # Attention weights are also predicted from the query, then normalized over all levels and points
        attention_weights = self.attention_weights(query).view(N, Len_q, self.n_heads, self.n_levels * self.n_points)
        attention_weights = F.softmax(attention_weights, -1).view(N, Len_q, self.n_heads, self.n_levels, self.n_points)
        # Compute the sampling locations
        if reference_points.shape[-1] == 2:
            # Reference points: normalize the offsets by each level's (W, H)
            offset_normalizer = torch.stack([input_spatial_shapes[..., 1], input_spatial_shapes[..., 0]], -1)
            sampling_locations = reference_points[:, :, None, :, None, :] \
                                 + sampling_offsets / offset_normalizer[None, None, None, :, None, :]
        elif reference_points.shape[-1] == 4:
            # Reference boxes: scale the offsets by the box width and height
            sampling_locations = reference_points[:, :, None, :, None, :2] \
                                 + sampling_offsets / self.n_points * reference_points[:, :, None, :, None, 2:] * 0.5
        else:
            raise ValueError(
                'Last dim of reference_points must be 2 or 4, but get {} instead.'.format(reference_points.shape[-1]))
        # Run the custom MSDeformAttnFunction to perform the deformable attention
        output = MSDeformAttnFunction.apply(
            value, input_spatial_shapes, input_level_start_index, sampling_locations, attention_weights, self.im2col_step)
        # Final linear projection
        output = self.output_proj(output)
        return output
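The two quantities predicted from the query can be followed by hand for one query and one head. The sketch below mirrors the 2-D reference point branch for a single level; `softmax` and `sampling_location` are hypothetical helpers, and the offsets and scores are made-up numbers rather than model outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sampling_location(reference_xy, offset_xy, level_hw):
    """Reference point plus a pixel offset normalized by the level's (W, H)."""
    height, width = level_hw
    ref_x, ref_y = reference_xy
    off_x, off_y = offset_xy
    return (ref_x + off_x / width, ref_y + off_y / height)


# One query, one head, one level of size 16 x 24 (H x W), two sampling points.
locations = [sampling_location((0.5, 0.5), off, (16, 24))
             for off in [(3.0, 0.0), (0.0, -2.0)]]
# Equal scores give equal weights; in the real module the softmax runs over
# all n_levels * n_points scores of a head at once.
weights = softmax([0.0, 0.0])
print(locations, weights)
```

Dividing by (W, H) means an offset of 3 moves the sample three pixels on that particular level, whatever its resolution.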

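Attention computation in MSDeformAttnFunction

In the encoder, q and v come from the same flattened features, and features are gathered from every level. The steps are: v is obtained from the input by a linear projection; q goes through a fully connected (fc) layer to produce the sampling offsets, which give the sampling locations; v is sampled at those locations by bilinear interpolation; q goes through another fully connected layer (followed by softmax) to produce attention_weights; finally the sampled v is weighted by attention_weights and summed to give the output features.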

def ms_deform_attn_core_pytorch(value, value_spatial_shapes, sampling_locations, attention_weights):
    r"""
    Core computation of deformable attention (for debugging and testing only; use the CUDA version in practice).
    :param value:                 input values, shape (N_, S_, M_, D_)
    :param value_spatial_shapes:  spatial shapes of the feature maps, shape (n_levels, 2), e.g. [(H_0, W_0), (H_1, W_1), ...]
    :param sampling_locations:    sampling locations, shape (N_, Lq_, M_, L_, P_, 2)
    :param attention_weights:     attention weights, shape (N_, Lq_, M_, L_, P_)
    :return:                      output features, shape (N_, Length_{query}, C)
    """
    N_, S_, M_, D_ = value.shape                      # batch, sequence length, heads, channels per head
    _, Lq_, M_, L_, P_, _ = sampling_locations.shape  # queries, heads, levels, points
    # Split the flattened values back into one chunk per level
    value_list = value.split([H_ * W_ for H_, W_ in value_spatial_shapes], dim=1)
    # Map the sampling locations from [0, 1] to [-1, 1], as expected by grid_sample
    sampling_grids = 2 * sampling_locations - 1
    sampling_value_list = []
    for lid_, (H_, W_) in enumerate(value_spatial_shapes):  # repeat for each of the four levels
        # Reshape this level's values into an image-like tensor: (N_*M_, D_, H_, W_)
        value_l_ = value_list[lid_].flatten(2).transpose(1, 2).reshape(N_*M_, D_, H_, W_)
        # Reshape this level's sampling grid: (N_*M_, Lq_, P_, 2)
        sampling_grid_l_ = sampling_grids[:, :, :, lid_].transpose(1, 2).flatten(0, 1)
        # Sample with bilinear interpolation: (N_*M_, D_, Lq_, P_)
        sampling_value_l_ = F.grid_sample(value_l_, sampling_grid_l_,
                                          mode='bilinear', padding_mode='zeros', align_corners=False)
        sampling_value_list.append(sampling_value_l_)
    # Reshape the attention weights to (N_*M_, 1, Lq_, L_*P_)
    attention_weights = attention_weights.transpose(1, 2).reshape(N_*M_, 1, Lq_, L_*P_)
    # Weighted sum over all levels and points gives the output features
    output = (torch.stack(sampling_value_list, dim=-2).flatten(-2) * attention_weights).sum(-1).view(N_, M_*D_, Lq_)
    return output.transpose(1, 2).contiguous()
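To check the logic without PyTorch, the same computation can be written for a single query and a single head with scalar features. This is a simplified sketch, not the repository's implementation: the helper names are hypothetical, each location holds one number instead of a D-dimensional vector, and the pixel mapping follows my reading of grid_sample with align_corners=False and zero padding.

```python
import math


def sample_zero_padded(feature, x, y):
    """Bilinear sample at pixel coords (x, y); out-of-range cells count as 0."""
    height, width = len(feature), len(feature[0])
    x0, y0 = math.floor(x), math.floor(y)
    dx, dy = x - x0, y - y0
    total = 0.0
    for j, wy in ((y0, 1 - dy), (y0 + 1, dy)):
        for i, wx in ((x0, 1 - dx), (x0 + 1, dx)):
            if 0 <= j < height and 0 <= i < width:
                total += wx * wy * feature[j][i]
    return total


def deform_attn_single_query(levels, locations, weights):
    """Weighted sum of sampled values for one query and one head.

    levels:    list of 2D lists, one scalar feature map per level
    locations: locations[l][p] = (x, y) in [0, 1] for level l, point p
    weights:   weights[l][p], expected to sum to 1 over all l and p
    """
    output = 0.0
    for feature, level_locs, level_w in zip(levels, locations, weights):
        height, width = len(feature), len(feature[0])
        for (loc_x, loc_y), w in zip(level_locs, level_w):
            # align_corners=False convention: (i + 0.5) / W is the center of pixel i.
            output += w * sample_zero_padded(
                feature, loc_x * width - 0.5, loc_y * height - 0.5)
    return output


levels = [[[1.0, 2.0], [3.0, 4.0]], [[10.0]]]
locations = [[(0.25, 0.25), (0.5, 0.5)], [(0.5, 0.5)]]
weights = [[0.25, 0.25], [0.5]]
print(deform_attn_single_query(levels, locations, weights))
```

The first sample lands exactly on a pixel center, the second on the middle of the 2 x 2 map, and the third on the only cell of the coarse level; the output is their weighted sum.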