Paper reading: DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution


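Contents

      • 1. Paper overview
      • 2. Baseline: HTC
      • 3. The ASPP module
      • 4. Implementation of the RFP module
      • 5. Implementation of the SAC module
      • 6. Differences between SAC and conditional convolution
      • 7. How the global context in SAC differs from SENet
      • 8. Ablation Studies
      • 9. State-of-the-art comparison on COCO test-dev
      • 10. Advantages of SAC and RFP (visualizations)
      • References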

1. Paper overview

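DetectoRS, the object detector proposed in this paper, had the best result on COCO at the time of writing (mAP: 54.7) and also performs well on instance segmentation and panoptic segmentation. The main reason is that the proposed improvements target the backbone and the FPN, so they carry over to several vision tasks. Other runner-up models such as ResNeSt and CBNet also improve the backbone. Perhaps the trend is that the overall architecture of object detectors is largely settled (anchor-free families aside). Some papers have also reported that much of the remaining detection error comes from classification performance that is hard to push higher, which is why recent improvements mostly target the backbone and the FPN; BiFPN is another example.
The paper makes two main contributions. At the macro level it proposes a recursive FPN, Recursive Feature Pyramid (RFP): the FPN outputs are fed back into the bottom-up backbone for another pass, and the new outputs are then fused with the original FPN outputs. At the micro level it proposes Switchable Atrous Convolution (SAC).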

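Note: the authors add the two proposed modules to HTC, so HTC is the baseline. The paper is very detailed and dense with practical information (the experiments section even describes the weight initialization), so reading the original paper is recommended; a few blog posts are not enough to fully understand these details.
[Figure: DetectoRS metric comparison]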

At the macro level, our proposed Recursive Feature Pyramid (RFP) builds on top of the Feature Pyramid Networks (FPN) [44] by incorporating extra feedback connections from the FPN layers into the bottom-up backbone layers, as illustrated in Fig. 1a. Unrolling the recursive structure to a sequential implementation, we obtain a backbone for object detector that looks at the images twice or more. Similar to the cascaded detector heads in Cascade R-CNN trained with more selective examples, our RFP recursively enhances FPN to generate increasingly powerful representations.

At the micro level, we propose Switchable Atrous Convolution (SAC), which convolves the same input feature with different atrous rates [11,30,53] and gathers the results using switch functions. Fig. 1b shows an illustration of the concept of SAC. The switch functions are spatially dependent, i.e., each location of the feature map might have different switches to control the outputs of SAC. To use SAC in the detector, we convert all the standard 3x3 convolutional layers in the bottom-up backbone to SAC, which improves the detector performance by a large margin. Some previous methods adopt conditional convolution, e.g., [39, 74], which also combines results of different convolutions as a single output. Unlike those methods whose architecture requires to be trained from scratch, SAC provides a mechanism to easily convert pretrained standard convolutional networks (e.g., ImageNet-pretrained [59] checkpoints). Moreover, a new weight locking mechanism is used in SAC where the weights of different atrous convolutions are the same except for a trainable difference.

Combining the proposed RFP and SAC results in our DetectoRS. To demonstrate its effectiveness, we incorporate DetectoRS into the state-of-the-art HTC [7] on the challenging COCO dataset [47].

2. Baseline: HTC

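[Figure: HTC architecture]
The main idea of HTC:
HTC improves information flow by combining cascading and multi-tasking at every stage, and it exploits spatial context to further improve accuracy. The whole network is a multi-task, multi-stage hybrid cascade: during training, the box and mask branches inside each stage are executed in an interleaved manner, and a direct information flow is introduced between the mask branches of different stages.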

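Summary:
(1) A multi-task, multi-stage hybrid cascade structure
(2) During training, the box and mask branches inside each stage are executed in an interleaved manner
(3) A direct information flow is introduced between the mask branches of different stages
(4) Semantic segmentation features are fused with the original box/mask branches to strengthen spatial context (module S in panel (d) of the figure)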

Reference: "Three steps forward in instance segmentation: from Mask R-CNN to Hybrid Task Cascade"

3. The ASPP module

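An Atrous Spatial Pyramid Pooling (ASPP) module implements the connection between the two cascaded passes of the recursive feature pyramid: it takes an FPN feature as input and transforms it into the RFP feature used in Figure 3.
[Figure: ASPP module]
The version used in this paper is modified: unlike the original ASPP, there is no convolutional layer after the concatenated features. The branch with kernel size 1 is still present, as the configuration below shows.

The ASPP here has four parallel branches that all take the same input; their outputs are concatenated along the channel dimension to form the final output. Three of the branches use a convolutional layer followed by a ReLU layer, and the number of output channels is 1/4 of the number of input channels. The last branch compresses the feature with a global average pooling layer, then uses a 1x1 convolutional layer and a ReLU layer to transform the compressed feature into a 1/4-size (channel-wise) feature, which is finally resized and concatenated with the features of the other three branches. The three convolutional branches are configured as kernel size = [1, 3, 3], atrous rate = [1, 3, 6], padding = [0, 3, 6]. Since each of the four branches produces a feature with 1/4 of the input channels, concatenating them yields a feature of the same size as the input feature of the module.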
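The channel and padding arithmetic above can be checked with a small shape calculation. This is only a sketch in plain Python, not the authors' implementation: the helper names `conv_out_size` and `aspp_output_shape` are hypothetical, and the 256-channel, 32x48 input is an assumed example size. It uses the standard convolution output-size formula to show that the quoted paddings keep the spatial size and that four 1/4-width branches restore the input channel count.

```python
def conv_out_size(size, kernel, dilation, padding, stride=1):
    """Spatial output size of a convolution (standard formula)."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


# Branch settings quoted in the text above.
KERNEL_SIZES = [1, 3, 3]
ATROUS_RATES = [1, 3, 6]
PADDINGS = [0, 3, 6]


def aspp_output_shape(channels, height, width):
    """Return (C, H, W) of the concatenated output of the four branches."""
    branch_channels = channels // 4
    shapes = []
    for k, r, p in zip(KERNEL_SIZES, ATROUS_RATES, PADDINGS):
        shapes.append((branch_channels,
                       conv_out_size(height, k, r, p),
                       conv_out_size(width, k, r, p)))
    # Global-average-pooling branch: pooled to 1x1, passed through a 1x1
    # convolution, then resized back to (height, width) before concatenation.
    shapes.append((branch_channels, height, width))
    # Concatenation along channels requires identical spatial sizes.
    assert all(s[1:] == (height, width) for s in shapes)
    return (sum(s[0] for s in shapes), height, width)


# Assumed example input: 256 channels, 32 x 48 feature map.
print(aspp_output_shape(256, 32, 48))
```

Each atrous branch has an effective kernel of `rate * (kernel - 1) + 1`, so a padding equal to the rate is exactly what keeps a 3x3 branch at the input resolution.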

4. Implementation of the RFP module

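[Figure 2: overall pipeline of RFP]
[Figure 3: RFP feature added to the first block of a ResNet stage]
[Figure 5: fusion module]

Figure 2 shows the overall flow of RFP. The feature maps produced by the first FPN pass are transformed by the ASPP module into the RFP features shown in Figure 3. The upper part of Figure 3 is the original ResNet bottom-up structure; the RFP features coming out of ASPP are added to it, which is how the authors integrate RFP into ResNet. Finally, the feature maps of the second pass are fused with those of the first pass using the operation shown in Figure 5, which the authors say is borrowed from RNNs.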
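The flow described above (FPN output, connecting module, second backbone pass, fusion) can be sketched with plain Python floats standing in for feature maps. This is a toy illustration under stated assumptions, not the paper's code: `toy_backbone`, `toy_fpn` and `toy_connect` are hypothetical stand-ins, and the gate direction in `fuse` (sigmoid gate on the new feature, one minus the gate on the old one) is an assumption about how the fusion of Figure 5 is weighted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse(old, new, weight=1.0, bias=0.0):
    """Gated sum of the previous and the new feature (fusion module sketch)."""
    gate = sigmoid(weight * new + bias)  # attention computed from the new feature
    return gate * new + (1.0 - gate) * old


def recursive_feature_pyramid(x, backbone, fpn, connect, steps=2):
    """Unrolled RFP: run backbone + FPN `steps` times with feedback."""
    feats = fpn(backbone(x, None))  # first pass: no feedback yet
    for _ in range(steps - 1):
        feedback = [connect(f) for f in feats]   # R(f), ASPP in the paper
        new_feats = fpn(backbone(x, feedback))   # look at the image again
        feats = [fuse(o, n) for o, n in zip(feats, new_feats)]
    return feats


# Hypothetical toy stand-ins: one float per pyramid level.
def toy_backbone(x, feedback):
    fb = feedback if feedback is not None else [0.0] * 4
    return [x * (i + 1) + fb[i] for i in range(4)]


def toy_fpn(c):
    out = [0.0] * len(c)
    top = 0.0
    for i in reversed(range(len(c))):  # top-down pathway
        out[i] = c[i] + 0.5 * top
        top = out[i]
    return out


def toy_connect(f):
    return 0.1 * f


print(recursive_feature_pyramid(1.0, toy_backbone, toy_fpn, toy_connect, steps=2))
```

With `steps=1` the function reduces to a plain backbone plus FPN, which mirrors the idea that RFP is an extension of the usual pipeline rather than a replacement.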

We make changes to the ResNet [28] backbone B to
allow it to take both x and R(f) as its input. ResNet has four
stages, each of which is composed of several similar blocks.
We only make changes to the first block of each stage, as
shown in Fig. 3. This block computes a 3-layer feature and
adds it to a feature computed by a shortcut. To use the feature
R(f), we add another convolutional layer with the kernel
size set to 1. The weight of this layer is initialized with 0 to
make sure it does not have any real effect when we load the
weights from a pretrained checkpoint.
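The effect of the zero initialization can be illustrated with a tiny example. This is a sketch with nested Python lists standing in for tensors (layout `[channel][position]`); `conv1x1`, `add` and `first_block` are hypothetical helper names, and the block output is taken before the final ReLU. It shows that with an all-zero 1x1 weight the modified block returns exactly what the pretrained block would return, whatever the RFP feature contains.

```python
def conv1x1(feature, weight):
    """1x1 convolution on a feature stored as [channel][position]."""
    c_in, n = len(feature), len(feature[0])
    return [[sum(row[i] * feature[i][p] for i in range(c_in)) for p in range(n)]
            for row in weight]


def add(a, b):
    """Element-wise sum of two features with the same shape."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def first_block(residual, shortcut, rfp_feature, rfp_weight):
    """Modified first block of a stage: pretrained sum plus the RFP term."""
    pretrained_out = add(residual, shortcut)
    return add(pretrained_out, conv1x1(rfp_feature, rfp_weight))


residual = [[1.0, 2.0], [3.0, 4.0]]      # output of the 3-layer branch
shortcut = [[0.5, 0.5], [0.5, 0.5]]      # output of the shortcut
rfp_feature = [[9.0, 9.0], [9.0, 9.0]]   # R(f), arbitrary values
zero_weight = [[0.0, 0.0], [0.0, 0.0]]   # initialization described in the quote

print(first_block(residual, shortcut, rfp_feature, zero_weight))
print(add(residual, shortcut))  # same values: the new layer has no effect yet
```

Once training starts, the 1x1 weight moves away from zero and the feedback path begins to contribute.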

We use Atrous Spatial Pyramid Pooling (ASPP) [12] to implement the connecting module R, which takes a feature f_i^t as its input and transforms it to the RFP feature used in Fig. 3. In this module, there are four parallel branches that take f_i^t as their inputs, the outputs of which are then concatenated together along the channel dimension to form the final output of R. Three branches of them use a convolutional layer followed by a ReLU layer, the number of the output channels is 1/4 the number of the input channels. The last branch uses a global average pooling layer to compress the feature, followed by a 1x1 convolutional layer and a ReLU layer to transform the compressed feature to a 1/4-size (channel-wise) feature. Finally, it is resized and concatenated with the features from the other three branches.

5. Implementation of the SAC module

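[Figure 4: Switchable Atrous Convolution]
The mathematical expression of SAC:
Conv(x, w, 1) → S(x) · Conv(x, w, 1) + (1 − S(x)) · Conv(x, w + ∆w, r)
Here x is the input, w the weight, r the atrous rate, ∆w a trainable weight difference and S(·) the switch function.
Note: the switch function is implemented as an average pooling layer with a 5x5 kernel followed by a 1x1 convolutional layer.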

Below is the explanation of the lock shown in Figure 4:

We propose a locking mechanism by setting one weight
as w and the other as w + ∆w for the following reasons.
Object detectors usually use pretrained checkpoints to initialize the weights. However, for an SAC layer converted
from a standard convolutional layer, the weight for the larger
atrous rate is missing. Since objects at different scales can
be roughly detected by the same weight with different atrous
rates, it is natural to initialize the missing weights with those
in the pretrained model. Our implementation uses w + ∆w
for the missing weight where w is from the pretrained checkpoint and ∆w is initialized with 0. When fixing ∆w = 0,
we observe a drop of 0.1% AP. But ∆w alone without the
locking mechanism degrades AP a lot.

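(1) The locking mechanism is the authors' way of combining ImageNet-pretrained models with SAC, so the backbone does not have to be trained from scratch. The newly added convolutions whose atrous rate is not 1 start from the pretrained weights, but they are given a ∆w so that they can still learn their own difference.

(2) The locking mechanism in SAC sets one weight to w and the other to w + ∆w, for the following reason. Object detectors usually initialize their weights from a pretrained checkpoint, but for an SAC layer converted from a standard convolutional layer the weight for the larger atrous rate is missing. Since objects at different scales can be roughly detected by the same weight with different atrous rates, it is natural to initialize the missing weight with the one in the pretrained model. The paper uses w + ∆w for the missing weight, where w comes from the pretrained checkpoint and ∆w is initialized to 0. With ∆w fixed at 0, the authors observe an AP drop of 0.1%, while using ∆w alone without the locking mechanism degrades AP a lot.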
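A one-dimensional toy version makes the switch and the weight locking concrete. This is a sketch under several assumptions, not the authors' code: the signal is 1D instead of 2D, the kernel has 3 taps, the pooling uses edge replication at the borders, the rate `r = 3` is just an example value, and initializing the switch with weight 0 and bias 1 is an assumed choice that makes the converted layer reproduce the pretrained convolution exactly. The function names are hypothetical.

```python
def dilated_conv1d(x, w, rate):
    """Same-size 1D convolution with a 3-tap kernel and zero padding."""
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for k, wk in enumerate(w):  # taps at i - rate, i, i + rate
            j = i + (k - 1) * rate
            if 0 <= j < n:
                acc += wk * x[j]
        out.append(acc)
    return out


def avg_pool5(x):
    """Average over a window of 5 positions, replicating the edges."""
    n = len(x)
    return [sum(x[min(max(i + d, 0), n - 1)] for d in range(-2, 3)) / 5.0
            for i in range(n)]


def sac1d(x, w, delta_w, switch_weight, switch_bias, rate=3):
    """S(x) * Conv(x, w, 1) + (1 - S(x)) * Conv(x, w + dw, rate)."""
    # Switch: average pooling followed by a 1x1 (here scalar) convolution.
    s = [switch_weight * a + switch_bias for a in avg_pool5(x)]
    small = dilated_conv1d(x, w, 1)                     # pretrained branch
    w_large = [wk + dk for wk, dk in zip(w, delta_w)]   # locked weight w + dw
    large = dilated_conv1d(x, w_large, rate)            # larger atrous rate
    return [si * a + (1.0 - si) * b for si, a, b in zip(s, small, large)]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
w = [1.0, 0.0, -1.0]           # stands in for a pretrained kernel
delta_w = [0.0, 0.0, 0.0]      # initialized with 0, as in the quote

# Switch fully on the rate-1 branch: identical to the original convolution.
print(sac1d(x, w, delta_w, switch_weight=0.0, switch_bias=1.0))
print(dilated_conv1d(x, w, 1))
```

Because the large-rate branch stores only `delta_w`, any update to `w` moves both branches together, which is the sense in which the two weights are locked.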

The authors' explanation of the Pre-Global Context module in Figure 4:

As shown in Fig. 4, we insert two global context modules
before and after the main component of SAC. These two
modules are light-weighted as the input features are first
compressed by a global average pooling layer. The global
context modules are similar to SENet [31] except for two
major differences:
(1) we only have one convolutional layer
without any non-linearity layers, and (2) the output is added
back to the main stream instead of multiplying the input by
a re-calibrating value computed by Sigmoid.
Experimentally, we found that adding the global context information
before the SAC component (i.e., adding global information
to the switch function) has a positive effect on the detection
performance. We speculate that this is because S can make
more stable switching predictions when global information
is available. We then move the global information outside
the switch function and place it before and after the major
body so that both Conv and S can benefit from it. We did
not adopt the original SENet formulation as we found no
improvement on the final model AP. In the ablation study in
Sec. 5, we show the performances of SAC with and without
the global context modules.

6. Differences between SAC and conditional convolution

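SAC can reuse ImageNet-pretrained weights (the kernel size is unchanged; only the atrous rate differs). It also adds global information, obtained by global average pooling, before and after the atrous convolution, and it has a locking mechanism that ties together the weights of the atrous convolutions with different rates.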

Conditional convolutional networks adopt dynamic kernels, widths, or depths, e.g.,
[16,39,43,48,74,77]. Unlike them, our proposed Switchable
Atrous Convolution (SAC) allows an effective conversion
mechanism from standard convolutions to conditional convolutions without changing any pretrained models.
SAC is thus a plug-and-play module for many pretrained backbones.
Moreover, SAC uses global context information and a novel
weight locking mechanism to make it more effective.

7. How the global context in SAC differs from SENet

These two modules are light-weighted as the input features are first compressed by a global average pooling layer. The global context modules are similar to SENet [31] except for two major differences:
(1) we only have one convolutional layer without any non-linearity layers, and
(2) the output is added back to the main stream instead of multiplying the input by a re-calibrating value computed by Sigmoid.
Experimentally, we found that adding the global context information before the SAC component (i.e., adding global information to the switch function) has a positive effect on the detection performance.
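The two differences can be put side by side in a few lines. This is a minimal sketch with nested lists as features (layout `[channel][position]`), not either paper's implementation; the function names are hypothetical and the SENet variant is reduced to its usual pool, FC, ReLU, FC, Sigmoid, multiply pattern.

```python
import math


def global_avg_pool(feature):
    """feature is [channel][position]; returns one mean per channel."""
    return [sum(row) / len(row) for row in feature]


def linear(vec, weight):
    """Matrix-vector product, i.e. a 1x1 convolution on a pooled feature."""
    return [sum(wi * v for wi, v in zip(row, vec)) for row in weight]


def global_context(feature, weight):
    """SAC style: pool, one linear layer, no non-linearity, add back."""
    ctx = linear(global_avg_pool(feature), weight)
    return [[v + c for v in row] for row, c in zip(feature, ctx)]


def squeeze_excitation(feature, w1, w2):
    """SENet style: pool, FC, ReLU, FC, Sigmoid, multiply."""
    hidden = [max(0.0, h) for h in linear(global_avg_pool(feature), w1)]
    scale = [1.0 / (1.0 + math.exp(-z)) for z in linear(hidden, w2)]
    return [[v * s for v in row] for row, s in zip(feature, scale)]


feature = [[1.0, 3.0], [2.0, 6.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

print(global_context(feature, identity))          # channel means added back
print(global_context(feature, zeros))             # zero weight: input unchanged
print(squeeze_excitation(feature, zeros, zeros))  # zero weights: scaled by 0.5
```

One practical consequence of the additive form is visible in the last two lines: a zero-initialized global context module leaves the feature untouched, whereas a zero-initialized Sigmoid gate halves it.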

8. Ablation Studies

[Tables: ablation study results]
Note: DS denotes Dual Switch, i.e. two independent switch functions S1(x) and S2(x) instead of S1(x) and 1 − S1(x).

For RFP, we show 'RFP + sharing' where B_i^1 and B_i^2 share their weights.

9. State-of-the-art comparison on COCO test-dev

[Table: state-of-the-art comparison on COCO test-dev]

10. Advantages of SAC and RFP (visualizations)

[Figure 6: visualization of the results by HTC, 'HTC + RFP' and 'HTC + SAC']

Fig. 6 provides visualization of the results by HTC, ‘HTC+ RFP’ and ‘HTC + SAC’.
From this comparison, we notice
that RFP, similar to human visual perception that selectively
enhances or suppresses neuron activations, is able to find
occluded objects more easily for which the nearby context
information is more critical. SAC, because of its ability
to increase the field-of-view as needed, is more capable of
detecting large objects in the images. This is also consistent
with the results of SAC shown in Tab. 2 where it has a higher
APL.

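References

1. "Major open-source release! The new object detection network DetectoRS: 54.7 AP, the perfect combination of feature pyramid and atrous convolution"

2. "A detailed look at the new object detection network DetectoRS: 54.7 AP, the perfect combination of feature pyramid and atrous convolution"

3. "Three steps forward in instance segmentation: from Mask R-CNN to Hybrid Task Cascade"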




http://www.chinasem.cn/article/167657
