Vision Transformer with Sparse Scan Prior

2024-06-18 21:04

This post introduces the paper Vision Transformer with Sparse Scan Prior; we hope it offers a useful reference, and interested developers are welcome to read along.

Abstract

https://arxiv.org/pdf/2405.13335v1
In recent years, Transformers have achieved remarkable progress in computer vision tasks. However, their global modeling often comes with substantial computational overhead, in stark contrast to the human eye's efficient information processing. Inspired by the human eye's sparse scanning mechanism, we propose a Sparse Scan Self-Attention mechanism (S³A). This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors, avoiding redundant global modeling and excessive focus on local information. This approach mirrors the human eye's functionality and significantly reduces the computational load of vision models. Building on S³A, we introduce the Sparse Scan Vision Transformer (SSViT). Extensive experiments demonstrate the outstanding performance of SSViT across a variety of tasks. Specifically, on ImageNet classification, without additional supervision or training data, SSViT achieves top-1 accuracies of 84.4%/85.7% with 4.4G/18.2G FLOPs. SSViT also excels in downstream tasks such as object detection, instance segmentation, and semantic segmentation. Its robustness is further validated across diverse datasets. Code will be available at https://github.com/qhfan/SSViT.
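
To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of sparse-scan style attention: each query attends only to small local windows gathered around a sparse grid of anchors rather than to every token. This is an illustration of the idea described in the abstract, not the authors' S³A implementation (which predefines per-token Anchors of Interest); the parameter names `anchor_stride` and `window` are made up for the example.

```python
# Minimal sketch of sparse-scan style attention (illustration only, not the
# authors' S^3A code). Assumptions: a single head, no learned projections or
# positional encoding, a shared sparse grid of anchors rather than per-token
# Anchors of Interest, and hypothetical parameter names (anchor_stride, window).
import torch
import torch.nn.functional as F


def sparse_scan_attention(x, anchor_stride=4, window=3):
    """x: (B, H, W, C) feature map. Every query attends only to the tokens
    inside small windows centred on a sparse grid of anchors, instead of to
    all H*W keys as in full global self-attention."""
    B, H, W, C = x.shape
    q = x.reshape(B, H * W, C)                                # all tokens act as queries

    # Anchors: every anchor_stride-th position along each spatial dimension.
    ys = torch.arange(anchor_stride // 2, H, anchor_stride).tolist()
    xs = torch.arange(anchor_stride // 2, W, anchor_stride).tolist()

    # Gather a (window x window) neighbourhood around every anchor as keys/values.
    pad = window // 2
    xp = F.pad(x.permute(0, 3, 1, 2), (pad, pad, pad, pad))   # (B, C, H+2*pad, W+2*pad)
    patches = []
    for yi in ys:
        for xi in xs:
            patch = xp[:, :, yi:yi + window, xi:xi + window]  # (B, C, window, window)
            patches.append(patch.reshape(B, C, -1))
    kv = torch.cat(patches, dim=2).transpose(1, 2)            # (B, N_kv, C), N_kv << H*W

    attn = torch.softmax(q @ kv.transpose(1, 2) / C ** 0.5, dim=-1)
    return (attn @ kv).reshape(B, H, W, C)


out = sparse_scan_attention(torch.randn(2, 16, 16, 64))
print(out.shape)  # torch.Size([2, 16, 16, 64])
```

With a 16×16 map, `anchor_stride=4` and `window=3`, each query attends to 16 anchors × 9 positions = 144 keys instead of 256, and the saving grows quickly on larger feature maps.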
1 Introduction
Since its inception, the Vision Transformer (ViT) [12] has attracted considerable attention from the research community, primarily owing to its exceptional capability in modeling long-range dependencies. However, the self-attention mechanism [61], as the core of ViT, imposes significant computational overhead, thus constraining its broader applicability. Several strategies have been proposed to alleviate this limitation of self-attention. For instance, methods such as Swin-Transformer [40, 11] group tokens for attention, reducing computational costs and enabling the model to focus more on local information. Techniques like PVT [63, 64, 18, 16, 29] down-sample tokens to shrink the size of the QK matrix, thus lowering computational demands while retaining global information. Meanwhile, approaches such as UniFormer [35, 47] forgo attention operations in the early stages of visual modeling, opting instead for lightweight convolution. Furthermore, some models [50] enhance computational efficiency by pruning redundant tokens.
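
As a point of reference for the down-sampling strategy mentioned above, here is a minimal sketch in the spirit of PVT-style spatial-reduction attention: queries stay at full resolution while the keys and values are pooled, which shrinks the QK matrix. This is a simplified illustration (single head, no learned projections, average pooling instead of a strided convolution), and `sr_ratio` is a placeholder name, not code from any of the cited models.

```python
# Minimal sketch of the "down-sample the keys/values" idea (PVT-style
# spatial-reduction attention); illustration only, not code from the cited
# models. Single head, no learned projections; sr_ratio is a placeholder name.
import torch
import torch.nn.functional as F


def spatial_reduction_attention(x, sr_ratio=4):
    """x: (B, H, W, C). Queries keep full resolution; keys/values are
    average-pooled by sr_ratio, shrinking the QK matrix from
    (H*W x H*W) to (H*W x H*W/sr_ratio^2)."""
    B, H, W, C = x.shape
    q = x.reshape(B, H * W, C)
    kv = F.avg_pool2d(x.permute(0, 3, 1, 2), sr_ratio)  # (B, C, H/sr, W/sr)
    kv = kv.flatten(2).transpose(1, 2)                   # (B, H*W/sr^2, C)
    attn = torch.softmax(q @ kv.transpose(1, 2) / C ** 0.5, dim=-1)
    return (attn @ kv).reshape(B, H, W, C)


print(spatial_reduction_attention(torch.randn(1, 32, 32, 64)).shape)
# torch.Size([1, 32, 32, 64])
```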
Despite these advancements, the majority of methods primarily focus on reducing the token count in self-attention operations to boost ViT efficiency, often neglecting the manner in which human eyes process visual information. The human visual system operates in a notably less intricate yet highly efficient manner compared to ViT models. Unlike the fine-grained local spatial information modeling in models like Swin [40], NAT [20], LVT [69], or the indistinct global information modeling seen in models like PVT [63], PVTv2 [64], CMT [18], human vision employs a sparse scanning mechanism.

That concludes this introduction to Vision Transformer with Sparse Scan Prior; we hope the article is helpful to fellow programmers!



