Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks


Oscar Pre-training

Input

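Oscar represents each input image-text pair as a Word-Tag-Image triple $(w, q, v)$, where $w$ is the sequence of word embeddings of the text, $q$ is the sequence of word embeddings of the object tags detected in the image (fed to the model as text), and $v$ is the set of region vectors of the image.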
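The object tags are used as anchor points to align image regions with word embeddings of pre-trained language models. Introducing $q$ as anchor points strengthens the model's image-text alignment, based on the following observation: in an image-text pair, the salient objects in the image are usually also mentioned in the text, using words that are the same as or synonymous with the object tags. Because $q$ and $w$ both belong to the language modality, the model finds alignments between them more easily. If a word embedding in the text is similar to an object tag, that word embedding should also receive a large attention weight on the image region the tag corresponds to (a dictionary look-up). The approach also reduces ambiguity among image regions, that is, it helps distinguish objects that look very similar in the vision space but are very different in the language space. The image regions fed to V+L models are usually over-sampled, so different regions overlap heavily and are hard to tell apart from visual features alone. As panel (c) of the paper's figure shows, couch and dog have very similar image region features but are far apart in the word semantic space.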
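$v$ and $q$ are generated as follows. Given an image with $K$ regions of objects (normally over-sampled and noisy), Faster R-CNN extracts a feature pair $(v', z)$ for each region, where $v' \in \mathbb{R}^P$ is a $P$-dimensional region feature ($P = 2048$) and $z$ is an $R$-dimensional region position vector ($R = 4$ or $6$; it includes the coordinates of the top-left and bottom-right corners, and/or the height and width). $v'$ and $z$ are concatenated and passed through a fully connected layer that maps them to the same dimension as the word embeddings, which gives $v$. The same Faster R-CNN also detects a set of high-precision object tags, and $q$ is the sequence of word embeddings of those tags.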
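The construction of a single region vector $v$ described above can be sketched in plain Python. This is only an illustration under stated assumptions: the output size `H = 768` (the BERT base hidden size), the normalization of box coordinates by the image size, the random untrained weights, and the helper names `position_vector` and `project_region` are all choices of this sketch, not details taken from the paper or the released code.

```python
import random

P = 2048  # region feature size, from the text
R = 6     # position vector size (the text allows 4 or 6)
H = 768   # assumed word-embedding size (BERT base); not stated in the text


def position_vector(x1, y1, x2, y2, img_w, img_h):
    """Build a 6-d position vector: corners plus width and height.

    Dividing by the image size is an assumption of this sketch.
    """
    return [
        x1 / img_w, y1 / img_h,
        x2 / img_w, y2 / img_h,
        (x2 - x1) / img_w, (y2 - y1) / img_h,
    ]


def project_region(v_prime, z, weight, bias):
    """Concatenate v' and z, then apply a fully connected layer W [v'; z] + b."""
    x = list(v_prime) + list(z)
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(weight, bias)
    ]


rng = random.Random(0)
v_prime = [rng.random() for _ in range(P)]  # stand-in for a Faster R-CNN feature
z = position_vector(48, 30, 208, 150, img_w=640, img_h=480)
# Random, untrained weights; in the real model this layer is learned.
weight = [[rng.gauss(0.0, 0.02) for _ in range(P + R)] for _ in range(H)]
bias = [0.0] * H

v = project_region(v_prime, z, weight, bias)
print(len(v))  # 768
```

In a real model the same layer would be applied to all $K$ regions at once as a batched matrix product; the per-region loop here only keeps the example free of dependencies.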

Pre-Training Objective

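The Oscar input $(w, q, v)$ can be viewed from two perspectives:

$$x \triangleq [\underbrace{w}_{\text{language}}, \underbrace{q, v}_{\text{image}}] = [\underbrace{w, q}_{\text{language}}, \underbrace{v}_{\text{image}}] \triangleq x'$$

where $x$ is the modality view, which distinguishes the text representation from the image representation, and $x'$ is the dictionary view, which distinguishes the two semantic spaces in which the input is represented. These two perspectives lead to a novel pre-training objective.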
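A Dictionary View: Masked Token Loss. Let $h = [w, q]$ denote the discrete token sequence. As with the masked language model in BERT, a Masked Token Loss (MTL) serves as a pre-training task. At each iteration, 15% of the tokens in $h$ are replaced with [MASK], and the model is trained to predict each masked token $h_i$ from the surrounding tokens $h_{\backslash i}$ and all image features $v$ by minimizing the negative log-likelihood:

$$\mathcal{L}_{\text{MTL}} = -\mathbb{E}_{(v, h) \sim \mathcal{D}} \log p(h_i \mid h_{\backslash i}, v)$$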

On the term "dictionary": a semantic space can be viewed as a vector space defined by a dictionary, which maps an input to a vector representation in the semantic space. For example, BERT can be viewed as a dictionary that defines a linguistic semantic space. BERT maps an input word or word sequence into a feature vector in the semantic space.
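The masking step of the Masked Token Loss can be sketched as follows. This is a minimal illustration, not the released implementation: it masks a fixed fraction of positions (15%, rounded, at least one), whereas masking each token independently with probability 0.15 would be an equally valid reading. The function names and the dummy "model" that assigns probability 0.5 to every true token are hypothetical.

```python
import math
import random

MASK = "[MASK]"


def mask_tokens(h, mask_ratio=0.15, rng=None):
    """Replace a fraction of the tokens in h with [MASK].

    Returns (masked sequence, {position: original token}).
    """
    if not h:
        return [], {}
    rng = rng or random.Random()
    n_mask = max(1, round(mask_ratio * len(h)))
    positions = sorted(rng.sample(range(len(h)), n_mask))
    masked = list(h)
    targets = {}
    for i in positions:
        targets[i] = masked[i]
        masked[i] = MASK
    return masked, targets


def masked_token_loss(predicted_probs, targets):
    """Average negative log-likelihood of the true token at each masked position.

    predicted_probs[i] maps token -> probability (a stand-in for model output).
    """
    total = sum(math.log(predicted_probs[i][tok]) for i, tok in targets.items())
    return -total / len(targets)


w = "a dog is sitting on a couch".split()  # caption tokens
q = ["dog", "couch"]                        # object tags
h = w + q                                   # discrete token sequence [w, q]

masked, targets = mask_tokens(h, rng=random.Random(0))
# Dummy "model": probability 0.5 for the true token at every masked position.
probs = {i: {tok: 0.5} for i, tok in targets.items()}
loss = masked_token_loss(probs, targets)
print(masked, targets, round(loss, 4))
```

Because masking runs over the concatenation of $w$ and $q$, either a caption word or an object tag can be hidden, and in both cases the model can use the image regions $v$ to recover it.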

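A Modality View: Contrastive Loss. Let $h' = [q, v]$ denote the image modality. With 50% probability, $q$ is replaced by a tag sequence randomly sampled from the dataset $\mathcal{D}$, producing a "polluted" image representation. A fully connected layer $f(.)$ on top of the [CLS] output acts as a binary classifier that predicts whether the pair $(w, h')$ contains the original image representation ($y = 1$) or a polluted one ($y = 0$). The contrastive loss is

$$\mathcal{L}_{\text{C}} = -\mathbb{E}_{(h', w) \sim \mathcal{D}} \log p(y \mid f(h', w))$$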
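The construction of original and polluted examples for the contrastive loss can be sketched like this. It is a toy illustration under assumptions: the dataset entries, the choice to sample the replacement tags from a different example, and the names `make_contrastive_example` and `binary_cross_entropy` are hypothetical, and the classifier output is replaced by a fixed probability.

```python
import math
import random


def make_contrastive_example(index, dataset, rng, pollute_prob=0.5):
    """Return (w, q, v, y) for dataset[index].

    With probability pollute_prob, q is swapped for the tags of another
    randomly chosen example and y is 0; otherwise q is kept and y is 1.
    """
    w, q, v = dataset[index]
    if len(dataset) > 1 and rng.random() < pollute_prob:
        other = rng.choice([j for j in range(len(dataset)) if j != index])
        return w, dataset[other][1], v, 0
    return w, q, v, 1


def binary_cross_entropy(p_original, y):
    """-log p(y | f(h', w)), where p_original is the predicted P(y = 1)."""
    return -math.log(p_original if y == 1 else 1.0 - p_original)


# Toy dataset of (caption tokens, object tags, region-feature placeholder).
dataset = [
    (["a", "dog", "on", "a", "couch"], ["dog", "couch"], "regions_0"),
    (["two", "zebras", "grazing"], ["zebra", "grass"], "regions_1"),
    (["a", "man", "riding", "a", "bike"], ["person", "bicycle"], "regions_2"),
]

rng = random.Random(0)
for _ in range(4):
    w, q, v, y = make_contrastive_example(0, dataset, rng)
    # 0.9 stands in for the classifier's predicted probability of "original".
    print(y, q, round(binary_cross_entropy(0.9, y), 4))
```

Only the tags are swapped; the caption $w$ and the region features $v$ stay the same, so the classifier has to judge whether the tags are consistent with the rest of the input.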
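The full pre-training objective is the sum of the two losses:

$$\mathcal{L}_{\text{Pre-training}} = \mathcal{L}_{\text{MTL}} + \mathcal{L}_{\text{C}}$$

where $\mathcal{L}_{\text{MTL}}$ is the masked token loss and $\mathcal{L}_{\text{C}}$ is the contrastive loss. During the cross-modal pre-training, we utilize object tags as the proxy of images to adjust the word embedding space of BERT, where a text is similar to its paired image (or more specifically, the object tags detected from the image), and dissimilar to the polluted ones.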

Pre-training Corpus

We have built the pre-training corpus based on the existing V+L datasets, including COCO, Conceptual Captions (CC), SBU Captions, Flickr30k, GQA, etc. In total, the unique image set is 4.1 million, and the corpus consists of 6.5 million text-tag-image triples, which is fewer than the 9.6 million pairs used for UNITER pre-training and the 9.18 million pairs used for LXMERT.

Implementation Details

We pre-train two models, $\text{OSCAR}_\text{B}$ and $\text{OSCAR}_\text{L}$, initialized with the parameters of $\text{BERT}_\text{BASE}$ and $\text{BERT}_\text{LARGE}$ respectively. The sequence lengths of the discrete tokens $h$ and the region features $v$ are 35 and 50, respectively.
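Fixed sequence lengths of 35 tokens and 50 regions mean that every input has to be truncated or padded to a total of 85 positions. The text does not say how the released code does this, so the following is only one plausible sketch: the `[PAD]` token, the zero-vector padding for regions, and the helper `fit_length` are assumptions.

```python
MAX_TOKENS = 35   # length of the discrete token sequence h, from the text
MAX_REGIONS = 50  # number of region features v, from the text
PAD = "[PAD]"     # assumed padding token, borrowed from BERT


def fit_length(items, length, pad_item):
    """Truncate or right-pad a sequence; return it with a 1/0 attention mask."""
    kept = list(items[:length])
    n_pad = length - len(kept)
    mask = [1] * len(kept) + [0] * n_pad
    return kept + [pad_item] * n_pad, mask


h = "a dog is sitting on a couch".split() + ["dog", "couch"]
regions = [[0.0] * 4 for _ in range(60)]  # 60 placeholder regions, 4-d for brevity

h_fixed, h_mask = fit_length(h, MAX_TOKENS, PAD)
v_fixed, v_mask = fit_length(regions, MAX_REGIONS, [0.0] * 4)

print(len(h_fixed), sum(h_mask))  # 35 9
print(len(v_fixed), sum(v_mask))  # 50 50
```

The attention masks mark which positions hold real content, so padded positions can be ignored by the Transformer.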

Adapting to V+L Tasks

Experimental Results & Analysis

Performance Comparison with SoTA


Qualitative Studies

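We use $t$-SNE to visualize the features of image regions and word tokens on a 2D map. With the help of object tags, the distance between the same object in the two modalities is substantially reduced (e.g. person, zebra), and objects with related semantics also move closer together (e.g. the animals person, zebra, sheep and bird).
This further underlines the importance of object tags for alignment learning: they play the role of anchor points in linking and regularizing the cross-modal feature learning.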

References

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
The code and pre-trained models are released: https://github.com/microsoft/Oscar
