[Paper Close Reading] Do Transformers Really Perform Bad for Graph Representation?


Paper: [2106.05234] Do Transformers Really Perform Bad for Graph Representation? (arxiv.org)

Code: https://github.com/Microsoft/Graphormer

The English notes are typed by hand and summarize/paraphrase the original paper, so occasional spelling and grammar mistakes are hard to avoid; corrections in the comments are welcome! This post is written as study notes, so read it with that in mind.

1. TL;DR

1.1. Takeaways

1.2. Paper Summary Figure

2. Section-by-Section Close Reading

2.1. Abstract

        ①Transformers have not achieved ideal performance on graph representation learning compared with mainstream GNN variants

        ②The authors put forward Graphormer to change this situation

leaderboard  n. ranking board / leaderboard; (also) a full-width banner advertisement

2.2. Introduction

        ①Graphormer performs outstandingly on the Open Graph Benchmark Large-Scale Challenge (OGB-LSC) and on several popular leaderboards such as OGB and Benchmarking-GNN

        ②The Transformer only takes node similarity into consideration and does not model structural relationships. Ergo, Graphormer adds structural encodings

        ③They capture node importance with Centrality Encoding (extracted via degree centrality) and represent structural relationships with Spatial Encoding

        ④Graphormer occupies the top spot on the OGB-LSC, MolHIV, MolPCBA, ZINC and other leaderboards

de-facto  in practice: holding a position or power in reality rather than by law or formal designation

canonical  adj. standard, accepted, authoritative; (of a mathematical expression) in its simplest standard form; relating to canon law or the clergy

2.3. Preliminary

(1)Graph Neural Network (GNN)

        ①Present a graph as G=\left ( V,E \right ), where V=\{v_{1},v_{2},\cdots,v_{n}\} denotes the node set and n=\left | V \right | denotes the number of nodes. The feature vector of v_i is denoted x_i, and the node representation of v_i at the l-th layer is h_i^{\left ( l \right )}, with h_i^{\left ( 0 \right )}=x_i

        ②A generic GNN layer is represented as:

a_i^{(l)}=\text{AGGREGATE}^{(l)}\left(\left\{h_j^{(l-1)}:j\in\mathcal{N}(v_i)\right\}\right)

h_i^{(l)}=\text{COMBINE}^{(l)}\left(h_i^{(l-1)},a_i^{(l)}\right)

h_G=\text{READOUT}\left(\left\{h_i^{(L)}\mid v_i\in G\right\}\right)

where \mathcal{N}(v_i) denotes the set of neighbors (usually first-order, sometimes higher-order) of v_i
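
To make the AGGREGATE / COMBINE / READOUT pipeline concrete, here is a minimal PyTorch-style sketch of a mean-aggregation GNN layer (my own illustration, not code from the paper; the neighbor-list input format is an assumption):

import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    # AGGREGATE = mean over neighbors, COMBINE = linear layer over [h_i ; a_i]
    def __init__(self, dim):
        super().__init__()
        self.combine = nn.Linear(2 * dim, dim)

    def forward(self, h, neighbors):
        # h: (n, dim) node representations; neighbors[i] is the index list N(v_i)
        # (assumes every node has at least one neighbor)
        agg = torch.stack([h[idx].mean(dim=0) for idx in neighbors])   # a_i^{(l)}
        return torch.relu(self.combine(torch.cat([h, agg], dim=-1)))  # h_i^{(l)}

def readout(h):
    # graph-level representation h_G: here a simple mean over the final node representations
    return h.mean(dim=0)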

(2)Transformer

        ①Each layer in Transformer contains a self-attention module and a position-wise feed-forward network (FFN)

        ②The input of the self-attention module is {H}=\left[h_{1}^{\top},\cdots,h_{n}^{\top}\right]^{\top}\in\mathbb{R}^{n\times d}, where d is the hidden dimension and h_{i}\in\mathbb{R}^{1\times d} is the hidden representation at position i

        ③The attention mechanism is computed as:

\begin{aligned}Q&=HW_Q,W_{Q}\in\mathbb{R}^{d\times d_{K}} \quad K=HW_K,W_{K}\in\mathbb{R}^{d\times d_{K}} \quad V=HW_V,W_{V}\in\mathbb{R}^{d\times d_{V}} \\A&=\frac{QK^\top}{\sqrt{d_K}},\quad\mathrm{Attn}\left(H\right)=\mathrm{softmax}\left(A\right)V\end{aligned}

where A is a similarity matrix of queries and keys

        ④For illustration they use simple single-head self-attention and set d_K=d_V=d; the bias term in the multi-head attention is omitted
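
A minimal sketch of the single-head self-attention just described, with d_K=d_V=d and no bias terms (illustrative only, not the paper's implementation):

import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W_Q = nn.Linear(d, d, bias=False)
        self.W_K = nn.Linear(d, d, bias=False)
        self.W_V = nn.Linear(d, d, bias=False)

    def forward(self, H):
        # H: (n, d) hidden representations
        Q, K, V = self.W_Q(H), self.W_K(H), self.W_V(H)
        A = Q @ K.transpose(-2, -1) / (K.size(-1) ** 0.5)   # (n, n) similarity matrix
        return torch.softmax(A, dim=-1) @ V                 # Attn(H)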

2.4. Graphormer

2.4.1. Structural Encodings in Graphormer

        The overall framework of Graphormer contains three new encoding modules:

(1)Centrality Encoding

        ①For a directed graph, the centrality encoding added to the input is:

h_{i}^{(0)}=x_{i}+z_{\deg^{-}(v_{i})}^{-}+z_{\deg^{+}(v_{i})}^{+}

where z_{\deg^{-}(v_{i})}^{-}\in \mathbb{R}^d is a learnable embedding vector indexed by the indegree \deg^{-}(v_{i}), and z_{\deg^{+}(v_{i})}^{+}\in \mathbb{R}^d is a learnable embedding vector indexed by the outdegree \deg^{+}(v_{i}). (I initially had trouble picturing what z is: it is simply a row of a learnable embedding table looked up by the degree, analogous to a positional embedding looked up by position.)

        ②For an undirected graph, a single degree \deg(v_{i}) replaces \deg^{+}(v_{i}) and \deg^{-}(v_{i})
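
A possible way to implement the centrality encoding above is a pair of embedding tables looked up by in-degree and out-degree (a sketch under my own assumptions; max_degree is an assumed cutoff, not a value from the paper):

import torch
import torch.nn as nn

class CentralityEncoding(nn.Module):
    def __init__(self, d, max_degree=512):
        super().__init__()
        self.max_degree = max_degree
        self.z_in = nn.Embedding(max_degree + 1, d)    # z^-  indexed by in-degree
        self.z_out = nn.Embedding(max_degree + 1, d)   # z^+  indexed by out-degree

    def forward(self, x, in_deg, out_deg):
        # x: (n, d) node features; in_deg, out_deg: (n,) integer degrees
        in_deg = in_deg.clamp(max=self.max_degree)
        out_deg = out_deg.clamp(max=self.max_degree)
        return x + self.z_in(in_deg) + self.z_out(out_deg)   # h_i^{(0)}

For an undirected graph, a single table indexed by \deg(v_i) would replace the two.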

(2)Spatial Encoding

        ①A graph has no natural sequence order like a sentence. To this end, they propose a new spatial encoding to represent the spatial relation between v_i and v_j:

{\phi\left(v_{i},v_{j}\right):V\times V\rightarrow\mathbb{R}}

where \phi\left(v_{i},v_{j}\right) is chosen to be the shortest-path distance (SPD) between v_i and v_j if the two nodes are connected; if no path exists, the value is set to -1.

        ②⭐Each feasible output value of \phi is assigned a learnable scalar, which serves as a bias term in the self-attention (note to self: my understanding of attention is still shaky)

        ③The Q-K product matrix A is then computed as:

A_{ij}=\frac{(h_iW_Q)(h_jW_K)^T}{\sqrt{d}}+b_{\phi(v_i,v_j)}

where b_{\phi(v_i,v_j)} denotes a learnable scalar indexed by \phi(v_i,v_j) and shared across all layers

        ④Roughly, compared with conventional GNNs, the formula above gives every node a global receptive field over the entire graph. It is also adaptive: the authors note, for instance, that if b_{\phi(v_i,v_j)} is learned to be a decreasing function of \phi\left(v_{i},v_{j}\right), each node will pay more attention to the nodes near it.
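
A sketch of how the SPD-indexed bias could be added to the attention logits (illustrative; spd is assumed to be a precomputed (n, n) integer matrix of shortest-path distances with -1 for unreachable pairs, and max_dist is an assumed cutoff):

import torch
import torch.nn as nn

class SpatialBias(nn.Module):
    def __init__(self, max_dist=20):
        super().__init__()
        # one learnable scalar per feasible value of phi; slot 0 is reserved for "no path" (-1)
        self.b = nn.Embedding(max_dist + 2, 1)

    def forward(self, A, spd):
        # A: (n, n) scaled Q-K products; spd: (n, n) shortest-path distances, -1 if unreachable
        idx = (spd + 1).clamp(min=0, max=self.b.num_embeddings - 1)
        return A + self.b(idx).squeeze(-1)          # A_ij + b_{phi(v_i, v_j)}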

(3)Edge Encoding in the Attention

        ①In previous works, two traditional edge encoding methods are used: adding edge features into the corresponding node features, or using edge features together with the node features during aggregation. However, both are superficial and limited in that edge information only reaches the adjacent nodes and cannot contribute to the relation between arbitrary node pairs.

        ②A shortest path between a node pair can be written as SP_{ij}=\left ( e_1,e_2,...,e_N \right )

        ③They average, over the path, the dot product of the edge feature x_{e_{n}} of the n-th edge e_n and a learnable embedding weight w_{n}^{E}\in\mathbb{R}^{d_{E}} for the n-th position along the path:

c_{ij}=\frac{1}{N}\sum_{n=1}^{N}x_{e_{n}}(w_{n}^{E})^{T}

where d_E denotes the dimensionality of the edge features; c_{ij} is added to A_{ij} as a further bias term alongside b_{\phi(v_i,v_j)}
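
A sketch of the path-averaged edge encoding c_{ij} for one node pair (my own illustration; path_edge_feats stands for the stacked edge features x_{e_1},...,x_{e_N} along a chosen shortest path, and max_path_len is an assumed cap):

import torch
import torch.nn as nn

class EdgeEncoding(nn.Module):
    def __init__(self, d_E, max_path_len=20):
        super().__init__()
        # one learnable weight vector w_n^E per position n along the path
        self.w = nn.Parameter(torch.randn(max_path_len, d_E))

    def forward(self, path_edge_feats):
        # path_edge_feats: (N, d_E) edge features along SP_ij, with N <= max_path_len
        N = path_edge_feats.size(0)
        dots = (path_edge_feats * self.w[:N]).sum(dim=-1)   # x_{e_n} dot w_n^E for each n
        return dots.mean()                                  # c_ij, added to A_ij as a bias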

2.4.2. Implementation Details of Graphormer

(1)Graphormer Layer

        ①They move the layer normalization (LN) module from after the multi-head self-attention (MHA) and feed-forward (FFN) blocks to before them

        ②The input, output, and inner-layer dimensions of the FFN sub-layer are set to the same dimension d

        ③The functions of Graphormer Layer:

h^{\prime(l)} =\mathrm{MHA}(\mathrm{LN}(h^{(l-1)}))+h^{(l-1)}

h^{(l)} =\mathrm{FFN}(\mathrm{LN}(h^{\prime(l)}))+h^{\prime(l)}
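
A pre-LN layer in this style might look as follows (a sketch only; the spatial/edge attention biases are omitted, and nn.MultiheadAttention stands in for the biased attention of the paper):

import torch.nn as nn

class PreLNLayer(nn.Module):
    def __init__(self, d, n_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        # FFN inner dimension kept equal to d, as stated above
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, h):
        # h: (batch, n, d); LN is applied before MHA / FFN, with residual connections
        x = self.ln1(h)
        h = h + self.mha(x, x, x, need_weights=False)[0]
        h = h + self.ffn(self.ln2(h))
        return h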

(2)Special Node

        ①⭐They add a special node [VNode] that is connected to every other node. With this, the representation of [VNode] in the last layer can serve as the representation of the whole graph, analogous to the [CLS] token in BERT

        ②Why is the shortest path from [VNode] to every node equal to 1, and why are there no fractional values in the SPD? (Because [VNode] is directly connected to every node by a virtual edge, \phi([\mathrm{VNode}],v_j)=1 for every v_j; and SPD counts edges on an unweighted graph, so it is always an integer.)

        ③The connections between [VNode] and all other nodes are virtual rather than physical. To reflect this distinction, they reset b_{\phi([\mathrm{VNode}],v_j)} and b_{\phi(v_{i},[\mathrm{VNode}])} to distinct learnable scalars
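
A sketch of how a [VNode] could be prepended to the node set, with its own learnable embedding and a dedicated bias scalar for its virtual connections (names and shapes are my own assumptions, not the official code):

import torch
import torch.nn as nn

class VirtualNode(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.vnode = nn.Parameter(torch.zeros(1, d))    # learnable [VNode] embedding
        self.b_virtual = nn.Parameter(torch.zeros(1))   # distinct bias for virtual edges

    def forward(self, h, bias):
        # h: (n, d) node representations; bias: (n, n) attention bias for real node pairs
        h = torch.cat([self.vnode, h], dim=0)           # (n+1, d), [VNode] at index 0
        n1 = h.size(0)
        new_bias = h.new_zeros(n1, n1)
        new_bias[1:, 1:] = bias
        new_bias[0, :] = self.b_virtual                 # [VNode] -> every node
        new_bias[:, 0] = self.b_virtual                 # every node -> [VNode]
        return h, new_bias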

2.4.3. How Powerful is Graphormer?

(1)Fact 1. By choosing proper weights and distance function φ, the Graphormer layer can represent AGGREGATE and COMBINE steps of popular GNN models such as GIN, GCN, GraphSAGE

        ①The spatial encoding lets self-attention identify the neighbor set \mathcal{N}\left(v_{i}\right) of node v_i, so the softmax can compute mean statistics over \mathcal{N}\left(v_{i}\right). (What are the "mean statistics" here? They are means of the neighbors' representations, i.e., of node features, not of distances.)

        ②Knowing the degree of a node (which the centrality encoding supplies), a mean over neighbors can be translated into a sum over neighbors (multiply the mean by the degree)

        ③The representations of \mathcal{N}\left(v_{i}\right) and of v_i itself can be processed separately and then combined, so the COMBINE step can also be expressed (total wipeout, I did not really understand any of these three points at first)

        ④The expressive power of Graphormer therefore goes beyond message-passing GNNs and the 1-Weisfeiler-Lehman (WL) test

(2)Fact 2. By choosing proper weights, every node representation of the output of a Graphormer layer without additional encodings can represent MEAN READOUT functions

        ①Prior work adds a super node that aggregates information from the whole graph (the virtual-node heuristic). However, naively adding a super node during message passing may cause over-smoothing in the information propagation

        ②The global view that each node obtains from the self-attention mechanism allows Graphormer to simulate a graph-level READOUT operation

        ③Empirically, Graphormer does not suffer from the over-smoothing problem

2.5. Experiments

        They evaluate the model on PCQM4M-LSC with 3.8 M graphs, as well as ogbg-molhiv, ogbg-molpcba and ZINC

2.5.1. OGB Large-Scale Challenge

(1)Baselines

        ①They compare Graphormer with GCN, GIN, GCN-VN (with virtual node), GIN-VN, a multi-hop variant of GIN, and the 12-layer deep graph network DeeperGCN

        ②Transformer-based GNNs are compared as well

(2)Settings

        ①Two model sizes: Graphormer (L=12,d=768) and Graphormer_SMALL (L=6,d=512)

        ②Number of attention heads and dimensionality of edge features d_E: both set to 32 in the two models

        ③⭐Optimizer: AdamW (wow, a rare departure from the usual Adam)

        ④Hyper-parameter: \epsilon =10^{-8},\beta _1=0.99,\beta _2=0.999

        ⑤Peak learning rate: 2 \times 10^{-4} for Graphormer and 3 \times 10^{-4} for Graphormer_SMALL

        ⑥Warm-up stage: 60k steps, followed by linear decay of the learning rate
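
Expressed in code, the optimizer setup above would look roughly like this (a sketch; the model is a placeholder, and the weight-decay value is not given in this summary):

import torch

model = torch.nn.Linear(768, 768)   # placeholder standing in for Graphormer (L=12, d=768)
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=2e-4,              # peak LR (3e-4 for Graphormer_SMALL)
                              betas=(0.99, 0.999),
                              eps=1e-8)
# a 60k-step warm-up scheduler would then be attached (see the warm-up sketch in Section 3.3)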

(3)Results

        ①Comparison table:

where GT-Wide is GT with hidden dimension 768, while the original GT uses a hidden dimension of 64

2.5.2. Graph Representation

        ①They further explore the performance of Graphormer on OGBG-MolPCBA, OGBG-MolHIV and benchmarking-GNN (ZINC)

        ②They use different hyper-parameters in this section; details are given in the appendix.

        ③They also design a smaller model, Graphormer_SLIM (L=12,d=80,total\, param=489K), for ZINC

(1)Baselines

        ①For a fair comparison on OGB, they adopt the pre-trained GIN-VN after fine-tuning

(2)Settings

        ①Graphormer is more prone to overfitting when the model size is large and the dataset size is small. To this end, they apply FLAG, a data augmentation method, to mitigate this problem

(3)Results

        ①Comparison table on OGBG-MolPCBA:

        ②Comparison table on OGBG-MolHIV:

        ③Comparison table on ZINC:

2.5.3. Ablation Studies

(1)Node Relation Encoding

        ①They compared traditional positional encoding (PE) with their spatial encoding

(2)Centrality Encoding

        ①Degree-based centrality encoding brings a significant performance improvement

(3)Edge Encoding

        ①They denote their edge encoding as "via attn bias" and compare it with two conventional edge encoding methods, "via node" and "via Aggr"

        ②Comparison table:

2.6. Related Work

2.6.1. Graph Transformer

        ①One line of graph Transformer (GT) work modifies several parts of the Transformer layer, pre-trains the model on 10 million unlabelled molecules, and fine-tunes it on down-stream classification tasks

        ②Other variants of GT are also introduced

2.6.2. Structural Encodings in GNNs

(1)Path and Distance in GNNs

        ①(It is hard to take notes when the paper only says "[9] proposed a method" without naming the model.) Paraphrasing the paper: similar to [9], [56] utilizes path-based attention to model the influence between a center node and its higher-order neighbors; [59] proposes a distance-weighted aggregation scheme on graphs; and [32] has proved that adopting distance encoding (i.e., using a one-hot feature of the distance as an extra node attribute) yields expressive power stronger than the 1-WL test.

(2)Positional Encoding in Transformer on Graph

        ①The paper reviews models that introduce positional encoding techniques for Transformers on graphs

(3)Edge Feature

        ①The paper reviews models with dedicated edge feature extraction methods

2.7. Conclusion

        Combining the three new encodings, Graphormer achieves excellent performance across competitions. However, the "quadratic complexity of the self-attention module restricts Graphormer’s application on large graphs"

2.8. Proofs

2.8.1. SPD can Be Used to Improve WL-Test

2.8.2. Proof of Fact 1

2.8.3. Proof of Fact 2

2.9. Experiment Details

2.9.1. Details of Datasets

2.9.2. Details of Training Strategies

(1)PCQM4M-LSC

(2)OGBG-MolPCBA

(3)OGBG-MolHIV

(4)ZINC

2.9.3. Details of Hyper-parameters for Baseline Methods

(1)PCQM4M-LSC

(2)OGBG-MolPCBA

(3)OGBG-MolHIV

2.10. More Experiments

2.11. Discussion & Future Work

3. Supplementary Knowledge

3.1. Permutation invariance

Reference 1: "Deep Sets: a general network architecture satisfying permutation invariance and permutation equivariance" - 知乎 (zhihu.com)

3.2. Super node

From ERNIE Bot (文心):

In graph theory, a super node (or super vertex) is a node with an extremely large number of incident edges, which may be outgoing edges (from the node to others) or incoming edges (pointing to the node). Because of their high connectivity, such nodes often play an important role in the network. The super-node phenomenon is especially common in networks with a power-law degree distribution, where a few nodes hold a large number of connections while most nodes have only a few. For example, in a social network, a celebrity account with a huge number of followers can be regarded as a super node, since it is connected to a very large number of other accounts.

3.3. Warm-up stage

From ERNIE Bot (文心):

The warm-up stage in deep learning, also called learning-rate warm-up, is a strategy used at the beginning of training. Its main purpose is to start with a small learning rate and gradually increase it to the preset value, so that the model can adapt to training and stabilize.

In deep learning, model weights are randomly initialized. Using a large learning rate right from the start can make the model unstable or even oscillate. Introducing a warm-up stage means the first few epochs or steps are trained with a small learning rate, giving the model a chance to adapt and stabilize; by the end of the warm-up stage the learning rate has been gradually raised to the preset value, which speeds up convergence and improves the final result.

The length of the warm-up stage can be tuned for the specific task and dataset. For instance, in the ResNet paper the warm-up is set to the first 5 epochs, each containing 15,000 steps. The schedule itself can also vary, e.g., linear warm-up or cosine warm-up.

Note that although warm-up improves stability and training quality, a low learning rate also slows training down, so in practice the warm-up length and the learning-rate range need to be balanced.
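
A common linear warm-up followed by linear decay can be written with a LambdaLR scheduler (a generic sketch, not the exact schedule of any particular paper; the step counts are placeholders):

import torch

model = torch.nn.Linear(16, 16)                             # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # lr here is the peak learning rate
warmup_steps, total_steps = 1000, 100000

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                  # ramp up linearly to the peak LR
    # then decay linearly back towards zero
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# call scheduler.step() once per optimization step during training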

3.4. Quadratic complexity

From ERNIE Bot (文心):

In the self-attention mechanism, "quadratic complexity" means that both the time and the memory cost of computing self-attention scale with the square of the input sequence length.

The core of self-attention is computing, for each position in the input sequence, a weighted sum over the representations of all positions. This is done with queries, keys and values, which are linear transformations of the input: for every position, a query, a key and a value vector are produced; the query of each position is dotted with the keys of all positions, a softmax turns these scores into weights, and the weights are used to sum the value vectors to obtain the output representation of that position.

Because every position's query is dotted with every position's key, both time and memory complexity are quadratic: for an input of length n, they are O(n^2). This becomes a problem for long sequences, since compute and memory grow rapidly with length.

To address this, various works reduce the cost of self-attention, for example with approximation algorithms, local attention, or sparse attention, aiming to cut compute and memory while preserving model quality.
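
The quadratic growth is easy to see from the shape of the attention score matrix (a toy illustration):

import torch

n, d = 4096, 64
Q, K = torch.randn(n, d), torch.randn(n, d)
scores = Q @ K.T                       # (n, n): time and memory grow as O(n^2)
print(scores.shape, scores.numel())    # torch.Size([4096, 4096]) 16777216
# doubling n to 8192 quadruples the number of score entries (~67M)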

4. Reference List

Ying C. et al. (2021) 'Do Transformers Really Perform Bad for Graph Representation?', NeurIPS 2021. doi: https://doi.org/10.48550/arXiv.2106.05234
