【论文研读】Better Together:Unifying Datalog and Equality Saturation

本文主要是介绍【论文研读】Better Together:Unifying Datalog and Equality Saturation，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

最近研究ReassociatePass整的头大，翻两篇Datalog的论文看看。
今天看的一篇是比较新的文章，23年4月贴到arxiv上的。
本文的主要贡献是提出了egglog,将Datalog和Eqsat结合起来，继承了Datalog的efficient incremental execution, cooperating analysis and lattice

Introduction部分

Datalog主要研究方向：database community practitioners use modern implementations to build program analysis。
附了三篇参考文献：
George Balatsouras and Yannis Smaragdakis. 2016. Structure-Sensitive Points-To Analysis for C and C++. In Static Analysis

23rd International Symposium, SAS 2016, Edinburgh, UK, September 8-10, 2016, Proceedings (Lecture Notes in Computer
Science, Vol. 9837), Xavier Rival (Ed.). Springer, 84–104. https://doi.org/10.1007/978-3-662-53413-7_5

Langston Barrett and Scott Moore. 2022. cclyzer++: Scalable and Precise Pointer Analysis for LLVM. https://galois.com/
blog/2022/08/cclyzer-scalable-and-precise-pointer-analysis-for-llvm/.

Yannis Smaragdakis and Martin Bravenboer. 2010. Using Datalog for Fast and Easy Program Analysis. In Proceedings of
the First International Conference on Datalog Reloaded (Oxford, UK) (Datalog’10). Springer-Verlag, Berlin, Heidelberg,
245–251. https://doi.org/10.1007/978-3-642-24206-9_14

Herbie [Panchekha et al. 2015], a tool that uses EqSat to optimize
floating-point accuracy, relies on unsound rewrites because it lacks the analyses to prove that
certain rewrites are safe (e.g. 𝑥/𝑥 → 1 only if 𝑥 ≠ 0). To combat the unsoundness, Herbie must
validate the results of EqSat and discard them if unsoundness was detected.
上文似乎解决了浮点数的sound问题，以后可以看。

On the Datalog side,
cclyzer++ [Barrett and Moore 2022], a recent points-to analysis system implemented in Datalog
that supports Steensgaard analyses [Steensgaard 1996] for LLVM [Lattner and Adve 2004] resorted
to an ad-hoc implementation of union-find, because the provided implementation of equivalence
relations was too slow.
LLVM中的Steensgaard analyses。

上述分析的目的：Datalog和Eqsat两边各有问题。
Our key insight is that the efficient equational reasoning of EqSat and the rich, composable semantic analyses of Datalog make up for each other’s weaknesses, and unifying the two paradigms brings together—and goes beyond—the best of both worlds.

In fact, spontaneous developments in both communities have already begun converging towards each other: Datalog tools have added efficient
equivalence relations [Nappa et al. 2019], lattices [Madsen et al. 2016; Sahebolamri et al. 2022], and some support for datatypes [Developers [n.d.]], while the EqSat community has recently developed support for conditional rewriting, lattice-based analyses [Cheli 2021; Willsey et al. 2021], and relational pattern matching [Zhang et al. 2022]. We bring this trend to completion and close the gap between EqSat and Datalog.
大把文献，看到头秃。上述论文都是在某个系统上面做扩充，问题在于这些扩充能否解决本文所解决的问题？效果如何？

本文的评估办法：

用egglog实现Steensgaard-style points-to analyses faster and easier to write and compare to Souffle with 4.96x
egglog 能够实现一个更新的更完善的Herbie的EqSat过程。在给定的错误容忍度下能够实现更快的给出结果（未给出倍率）

BackGround

egglog扩充了Datalog以纳入EqSat,我想这就是作者为什么将其命名为egglog而不是eggSat。

Datalog

Each rule is a conjunctive query of the form 𝑄(x) :- 𝑅1(x1), 𝑅2(x2), . . . , 𝑅𝑛(x𝑛) where each x and x𝑖 is a tuple of variables or constants. The atom 𝑄(x) is called the head of the rule, and the atoms 𝑅𝑖 (x𝑖) comprise the body.
定义了连接查询和查询头、查询体。
在这里插入图片描述
介绍Datalog时给了一个很经典的图，这个图和Souffle官方教程中的path很类似，用过Souffle的很熟悉。
定义了ICO（Immediate consequence operator)函数，其实就是一个推导的过程。这个过程其实就是找fixpoint的过程。

extend Datalog with lattice

在这里插入图片描述
把Datalog扩充到支持格的水平，这三篇论文也可以以后阅读。
实质就是将原来的针对变量或常量的函数扩充到针对格映射的函数。

此处介绍了EqSat的优势，也即能够解决上述的因为rewrite造成优化机会丢失的例子。

EqSat介绍

EqSat本质上是将Rewrite和history同时保存然后获得二者的好处。
文章还介绍了一些EqSat的扩展，例如，支持语义分析以增加Soundness。
Multi-pattern，看起来是将pattern的过程并行了，有两篇文章是这方面的工作，但还没有达到更优。
提高e-matching的效率，最近的技术是relational e-matching。
论文：
Yihong Zhang, Yisu Remy Wang, Max Willsey, and Zachary Tatlock. 2022. Relational E-Matching. Proc. ACM Program.
Lang. 6, POPL, Article 35 (jan 2022), 22 pages. https://doi.org/10.1145/3498696
这个技术的问题在于会出现dual representation问题，也即引擎会在e-graph和相关的database representation之间往复。
（需要在相关论文中看这个问题的例子）

Egglog

在这里插入图片描述
egglog看起来好抽象，尤其是括号的使用，看起来是使用了前缀表达式的形式，运算符出现在括号的第一个元素，操作数出现在括号的第二个和第三个元素。
egglog还引用了function。

从前缀表达式角度理解上述代码就不会被复杂的嵌套括号困住。
作者解释了:merge的作用，也即处理格的交会功能，例如此处从1到3有两个路径，一种是1直接到3的边30，一种是1到2到3的路径，长度为20，所以需要定义两种情况的取舍，也即join operator。

上述一段代码，无法直接在egglog中运行，因为支持的指令需要改为（run 1) repeat指运行次数，可以多次验证是否匹配。

Sorts and Equality

unification操作，看起来就是将两个点统一成一个点，然后进行分析。
opaque integer values，这里的opaque很有意思，有晦涩的意思，其实相等变量的识别一直不是显而易见的（不同于常量识别）。

Semantic

Syntax

此部分介绍egglog core，不包含那些例如：merge的语法糖。
· 在这里插入图片描述
结合论文中的文字看这个语法还是比较简单的。
uninterpreted constant 和 interpreted constant的区别：
These uninterpreted constants play a similar role as e-class ids in EqSat or labelled nulls in the chase from the database
literature.
在EqSat中，e-class id（等价类标识符）用于标识相同的表达式或值的等价类。
此处uninterpreted constant看起来像类似于值编号中的编号。

Semantics

首先给出模式的定义：
A schema 𝑆 is a collection of function symbols and their function signatures, where the types range over
{𝑁 , 𝐶}.
其中N和C分别是uninterpreted constant和interpreted constant的集合。S其实是表示某一类符合同一模式的函数的集合。
在这里插入图片描述 S的一个元素定义为I，其中既包含一个函数集合，又包含一个等价关系。
此处定义了另一种judgement符号表示某个atom在一个database中。
理解了一系列定义之后，上面三个推理过程就显而易见了，首先要明确的是对于两种符号，带箭头的符号优先级比不带箭头的符号高。
第一个：对于每个i都满足ti和vi的函数关系，且这些函数关系同属于一个database，那么将DB中的函数参数从vi换成ti依旧能满足上述关系。
第二个：对于任何一个Database，都存在变量到自身的函数。
第三个：对于任何函数都能退出v，那么该函数本身也是成立的。

定义了pre-instance和valid instance以及转换函数rebuilding operator。
在这里插入图片描述根据上述函数的定义给出了semi-naive evaluation algorithm.

IMPLEMENTATION

用Rust实现的。。。不是很熟悉。

Components

egglog’s rebuilding procedure is based on the rebuilding procedure from
egg [Willsey et al. 2021], which is in turn based on the congruence closure algorithm from Downey
et al. [1980].
上述两篇对理解Rebuilding operator会有帮助。
在这里插入图片描述结合此处的例子更好理解Rebuilding operator的作用，首先有一个函数f对两个不同的输入有不同的输出，现在将a和c进行union操作，会出现对a的输出同时为b和d，这就出现了冲突（类似于常量折叠中出现了某个变量可能对应两个常量）。因此需要进行merge，获得统一的结果，但这种merge又可能会造成更多的union,最终程序要一直运行到固定点。
问题在于？如何确保程序一定能够结束。

CASE STUDIES

[Balatsouras and Smaragdakis 2016; Bravenboer and Smaragdakis
2009; Whaley et al. 2005]
一些程序分析工具。
文中将指针分析工具分为两类，一类是基于subset的例如Anderson类型的，一类是基于unification的，例如Steensgaard类型的，前者精确但是二次复杂度的，后者是线性复杂度的但不够精确。