Evaluation Metrics for Automatic Summarization: Rouge-1, Rouge-2, Rouge-L, Rouge-S


Preface

About Rouge

Rouge-1, Rouge-2, Rouge-N

Rouge-L

Rouge-W: an improved Rouge-L

Rouge-S

Multiple reference summaries

Preface

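I have been reading papers on automatic summarization lately. I already knew a little about Rouge evaluation, so to understand how it works I looked up some material and wrote this brief summary.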

About Rouge

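Rouge (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating automatic summarization and machine translation. It compares an automatically generated summary or translation against a set of reference summaries (usually written by humans) and produces a score that measures the "similarity" between the generated text and the references.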

Rouge-1, Rouge-2, Rouge-N

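Paper [3] defines Rouge-N as follows:

$Rouge\text{-}N = \frac{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count(gram_n)}$

The denominator is the number of n-grams in the reference summaries, and the numerator is the number of n-grams shared by the reference summary and the candidate summary. An example borrowed from article [2] illustrates this:
Candidate summary Y (typically machine-generated):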

the cat was found under the bed

Reference summary X1 (the gold standard, written by a human):

the cat was under the bed

The 1-grams and 2-grams of the two summaries are listed below; higher-order N-grams follow the same pattern:

#	candidate 1-gram	reference 1-gram	candidate 2-gram	reference 2-gram
1	the	the	the cat	the cat
2	cat	cat	cat was	cat was
3	was	was	was found	was under
4	found	under	found under	under the
5	under	the	under the	the bed
6	the	bed	the bed	-
7	bed	-	-	-
count	7	6	6	5

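$Rouge_1(X1,Y) = \frac{6}{6} = 1.0$. The numerator is the number of 1-grams that appear in both the candidate summary and the reference summary, and the denominator is the number of 1-grams in the reference summary. (The denominator could instead be the 1-gram count of the candidate summary, which would give precision. Between precision and recall, however, recall is what we care about more, and it also matches the Rouge-N definition above.)
Similarly, $Rouge_2(X1,Y) = \frac{4}{5} = 0.8$: four of the reference's five 2-grams (the cat, cat was, under the, the bed) also appear in the candidate summary.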
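
The worked example can be reproduced with a few lines of Python. This is only a minimal sketch of the recall-oriented Rouge-N for a single reference: the function names `ngrams` and `rouge_n_recall` are made up for illustration, tokenization is plain whitespace splitting, and repeated n-grams are matched with clipped counts (an n-gram counts as shared at most as often as it occurs in both texts).

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(reference, candidate, n):
    """Recall-oriented Rouge-N for one reference (illustrative helper)."""
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref_counts & cand_counts).values())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


X1 = "the cat was under the bed"        # reference summary
Y = "the cat was found under the bed"   # candidate summary

print(rouge_n_recall(X1, Y, 1))  # 6 shared 1-grams out of 6
print(rouge_n_recall(X1, Y, 2))  # 4 shared 2-grams out of 5
```

Dividing by the candidate's n-gram count instead of the reference's would turn the same overlap into a precision score.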

Rouge-L

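The L stands for LCS (longest common subsequence), because Rouge-L is built on the longest common subsequence. It is computed as follows:

$R_{LCS} = \frac{LCS(X,Y)}{m}$

$P_{LCS} = \frac{LCS(X,Y)}{n}$

$F_{LCS} = \frac{(1+\beta^{2}) R_{LCS} P_{LCS}}{R_{LCS} + \beta^{2} P_{LCS}}$

Here $LCS(X,Y)$ is the length of the longest common subsequence of X and Y, and m and n are the lengths of the reference summary and the candidate summary respectively (usually the number of words).
$R_{LCS}$ and $P_{LCS}$ are recall and precision, and the final $F_{LCS}$ is what we call Rouge-L. In DUC, $\beta$ is set to a very large number, so $Rouge_L$ effectively considers only $R_{LCS}$, which is consistent with the recall-only convention described above.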
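
To make the roles of recall, precision and $\beta$ concrete, here is a rough sketch of a sentence-level Rouge-L. The names `lcs_length` and `rouge_l` are hypothetical, tokenization is whitespace splitting, and the default `beta=1.0` is an arbitrary choice for the demo, not a value taken from the paper.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, 1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """Return (recall, precision, F) based on the LCS of the two texts."""
    x, y = reference.split(), candidate.split()
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall = lcs / len(x)       # m = length of the reference
    precision = lcs / len(y)    # n = length of the candidate
    f = (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)
    return recall, precision, f


X1 = "the cat was under the bed"
Y = "the cat was found under the bed"
print(rouge_l(X1, Y))            # balanced F-measure
print(rouge_l(X1, Y, beta=100))  # large beta: F moves towards recall
```

With the two example sentences the LCS is the whole reference (6 words), so recall is 1 and precision is 6/7. Raising `beta` pushes the F value towards the recall.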

Rouge-W: an improved Rouge-L

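Paper [3] points out a problem with Rouge-L:

[Figure: a reference summary $X$ and two candidate summaries $Y_{1}$, $Y_{2}$ with the same LCS length]

In the figure, $X$ is the reference summary and $Y_{1}, Y_{2}$ are two candidate summaries. $Y_{1}$ is clearly better than $Y_{2}$, because its match with the reference summary $X$ is consecutive, yet $Rouge_L(X,Y_{1})=Rouge_L(X,Y_{2})$. To address this, the paper's author proposes an improved scheme, the weighted longest common subsequence (WLCS). For the details of Rouge-W, see paper [3].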
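
The tie, and how a weighting breaks it, can be illustrated with a small dynamic program. This is a sketch based on my reading of the weighted LCS idea, not a reference implementation: the function name `wlcs`, the weighting $f(k)=k^{2}$, and the letter sequences are all assumptions chosen for the demo, and the normalization into a final Rouge-W score is left out. Passing $f(k)=k$ reduces the same routine to the plain LCS length.

```python
def wlcs(x, y, f=lambda k: k * k):
    """Weighted LCS score: consecutive matches of length k are rewarded by f(k)."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]  # best weighted score so far
    w = [[0] * (n + 1) for _ in range(m + 1)]  # length of the current consecutive run
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]
                # Extending a run of length k to k + 1 adds f(k + 1) - f(k).
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            else:
                # No match: carry the better neighbour, the run length stays 0.
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[m][n]


# Illustrative sequences: Y1 matches X consecutively, Y2 only in scattered positions.
X = list("ABCDEFG")
Y1 = list("ABCDHIK")
Y2 = list("AHBKCID")

print(wlcs(X, Y1, f=lambda k: k), wlcs(X, Y2, f=lambda k: k))  # plain LCS: a tie
print(wlcs(X, Y1), wlcs(X, Y2))  # weighted: the consecutive match scores higher
```

Any weighting that grows faster than linearly rewards consecutive matches in the same way; $k^{2}$ is just a convenient choice here.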

Rouge-S

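Rouge-S uses skip-grams: when the reference summary is matched against the candidate summary, the words of a gram do not have to be adjacent, and some words may be "skipped". With skip-bigrams, for example, the variant described here allows at most two words to be skipped when the grams are generated. The skip-bigrams of "cat in the hat" are therefore "cat in, cat the, cat hat, in the, in hat, the hat".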
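
Generating skip-bigrams is easy to sketch in code. The helper name `skip_bigrams` and its `max_skip` parameter are hypothetical: `max_skip=None` places no limit on the gap, while a number caps how many words may sit between the two members of a pair.

```python
def skip_bigrams(tokens, max_skip=None):
    """Return all in-order word pairs, optionally limiting the words skipped."""
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            # j - i - 1 is the number of words skipped between the two tokens.
            if max_skip is not None and j - i - 1 > max_skip:
                break
            pairs.append((tokens[i], tokens[j]))
    return pairs


tokens = "cat in the hat".split()
print(skip_bigrams(tokens, max_skip=2))  # the six pairs listed in the text
print(skip_bigrams(tokens, max_skip=0))  # no skipping: ordinary bigrams
```

A Rouge-S score can then be computed as in the n-gram case, by counting the skip-bigrams shared between reference and candidate and dividing by the number of skip-bigrams in the reference (recall) or in the candidate (precision).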

Multiple reference summaries

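One person's summary of a document is not necessarily accurate, so standard datasets usually provide several reference summaries per document (the DUC dataset has 4). The paper's author also proposes a way to handle multiple references:

$Rouge\text{-}N_{multi} = \arg\max_{i} Rouge\text{-}N(r_{i}, s)$

where $s$ is the candidate summary and $r_{i}$ is the i-th reference summary, so the final score is the highest of the pairwise scores. The paper describes the procedure in detail as follows: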

This procedure is also applied to computation of ROUGE-L (Section 3), ROUGE-W (Section 4), and ROUGE-S (Section 5). In the implementation, we use a Jackknifing procedure. Given M references, we compute the best score over M sets of M-1 references. The final ROUGE-N score is the average of the M ROUGE-N scores using different M-1 references.

My understanding is that the M reference summaries $R= \left \{ r_{1},r_{2},r_{3},...,r_{M-1},r_{M} \right \}$ yield M sets

$R_{i} = R- \left \{ r_{i} \right \} , i=1,2,...,M$

Then the highest score within each set $R_{i}$ is computed, where $Y$ is the candidate summary:
$maxscore_{i} = \max_{r_{j} \in R_{i}} Rouge_N(r_{j} ,Y)$

Finally,

$Rouge_{score} = \frac{1}{M} \sum_{i=1}^{M} maxscore_{i}$
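
The leave-one-out averaging above can be written down directly. The sketch below follows my interpretation in this section rather than the reference implementation. The function name `jackknife_score` is hypothetical, and the input is assumed to be the list of pairwise scores between the candidate and each of the M references, already computed with whatever Rouge variant is in use.

```python
def jackknife_score(pairwise_scores):
    """Average, over the M leave-one-out reference sets, of the best pairwise score."""
    m = len(pairwise_scores)
    if m < 2:
        raise ValueError("jackknifing needs at least two references")
    best_per_subset = [
        # Best score among the M - 1 references that remain after dropping reference i.
        max(score for j, score in enumerate(pairwise_scores) if j != i)
        for i in range(m)
    ]
    return sum(best_per_subset) / m


# Made-up pairwise scores of one candidate against three references.
scores = [0.5, 0.8, 0.2]
print(jackknife_score(scores))  # average of max(0.8, 0.2), max(0.5, 0.2), max(0.5, 0.8)
```

In this toy input the best reference is dropped in exactly one of the three subsets, so the result (about 0.7) is slightly lower than the plain maximum of 0.8.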

References:

[1] https://en.wikipedia.org/wiki/ROUGE_(metric)
[2] What is ROUGE and how it works for evaluation of summaries?
[3] ROUGE: A Package for Automatic Evaluation of Summaries
