缺陷定位论文阅读：[Dongsun Kim] [TSE在投] DC: A Divide-and-Conquer Approach to IR-based Bug Localization

本文主要是介绍缺陷定位论文阅读：[Dongsun Kim] [TSE在投] DC: A Divide-and-Conquer Approach to IR-based Bug Localization，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

文章目录

- 前言
- 0 阅读方案
- 1. D&C: A Divide-and-Conquer Approach to IR-based Bug Localization
- - 1.1 基本信息
  - 1.2 文章内容
  - 1.3 几个QA
  - 1.4 感想

前言

每天都得阅读一定数量的论文。
在此阅读：
1）D&C: A Divide-and-Conquer Approach to IR-based Bug Localization
2）算了，还是尽量一篇博客一篇论文阅读把。

0 阅读方案

因为很多，论文很多，所以只能求快，不能求每一篇都精度，不可能的。
所以见机行事，不要“恋战”。

1. D&C: A Divide-and-Conquer Approach to IR-based Bug Localization

1.1 基本信息

下载地址：在arxiv上就能下。

目测这篇文章是要投TSE期刊的，先发在了arxiv上。

作者有：Dongsun Kim 这位也是学术大牛了。
其主页：http://www.darkrsw.net/

在这里插入图片描述

在卢森堡大学。

1.2 文章内容

先介绍IR技术，引出FL缺陷定位：

Many automated tasks in software maintenance rely on information retrieval (IR) techniques to identify specific information within unstructured data. Bug localization is such a typical task, where text in a bug report is analyzed to identify file locations in the source code that can be associated to the reported bug.

指出问题：

Unfortunately, despite the promising results reported in the literature, the performance offered by IR-based bug localization tools is still not significant for large adoption.

给出自己认为的原因：

We argue that one reason could be the attempt by the community to build a “one-size-fits-all” approach for bug localization, without fully addressing the differences of available information that may exist among the bug reports and across the project source code files.

自己的工作：

In this paper, we first extensively study the performance of state-of-the-art bug localization tools, specifically focusing on investigating the query formulation (i.e., which bug report features should be compared against which features of source code files) and its importance with respect to the localization performance.

工作2：

Building on insights from this study, we propose a new learning approach where multiple classifier models are trained on clear-cut sets of bug-location pairs. Concretely, we apply a gradient boosting supervised learning approach to various sets of bug reports whose localizations appear to be successful with specific types of features.

工作3：（工具）

The training scenario builds on our findings that the various state-of-the-art localization tools (hence the associated similarity features that they leverage) can be highly performant for specific sets of bug reports. We implement D&C, a multi-classifier approach, which computes appropriate weights that should be assigned to the similarity measurements between pairs of information token types (the bug report and source code).

大意是：（我的理解、翻译）
现在的IR技术被广泛使用在软件维护领域中，来挖掘特定信息。缺陷定位就是这样的典型方向，IR可以通过分析bug reports来确定可能和reported bug相关联的文件地点（file location）。
但是呢，IR的缺陷定位技术并没有广泛应用。我们认为其原因在于：整个community的人都想做一个one-size-fits-all的缺陷定位技术，却根本没有强调bug reports之间存在的区别，和各个project source code files之间存在的区别。

所以呢，本文先广泛研究当前最先进的（state-of-the-art）缺陷定位工具，专门关注调查query formulation以及其在定位性能上的重要性。在empirical study基础上，我们开发了新的learning approach，即从明确分割的bug-location pairs中训练处多个分类模型。具体的，我们对各种bug reports集合（其定位在特定类型的feature下能够成功）应用了梯度增加监督学习方法。这个训练场景是基于我们的finding的（即：各种定位工具对特定集合的bug reports是有效的）。我们实现了D&C，一个多分类方法，来计算应该给（pairs of information token types之间）相似度度量分配的合适的权重。

1.3 几个QA

1）问：都调研了什么定位工具？dataset又是什么？
答：
在这里插入图片描述

在这里插入图片描述

在这里插入图片描述
2）问：是不是一个当前IR定位工具的combination？
答：如下图，差不多是这个意思，但是不全是，还是有点出入的，具体涉及 IR中的feature，similarity这类术语，我就不多看了。

1.4 感想

1）开始迷茫，很多东西确实被做了，但是也有很多东西确实能做，总而言之，还是有点迷茫的。
自己之前想到的，别人已经想到并且实现了。
还是得积累想法。
2）写作套路还是很固定的，什么unfortunately啦，什么despite啦，都是千篇一律罢了。
3）行行出状元，我认为想法还是很容易想出来的，SFL领域有SFL领域的挑战，这个IR FL 领域也有其自己的挑战，只要钻进去了，还是有机会的。
4）这个想法和我们当时想的特殊情况特殊修复，针对性修复的方法是有共性的。但是我在实现上很有困难。
5）自己想点还是太难了，很多技术我根本没接触，也没应用过，这对我来说很难想出解决方案：（但是未必不行。这就是人之矛盾）

Building on insights from this study, we propose a new learning approach where multiple classifier models are trained on clear-cut sets of bug-location pairs. Concretely, we apply a gradient boosting supervised learning approach to various sets of bug reports whose localizations appear to be successful with specific types of features.

6）又是combination，感觉挺难受的。
7）我一开始以为这篇论文非常不稳，工作量确实很足，但是总感觉创新上还差了点，但是看了文章中的参考文献：

[19] Lee, J., Kim, D., Bissyande, T.F., Jung, W., Le Traon, Y.: Bench4bl: ´ reproducibility study on the performance of IR-based bug localization. In: Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 61–72. ACM (2018)

一脉相承。一直在做的，而且18年还发了ISSTA。这，我感觉作者应该是很有把握了。

这篇关于缺陷定位论文阅读：[Dongsun Kim] [TSE在投] DC: A Divide-and-Conquer Approach to IR-based Bug Localization的文章就介绍到这儿，希望我们推荐的文章对编程师们有所帮助！