Similar Bug Identification Method Based on Bug Reports and Source Code

Abstract

In recent years, with the rapid development of software technology, both software systems and development teams have grown rapidly in scale. As a result, the number of bug reports awaiting repair has multiplied, and it is difficult for developers to fully understand every code file in a whole software system. Improving the efficiency of bug report assignment, and locating the code files to be modified so that bugs can be repaired more efficiently, is therefore a problem that urgently needs to be solved.

Most current methods for matching bug reports to code files directly compute a relevance score from the degree of match between the bug report and each code file, rank all code files by that score, and recommend the top-ranked code files for the bug report to be repaired. However, for a new bug report, similar bug reports can be found in the historical bug report collection, the code files modified for those reports can be collected, and the code files similar to them can then be taken as candidate files to fix for the new bug. Therefore, building on current methods for matching bug reports with source code, practicality can be improved by using similar bug report identification and similar code file identification.
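The workflow described above can be illustrated with a minimal sketch. It is not the method of this thesis: it assumes a plain bag-of-words cosine similarity in place of the similarity measures developed in later chapters, and the names `candidate_files`, `history`, and `similar_files` are hypothetical. It also assumes that the files fixed for a similar historical report are themselves kept as candidates.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a report as lowercase token counts (assumed tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_files(new_report, history, similar_files, top_k=1):
    """Return candidate files for a new report.

    history: list of (report_text, [files fixed for that report])
    similar_files: dict mapping a file to files judged similar to it
    """
    query = bag_of_words(new_report)
    # Step 1: rank historical reports by similarity to the new report.
    ranked = sorted(
        history,
        key=lambda item: cosine(query, bag_of_words(item[0])),
        reverse=True,
    )
    # Steps 2 and 3: collect their fixed files, then files similar to those.
    candidates = []
    for _, fixed in ranked[:top_k]:
        for path in fixed:
            for candidate in [path] + similar_files.get(path, []):
                if candidate not in candidates:
                    candidates.append(candidate)
    return candidates
```

In practice the similarity function and the similar-file lookup would be replaced by the report-level and code-level methods summarized in the contributions below.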

Specifically, the main contributions of this paper are as follows:

First, this paper proposes a similar bug report identification method based on verb phrases and a topic model. In addition to representing textual information with the commonly used vector space model, the method applies syntactic analysis to extract structured information from bug reports, which serves as an important basis for measuring the similarity between reports.

Secondly, similar bug report identification is used to improve the existing method for locating the code files to be fixed for a bug report. Compared with existing research, this method takes into account the guidance that the historical bug report set provides for repairing the current bug, and it improves the recall of existing methods, thereby improving the efficiency of software repair.

Finally, the feature vector construction of existing similar code detection methods is improved. The influence of program capacity on the experimental results is taken into account to improve the accuracy of existing methods, and the hierarchical structure of the abstract syntax tree is combined with a locality-sensitive hashing algorithm to reduce computational complexity. In addition, the applicability of the method to different high-level programming languages is verified.
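As a rough illustration of the third contribution, the sketch below builds a feature vector from abstract syntax tree node-type counts and hashes it with random-hyperplane locality-sensitive hashing. Several things here are assumptions rather than the thesis's design: Python's built-in `ast` module stands in for whatever parser the thesis uses, node-type counts stand in for its feature construction, the random-hyperplane scheme is one common LSH family chosen for illustration, and the function names `ast_vector` and `lsh_signature` are hypothetical. The handling of program capacity and of the tree's hierarchical structure is not modeled.

```python
import ast
import random
from collections import Counter


def ast_vector(source):
    """Count AST node types in a Python source string."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def lsh_signature(vector, bits=16):
    """Random-hyperplane LSH: one sign bit per hyperplane.

    Each hyperplane coefficient is drawn from a generator seeded with the
    bit index and node-type name, so the same hyperplanes are reused for
    every vector without fixing a vocabulary in advance.
    """
    signature = []
    for i in range(bits):
        projection = sum(
            count * random.Random(f"{i}:{name}").gauss(0.0, 1.0)
            for name, count in vector.items()
        )
        signature.append(1 if projection >= 0 else 0)
    return tuple(signature)
```

Code fragments that differ only in identifier names yield the same node-type counts and therefore the same signature, so candidate pairs can be found by grouping fragments by signature instead of comparing every pair of vectors.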

Keywords: similar bug, bug report, syntactic analysis, verb phrase, abstract syntax tree

Abstract (Chinese)

Abstract (English)

Chapter 1 Introduction

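1.1 Source of the Topic

1.2 Research Background and Significance

1.3 Research Status at Home and Abroad

1.3.1 Research Status of Similar Bug Report Identification Methods

1.3.2 Research Status of Similar Code Identification Methods

1.3.3 Research Status of Methods for Locating Code Files for Bug Reports

1.4 Main Research Content and Chapter Organization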

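Chapter 2 Similar Bug Report Identification Method Based on Verb Phrases and a Topic Model

2.1 Introduction

2.2 Overall Approach

2.3 Whitelist-Based Feature Vector Construction

2.4 Feature Vector Construction Based on Structured Information

2.4.1 Syntactic Analysis

2.4.2 Heuristic Filtering Rules

2.4.3 Automatic Extraction of Domain Terms

2.5 Feature Vector Construction Based on the LDA Topic Model

2.6 Bug Report Classification Method Based on Verb Phrases and a Topic Model

2.7 Experimental Results and Analysis

2.8 Chapter Summary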

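Chapter 3 Method for Matching Bug Reports to Source Code Based on Text Analysis

3.1 Introduction

3.2 Overall Approach

3.3 Data Preprocessing

3.4 Relevance Calculation Between Bug Reports and Code Files

3.4.1 Relevance Calculation Based on Token Matching

3.4.2 Text Similarity Calculation

3.4.3 SVM-Based Relevance Calculation

3.4.4 Linear Combination of Relevance Scores

3.5 Method for Matching Bug Reports to Code Files

3.6 Experimental Results and Analysis

3.7 Chapter Summary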

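Chapter 4 Similar Code Identification Method Based on Locality-Sensitive Hashing and Abstract Syntax Trees

4.1 Introduction

4.2 Overall Approach

4.3 Abstract Syntax Trees

4.4 Feature Vector Construction Based on Abstract Syntax Trees

4.5 Locality-Sensitive Hashing Algorithm

4.6 Similar Code Identification Method

4.7 Experimental Results and Analysis

4.8 Chapter Summary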

Conclusion

References

Papers Published and Other Achievements During the Master's Program

Harbin Institute of Technology Dissertation Statement of Originality and Usage Rights

Acknowledgments