Code Clone Detection Method for Large-Scale Source Code*

GUO Ying1,2, CHEN Fenghong1,2, ZHOU Minghui1,2+

1. Institute of Software, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China

2. Key Laboratory of High Confidence Software Technologies of Ministry of Education, Peking University, Beijing 100871, China

+Corresponding author: E-mail: zhmh@…

GUO Ying, CHEN Fenghong, ZHOU Minghui. Code clone detection method for large-scale source code. Journal of Frontiers of Computer Science and Technology, 2014, 8(4): 417-426.

Abstract: Detecting code clones helps to uncover plagiarism and copyright infringement, and supports code compaction, error detection, and the discovery of usage patterns, among other uses. Existing clone detection tools usually rely on complicated algorithms or require large amounts of computing resources, so they cannot be applied to large-scale code data. To make clone detection feasible on massive data, this paper proposes a new code clone detection algorithm. The algorithm combines the idea of content-defined chunking (CDC) from data de-duplication with that of the Simhash algorithm used to find duplicate web pages, and works by first chunking and then fuzzy matching. The algorithm is implemented on a data source of more than 500 million files, totalling 10 TB, drawn from a variety of open source projects. This paper compares how the choice of chunk length affects detection rate and detection time. The experimental results show that the new algorithm can be applied not only to detect large-scale code …

ISSN 1673-9418 CODEN JKYTA8
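The "first chunking, then fuzzy matching" idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the chunking rule, the hash functions, or the matching threshold, so the line-based window, the MD5-derived 64-bit hash, and the parameters `window`, `divisor`, and `max_distance` below are all assumptions chosen for illustration.

```python
import hashlib
import re


def _h64(text: str) -> int:
    """Stable 64-bit hash (first 8 bytes of MD5); an assumed choice."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


def cdc_chunks(lines, window=3, divisor=4):
    """Cut a list of source lines into chunks at content-defined boundaries.

    A boundary is placed after line i when the hash of the last `window`
    lines of the file is divisible by `divisor`. The decision depends only
    on local content, so an edit elsewhere in the file does not move it.
    """
    chunks, current = [], []
    for i, line in enumerate(lines):
        current.append(line)
        recent = "\n".join(lines[max(0, i - window + 1): i + 1])
        if _h64(recent) % divisor == 0:
            chunks.append(current)
            current = []
    if current:  # trailing lines after the last boundary
        chunks.append(current)
    return chunks


def simhash(chunk) -> int:
    """64-bit Simhash fingerprint over the word tokens of a chunk."""
    counts = [0] * 64
    for token in re.findall(r"\w+", "\n".join(chunk)):
        h = _h64(token)
        for b in range(64):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(64) if counts[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_clone(chunk_a, chunk_b, max_distance=3) -> bool:
    """Fuzzy match: fingerprints within `max_distance` bits count as clones."""
    return hamming(simhash(chunk_a), simhash(chunk_b)) <= max_distance
```

In this sketch, each file would be passed through `cdc_chunks`, each chunk fingerprinted with `simhash`, and pairs of chunks compared with `is_clone`. A larger `divisor` yields longer chunks on average, which is the kind of chunk-length trade-off between detection rate and detection time that the abstract says the paper evaluates.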

Journal of Frontiers of Computer Science and Technology, 1673-9418/2014/08(04)-0417-10, doi: 10.3778/j.issn.1673-9418.1311018

E-mail: fcst@… Tel: +86-10-89056056

* The National Natural Science Foundation of China under Grant Nos. 91118004 and 61073016; the National Basic Research Program of China (973 Program) under Grant No. 2011CB302604; the Joint Funds of the National Natural Science Foundation of China under Grant No. U1201252; the National High Technology Research and Development Program of China (863 Program) under Grant No. 2012AA011202.

Received 2013-09, Accepted 2013-11.

CNKI online-first publication: 2013-12-11, …/kcms/doi/10.3778/j.issn.1673-9418.1311018.html
