Journal of Tsinghua University (Science and Technology), 2018, Vol. 58, Issue 7: 630-638. DOI: 10.16511/j.cnki.qhdxxb.2018.25.029
Computer Science and Technology
Incremental analysis of open source repositories
XU Fu, YANG Zhanyu, CHEN Zhibo, SUN Yu, ZHANG Haiyan
School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China
Abstract: Tracing the origin of code is a common practice in open source software reuse, and it depends on efficient program analysis. Existing program analysis methods mainly recognize complete syntactic structures, so their analysis time depends on the total code size; they lack incremental analysis capability and struggle to analyze large open source code repositories efficiently. Exploiting the high similarity between adjacent snapshots in a repository, this paper proposes an incremental analysis method that analyzes only the code changed between snapshots, which reduces the amount of code to be analyzed. The method first parses file snapshots to obtain the content modified in each revision, then uses a mapping algorithm to map those modifications onto complete, analyzable functions, and finally converts the functions into fingerprints for function comparison. Compared with traditional analysis methods, the method reduces the amount of repository code that must be analyzed and speeds up function comparison, and so better supports code origin tracing and other open source reuse needs.
Key words: open source code; program analysis; incremental analysis; code repository
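The first two steps of the abstract (extracting modified content from snapshots and mapping it onto complete functions) can be illustrated with a minimal sketch. It is not the paper's mapping algorithm, which this page does not reproduce. It assumes a single-file unified diff as input and assumes that function spans `(name, start_line, end_line)` in the new snapshot are already known from some other source; the names `changed_new_lines`, `touched_functions`, and `EXAMPLE_DIFF` are hypothetical.

```python
import re

# Hunk header of the unified diff format: @@ -old_start,old_len +new_start,new_len @@
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def changed_new_lines(diff_text):
    """Return new-file line numbers touched by a single-file unified diff."""
    changed = set()
    new_line = None  # current line number in the new file; None before the first hunk
    for line in diff_text.splitlines():
        header = HUNK_HEADER.match(line)
        if header:
            new_line = int(header.group(3))
            continue
        if new_line is None:
            continue  # file headers ("---", "+++") precede the first hunk
        if line.startswith("+"):
            changed.add(new_line)  # added or modified line
            new_line += 1
        elif line.startswith("-"):
            # Removed line: it has no number in the new file, so mark the
            # position where it used to be (an approximation).
            changed.add(new_line)
        elif line.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        else:
            new_line += 1  # unchanged context line
    return changed


def touched_functions(changed, spans):
    """Return names of functions whose inclusive line span contains a changed line."""
    return [name for name, start, end in spans
            if any(start <= n <= end for n in changed)]


# Hypothetical diff against a Java file, used only for illustration.
EXAMPLE_DIFF = "\n".join([
    "--- a/Foo.java",
    "+++ b/Foo.java",
    "@@ -10,3 +10,4 @@",
    "     int add(int a, int b) {",
    "-        return a + b;",
    "+        int s = a + b;",
    "+        return s;",
    "     }",
])

lines = changed_new_lines(EXAMPLE_DIFF)
print(touched_functions(lines, [("add", 10, 13), ("sub", 20, 23)]))
```

Only the functions returned by `touched_functions` would need to be re-analyzed for the new snapshot, which is the source of the reduction in analysis scale that the abstract describes.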
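Received: 2017-12-05      Published: 2018-07-15
Funding: National Natural Science Foundation of China (61772078); Major Science and Technology Project of the Beijing Municipal Science and Technology Commission (D171100001817003)
Corresponding author: CHEN Zhibo, Professor, E-mail: zhibo@bjfu.edu.cn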
Cite this article:
XU Fu, YANG Zhanyu, CHEN Zhibo, SUN Yu, ZHANG Haiyan. Incremental analysis of open source repositories[J]. Journal of Tsinghua University (Science and Technology), 2018, 58(7): 630-638.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.25.029  or  http://jst.tsinghuajournals.com/CN/Y2018/V58/I7/630
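Fig. 1 Flowchart of the algorithm
Fig. 2 Unified diff hunk
Fig. 3 Function definition in Java
Table 1 Example of the N-gram generation process
Table 2 Incremental analysis results for a single file
Table 3 Incremental analysis results for projects
Fig. 4 Original function
Fig. 5 Function after a type-1 transformation
Fig. 6 Function after a type-2 transformation
Fig. 7 Function similarity analysis results
Table 4 Fingerprint similarity of functions across Redisson versions
Fig. 8 Redisson similarity analysis results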
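The captions above refer to N-gram generation and to similarity between function fingerprints, but this page does not reproduce the fingerprint scheme itself. The following is a minimal sketch under stated assumptions: fingerprints are counts of token trigrams, two fingerprints are compared by Manhattan distance, and the distance is normalized into a score between 0 and 1. The tokenizer, the value of `n`, the normalization, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokenizer: identifiers, integer literals, or any single non-space symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def ngram_fingerprint(source, n=3):
    """Count the token N-grams of a function body (the assumed fingerprint)."""
    tokens = TOKEN.findall(source)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def manhattan(fp_a, fp_b):
    """Manhattan distance between two N-gram count vectors."""
    # Counter returns 0 for N-grams that are absent from one fingerprint.
    return sum(abs(fp_a[g] - fp_b[g]) for g in fp_a.keys() | fp_b.keys())


def similarity(fp_a, fp_b):
    """Map the distance to [0, 1]; 1.0 means identical N-gram counts."""
    total = sum(fp_a.values()) + sum(fp_b.values())
    return 1.0 if total == 0 else 1.0 - manhattan(fp_a, fp_b) / total


original = ngram_fingerprint("int add(int a, int b) { return a + b; }")
modified = ngram_fingerprint("int add(int a, int b) { return a - b; }")
print(manhattan(original, modified), similarity(original, modified))
```

Because a fingerprint depends only on the text of one function, only the functions touched by a change need new fingerprints when a snapshot is analyzed incrementally.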