Abstract: Code traceability, a common practice when reusing open source software, relies heavily on efficient code analysis. Existing methods mainly identify complete grammatical structures, so their analysis time depends on the total code size; they cannot analyze incrementally and do not scale to large open source code repositories. This paper presents an incremental analysis method that exploits the similarity between adjacent snapshots of a repository and analyzes only the parts that changed, which effectively reduces the analysis scale. The method first parses the snapshots to retrieve the content modified between them, then maps these modifications onto complete, analyzable functions, and finally converts those functions to fingerprints for comparison. Compared with traditional analysis methods, this method significantly reduces the amount of repository code that must be analyzed, which speeds up function comparison and improves tracing of the origin of open source code.
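The pipeline summarized in the abstract (diff two snapshots, map the changed lines onto complete functions, fingerprint those functions) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the snapshots are Python source files so that the standard `ast` module can locate function boundaries, and it assumes a token trigram count as the fingerprint with Manhattan distance as the comparison; the abstract specifies neither. All function names (`changed_lines`, `touched_functions`, `fingerprint`, `manhattan`) are hypothetical.

```python
import ast
import difflib
import re
from collections import Counter


def changed_lines(old_src, new_src):
    """Return 1-based line numbers in new_src that were inserted or replaced.

    Pure deletions leave no line in the new snapshot and are ignored here.
    """
    old, new = old_src.splitlines(), new_src.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changed = set()
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.update(range(j1 + 1, j2 + 1))
    return changed


def touched_functions(new_src, changed):
    """Map changed lines to the complete functions that contain them."""
    result = {}
    for node in ast.walk(ast.parse(new_src)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            span = range(node.lineno, node.end_lineno + 1)
            if changed.intersection(span):
                result[node.name] = ast.get_source_segment(new_src, node)
    return result


def fingerprint(func_src, n=3):
    """Assumed fingerprint: a multiset of token n-grams."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", func_src)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def manhattan(fp_a, fp_b):
    """Manhattan distance between two n-gram count fingerprints."""
    return sum(abs(fp_a[k] - fp_b[k]) for k in fp_a.keys() | fp_b.keys())


# Two adjacent snapshots of the same file; only `scale` differs.
OLD = "def add(a, b):\n    return a + b\n\ndef scale(x):\n    return x * 2\n"
NEW = "def add(a, b):\n    return a + b\n\ndef scale(x):\n    return x * 3\n"

# Only the functions touched by the change are fingerprinted.
to_analyze = touched_functions(NEW, changed_lines(OLD, NEW))
fingerprints = {name: fingerprint(src) for name, src in to_analyze.items()}
```

In this example only `scale` is fingerprinted, while the unchanged `add` is skipped entirely. That skipping is the source of the savings: the work per snapshot grows with the size of the change, not with the size of the repository.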