Bug report quality detection based on the BM25 algorithm
CHEN Lele1, HUANG Song1,2, SUN Jinlei1, HUI Zhanwei1, WU Kaishun1
1. College of Command & Control Engineering, Army Engineering University of PLA, Nanjing 210007, China; 2. PLA Military Software Testing and Evaluation Center, Nanjing 210007, China
Abstract: Bug reports are used to identify and track defects in order to improve software quality. Software testing often involves many users testing in parallel, and the resulting large number of bug reports must be integrated while false or duplicate reports are removed. This paper presents an automatic method for detecting bug report quality based on the BM25 algorithm. After the bug reports are preprocessed, a matching library is built from the test requirements and test report samples. The BM25 algorithm is then used to calculate the similarities between reports in order to identify accurate bug reports. Tests on data from a software testing contest show that the model correctly judges most bug reports and improves the efficiency of identifying false negatives and duplicates.
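The similarity step described in the abstract can be illustrated with a minimal sketch of Okapi BM25 scoring. This is not the authors' implementation: the tokenizer, the parameter values k1 = 1.5 and b = 0.75, the smoothed IDF variant, and the sample library entries are all assumptions chosen for illustration. The class name `BM25` and the variables `library`, `report`, and `scores` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a stand-in for real preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Okapi BM25 over a small in-memory library of documents (illustrative only)."""

    def __init__(self, documents, k1=1.5, b=0.75):
        # k1 and b are commonly used defaults, assumed here; the paper's values are not given.
        self.k1 = k1
        self.b = b
        self.docs = [tokenize(d) for d in documents]
        self.term_freqs = [Counter(d) for d in self.docs]
        self.doc_lens = [len(d) for d in self.docs]
        self.avg_len = sum(self.doc_lens) / len(self.docs)
        n_docs = len(self.docs)
        # Document frequency: number of library entries containing each term.
        doc_freq = Counter(term for tf in self.term_freqs for term in tf)
        # Smoothed IDF (the "+ 1" inside the log keeps every weight positive).
        self.idf = {
            term: math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            for term, df in doc_freq.items()
        }

    def score(self, query, index):
        """BM25 score of the query text against the library entry at `index`."""
        tf = self.term_freqs[index]
        length_norm = self.k1 * (
            1 - self.b + self.b * self.doc_lens[index] / self.avg_len
        )
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            f = tf[term]
            total += self.idf[term] * f * (self.k1 + 1) / (f + length_norm)
        return total

    def scores(self, query):
        """Scores of the query against every library entry, in library order."""
        return [self.score(query, i) for i in range(len(self.docs))]


# Hypothetical matching library built from requirement and sample-report text.
library = [
    "Login page crashes when the login button is tapped",
    "Search results are not sorted by price",
    "Profile picture upload fails with a timeout error",
]
model = BM25(library)

# A hypothetical incoming bug report to be checked against the library.
report = "App crashes after tapping the login button"
scores = model.scores(report)
best_match = scores.index(max(scores))
```

In a pipeline like the one the abstract outlines, a report whose best score falls below some threshold might be flagged as unrelated to the test requirements, and two reports that score highly against each other might be flagged as duplicates. Any such threshold would need to be tuned on real data and is not specified here.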