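Bug report quality detection based on the BM25 algorithm

CHEN Lele, HUANG Song, SUN Jinlei, HUI Zhanwei, WU Kaishun

Journal of Tsinghua University (Science and Technology), 2020, Vol. 60, Issue (10): 829-836. DOI: 10.16511/j.cnki.qhdxxb.2020.25.002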

Special topic: Fault-tolerant computing


Abstract

Bug reports record and track defects and provide the basis for resolving software quality problems. Software testing is now often carried out by many testers working in parallel, so consolidating the resulting mass of bug reports, in particular removing false and duplicate reports, is a serious challenge. This paper therefore proposes an automated bug report detection method based on the BM25 algorithm. After the bug reports are preprocessed, a matching library is built from the test requirements and sample test reports. The BM25 algorithm then computes a similarity score between each report and the library, and that score is used to judge whether the report is correct. Experiments on data from a software testing contest show that the method correctly judges most bug reports and effectively improves the efficiency of removing false and duplicate reports.
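The scoring step described in the abstract can be illustrated with a minimal sketch of Okapi BM25 in Python. This is not the authors' implementation: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, the sample `library` entries, and the `threshold` in `is_plausible` are all assumptions chosen for illustration. Chinese-language reports would also need a word segmentation step before scoring, which is omitted here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


class BM25:
    """Okapi BM25 scorer over a small in-memory matching library."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        # k1 and b are common default values, assumed here.
        self.k1 = k1
        self.b = b
        self.docs = [tokenize(doc) for doc in corpus]
        self.n_docs = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.n_docs
        self.term_freqs = [Counter(d) for d in self.docs]
        # Document frequency: number of library entries containing each term.
        doc_freq = Counter()
        for d in self.docs:
            doc_freq.update(set(d))
        # Smoothed IDF that stays non-negative (an assumed variant).
        self.idf = {
            term: math.log(1 + (self.n_docs - n + 0.5) / (n + 0.5))
            for term, n in doc_freq.items()
        }

    def score(self, query, index):
        """BM25 score of the query text against library entry `index`."""
        tf = self.term_freqs[index]
        doc_len = len(self.docs[index])
        norm = self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def best_match(self, query):
        """Return (index, score) of the highest-scoring library entry."""
        scores = [self.score(query, i) for i in range(self.n_docs)]
        best = max(range(self.n_docs), key=scores.__getitem__)
        return best, scores[best]


def is_plausible(report, matcher, threshold=1.0):
    """Hypothetical decision rule: accept a report if its best score
    against the matching library reaches the (assumed) threshold."""
    _, score = matcher.best_match(report)
    return score >= threshold


# Hypothetical matching library built from requirements and sample reports.
library = [
    "login button does not respond after entering a valid password",
    "search results page shows an empty list for existing items",
    "application crashes when uploading a large image file",
]
matcher = BM25(library)
```

With this setup, `matcher.best_match("app crashes on image upload")` selects the third library entry, because only that entry shares the terms "crashes" and "image" with the report. A report that shares no terms with the library scores zero and is rejected by `is_plausible`. In practice the threshold would have to be tuned on labeled reports.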

Key words

software testing / BM25 algorithm / bug report / natural language processing

Cite this article

CHEN Lele, HUANG Song, SUN Jinlei, HUI Zhanwei, WU Kaishun. Bug report quality detection based on the BM25 algorithm[J]. Journal of Tsinghua University (Science and Technology), 2020, 60(10): 829-836. https://doi.org/10.16511/j.cnki.qhdxxb.2020.25.002


Funding

HUANG Song, Professor, E-mail: huangs_0317@126.com
