Journal of Tsinghua University (Science and Technology)  2020, Vol. 60 Issue (10): 829-836    DOI: 10.16511/j.cnki.qhdxxb.2020.25.002
Special Topic: Fault-Tolerant Computing
Bug report quality detection based on the BM25 algorithm
CHEN Lele1, HUANG Song1,2, SUN Jinlei1, HUI Zhanwei1, WU Kaishun1
1. College of Command & Control Engineering, Army Engineering University of PLA, Nanjing 210007, China;
2. PLA Military Software Testing and Evaluation Center, Nanjing 210007, China
Abstract: Bug reports record and track defects and provide the basis for resolving software quality problems. Software testing is now often carried out by many testers working in parallel, so consolidating the resulting mass of bug reports, in particular removing false and duplicate reports, is a serious challenge. This paper proposes an automated bug report detection method based on the BM25 algorithm. After the bug reports are preprocessed, a matching library is built from the test requirements and sample test reports. The BM25 algorithm then computes a similarity score between each report and the library, and this score is used to judge whether the report is correct. Experiments on data from a software testing contest show that the method correctly judges most bug reports and improves the efficiency of removing false and duplicate reports.
Key words: software testing; BM25 algorithm; bug report; natural language processing
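The scoring step described in the abstract can be illustrated with a minimal sketch of Okapi BM25. It is not the authors' implementation. The function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75` (common defaults), the smoothed IDF variant, the whitespace tokenization, and the contents of `matching_library` are all assumptions made for illustration. The abstract does not specify the paper's preprocessing or its decision rule.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document in corpus_tokens.

    k1 and b are common default values, not values taken from the paper.
    """
    n_docs = len(corpus_tokens)
    if n_docs == 0:
        return []
    # Average document length; fall back to 1.0 if every document is empty.
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs or 1.0
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        # Each distinct query term contributes once.
        for term in set(query_tokens):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Hypothetical matching library: tokenized entries standing in for text
# drawn from test requirements and sample reports (invented examples).
matching_library = [
    "login button does not respond after password input".split(),
    "search page crashes when keyword is empty".split(),
    "profile image upload fails for large files".split(),
]

# Hypothetical submitted bug report, tokenized on whitespace.
report = "app crashes on search page with empty keyword".split()
scores = bm25_scores(report, matching_library)
best_index = max(range(len(scores)), key=scores.__getitem__)
```

Here `best_index` points at the library entry most similar to the submitted report. A real pipeline would still need a rule for turning scores into a verdict, for example a cutoff on the best score, and the abstract gives no such rule.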
Received: 2019-09-02      Published: 2020-07-09
Corresponding author: HUANG Song, Professor, E-mail: huangs_0317@126.com
Cite this article:
CHEN Lele, HUANG Song, SUN Jinlei, HUI Zhanwei, WU Kaishun. Bug report quality detection based on the BM25 algorithm[J]. Journal of Tsinghua University (Science and Technology), 2020, 60(10): 829-836.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2020.25.002  or  http://jst.tsinghuajournals.com/CN/Y2020/V60/I10/829