Journal of Tsinghua University (Science and Technology)  2019, Vol. 59 Issue (4): 262-269    DOI: 10.16511/j.cnki.qhdxxb.2018.26.059
Computer Science and Technology
Label noise filtering based on the data distribution
CHEN Qingqiang¹, WANG Wenjian², JIANG Gaoxia¹
1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China;
2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China
Abstract: In supervised learning, label noise can substantially degrade model building. Existing approaches fall mainly into two groups, filtering based on model predictions and robust modeling, but they can suffer from poor filtering quality or low filtering efficiency. This paper proposes a label noise filtering method based on the data distribution. For each sample in the data set, the region formed by the sample and its neighbors is classified as a high-density or a low-density region according to the distribution of the samples in its neighborhood. A different noise filtering rule is then applied to each type of region. Compared with existing methods, starting from the data distribution makes the filtering more targeted, which improves filtering quality and helps avoid over-filtering. Because noisy samples are handled with filtering rules rather than a trained noise prediction model, filtering is also more efficient. Experiments on 15 standard multi-class UCI data sets show that, when the noise level is below 30%, the method performs well in both noise detection efficiency and classification accuracy.
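The abstract names the two steps (split neighborhoods into high-density and low-density regions, then apply a region-specific rule) but does not state the density criterion or the rules themselves. The following is only a minimal sketch of that general idea, not the paper's DDF algorithm. Everything concrete in it is an assumption made for illustration: the function names `k_nearest` and `filter_label_noise`, the use of the mean distance to the k nearest neighbors as the density measure, the median as the high/low cutoff, and the two disagreement thresholds `high_thresh` and `low_thresh`.

```python
import math
import statistics


def k_nearest(points, i, k):
    """Return (indices, distances) of the k nearest neighbors of points[i]."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    nearest = dists[:k]
    return [j for _, j in nearest], [d for d, _ in nearest]


def filter_label_noise(points, labels, k=3, high_thresh=0.5, low_thresh=1.0):
    """Flag samples whose label disagrees with their neighborhood.

    Assumed rules (hypothetical, not taken from the paper):
      * a neighborhood is "high density" if its mean k-NN distance is at or
        below the median over all samples, otherwise "low density";
      * high density: flag when more than `high_thresh` of neighbors disagree;
      * low density: flag only when at least `low_thresh` of neighbors
        disagree, a stricter rule meant to limit over-filtering.
    """
    neighbors, mean_dist = [], []
    for i in range(len(points)):
        idx, dist = k_nearest(points, i, k)
        neighbors.append(idx)
        mean_dist.append(sum(dist) / len(dist))

    cutoff = statistics.median(mean_dist)

    flagged = []
    for i, idx in enumerate(neighbors):
        disagree = sum(labels[j] != labels[i] for j in idx) / len(idx)
        if mean_dist[i] <= cutoff:          # high-density region
            is_noise = disagree > high_thresh
        else:                               # low-density region
            is_noise = disagree >= low_thresh
        if is_noise:
            flagged.append(i)
    return flagged


# Toy data: a tight cluster of class "a" whose center point is mislabeled "b",
# plus a separate cluster of class "b".
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]
print(filter_label_noise(points, labels, k=3))  # -> [4]
```

Because the decision is a fixed rule over neighbor labels, no classifier has to be trained to predict noise, which is the efficiency argument the abstract makes. The brute-force neighbor search here is quadratic in the number of samples and is kept only for readability.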
Key words: label noise; noise filtering; robust modeling; data distribution
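Received: 2018-09-10      Published: 2019-04-09
Funding: National Natural Science Foundation of China (61673249); Shanxi Province Research Foundation for Returned Overseas Scholars (2016-004); CERNET Next Generation Internet Technology Innovation Project (NGII20170601)
Corresponding author: WANG Wenjian, Professor, E-mail: wjwang@sxu.edu.cn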
Cite this article:
CHEN Qingqiang, WANG Wenjian, JIANG Gaoxia. Label noise filtering based on the data distribution. Journal of Tsinghua University(Science and Technology), 2019, 59(4): 262-269.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.26.059  or  http://jst.tsinghuajournals.com/CN/Y2019/V59/I4/262
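  Fig. 1 (Color online) Partition of the data distribution
  Fig. 2 Example of noise filtering
  Fig. 3 The DDF algorithm
  Table 1 Description of the data sets
  Fig. 4 (Color online) Effect of parameters k, αF
  Fig. 5 (Color online) Effect of parameter kF
  Fig. 6 Comparison of filtering performance
  Fig. 7 Comparison of algorithm running time
  Fig. 8 Sequential (period-over-period) ratio of classification accuracy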