Please wait a minute...
 首页  期刊介绍 期刊订阅 联系我们 横山亮次奖 百年刊庆
 
最新录用  |  预出版  |  当期目录  |  过刊浏览  |  阅读排行  |  下载排行  |  引用排行  |  横山亮次奖  |  百年刊庆
清华大学学报(自然科学版)  2023, Vol. 63 Issue (5): 740-753    DOI: 10.16511/j.cnki.qhdxxb.2023.25.003
  大数据 本期目录 | 过刊浏览 | 高级检索 |
针对大规模数据的分布一致缺失值插补算法
余嘉茵1, 何玉林1,2, 崔来中1,2, 黄哲学1,2
1. 深圳大学 计算机与软件学院, 大数据所, 深圳 518060;
2. 广东省人工智能与数字经济实验室(深圳), 深圳 518107
Distribution consistency-based missing value imputation algorithm for large-scale data sets
YU Jiayin1, HE Yulin1,2, CUI Laizhong1,2, HUANG Zhexue1,2
1. Big Data Institute, College of Computer Science & Software Engineering, Shenzhen University, Shenzhen 518060, China;
2. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, China
全文: PDF(5395 KB)   HTML
输出: BibTeX | EndNote (RIS)      
摘要 缺失值插补(missing value imputation,MVI)作为数据挖掘领域的重要研究分支,旨在为机器学习算法的训练提供高质量的数据支持。不同于现有的以算法性能提升为导向的MVI算法,为对大规模数据的缺失值进行有效插补,该文提出一种以数据结构还原为导向的数据分布一致MVI (distribution consistency-based MVI,DC-MVI)算法。首先,DC-MVI算法基于概率分布一致性原则构建了用于确定最优插补值的目标函数;其次,利用推导出的可行缺失值优化规则获取与原始完整值保持最大分布一致性且方差最为接近的插补值;最后,在分布式环境下,针对大数据的随机样本划分(random sample partition,RSP)数据块并行训练DC-MVI算法,获得大规模数据缺失值对应的插补值。实验结果表明:DC-MVI算法不仅能生成与原始完整值保持给定显著性水平下概率分布一致的插补值,还具有比另外5种经典的和3种最新的MVI算法更快的插补速度和更好的插补效果,进而证实DC-MVI算法是一种可行的大规模数据MVI算法。
服务
把本文推荐给朋友
加入引用管理器
E-mail Alert
RSS
作者相关文章
余嘉茵
何玉林
崔来中
黄哲学
关键词 文字信息处理缺失值插补分布一致性最大均值差异大规模数据随机样本划分分布式计算    
Abstract:[Objective] As a significant research branch in the field of data mining, missing value imputation (MVI) aims to provide high-quality data support for the training of machine learning algorithms. However, MVI results for large-scale data sets are not ideal in terms of restoring data distribution and improving data prognosis accuracy. To improve the performance of the existing MVI algorithms, we propose a distribution consistency-based MVI (DC-MVI) algorithm that attempts to restore the original data structure by imputing the missing values for large-scale data sets.[Methods] First, the DC-MVI algorithm developed an objective function to determine the optimal imputation values based on the principle of probability distribution consistency. Second, the data set is preprocessed by random initialization of missing values and normalization, and a feasible missing value update rule is derived to obtain the imputation values with the closest variance and the greatest consistency with the complete original values. Next, in a distributed environment, the large-scale data set is divided into multiple groups of random sample partition (RSP) data blocks with the same distribution as the entire data set by taking into account the statistical properties of the large-scale data set. Finally, the DC-MVI algorithm is trained in parallel to obtain the imputation value corresponding to the missing value of the large-scale data set and preserve distribution consistency with the non-missing values. The rationality experiments verify the convergence of the objective function and the contribution of DC-MVI to distribution consistency. In addition, the effectiveness experiments assess the performance of DC-MVI and eight other MVI algorithms (mean, KNN, MICE, RF, EM, SOFT, GAIN, and MIDA) through the following three indicators:distribution consistency, time complexity, and classification accuracy.[Results] The experimental results on seven selected large-scale data sets showed that:1) The objective function of the DC-MVI method was effective, and the missing value update rule was feasible, allowing the imputation values to remain stable throughout the adjustment process; 2) the DC-MVI algorithm obtained the smallest maximum mean discrepancy and Jensen-Shannon divergence on all data sets, showing that the proposed method had a more consistent probability distribution with the complete original values under the given significance level; 3) the running time of the DC-MVI algorithm tended to be stable in the time comparison experiment, whereas the running time of other state-of-the-art MVI methods increased linearly with data volume; 4) the DC-MVI approach could produce imputation values that were more consistent with the original data set compared to existing methods, which was beneficial for subsequent data mining analysis.[Conclusions] Considering the peculiarities and limitations of missing large-scale data, this paper incorporates RSP into the imputation algorithm and derives the update rules of imputation values to restore the data distribution and further confirm the effectiveness and practical performance of DC-MVI in the large-scale data set imputation, such as preserving distribution consistency and increasing imputation quality. The method proposes in this paper achieves the desired result and represents a viable solution to the problem of large-scale data imputation.
Key wordsword information processing    missing value imputation    distribution consistency    maximum mean discrepancy    large-scale data    random sample partition    distributed computing
收稿日期: 2022-09-25      出版日期: 2023-04-23
基金资助:国家自然科学基金面上项目(61972261);广东省自然科学基金面上项目(2314050006683);深圳市基础研究重点项目(JCYJ20220818100205012);深圳市基础研究面上项目(JCYJ20210324093609026)
通讯作者: 何玉林,副研究员,E-mail:yulinhe@gml.ac.cn      E-mail: yulinhe@gml.ac.cn
作者简介: 余嘉茵(1998-),女,硕士研究生。
引用本文:   
余嘉茵, 何玉林, 崔来中, 黄哲学. 针对大规模数据的分布一致缺失值插补算法[J]. 清华大学学报(自然科学版), 2023, 63(5): 740-753.
YU Jiayin, HE Yulin, CUI Laizhong, HUANG Zhexue. Distribution consistency-based missing value imputation algorithm for large-scale data sets. Journal of Tsinghua University(Science and Technology), 2023, 63(5): 740-753.
链接本文:  
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2023.25.003  或          http://jst.tsinghuajournals.com/CN/Y2023/V63/I5/740
  
  
  
  
  
  
  
  
  
  
  
[1] SCHEET P, STEPHENS M. A fast and flexible statistical model for large-scale population genotype data:Applications to inferring missing genotypes and haplotypic phase[J]. The American Journal of Human Genetics, 2006, 78(4):629-644.
[2] JEREZ J M, MOLINA I, GARCÍA-LAENCINA P J, et al. Missing data imputation using statistical and machine learning methods in a real breast cancer problem[J]. Artificial Intelligence in Medicine, 2010, 50(2):105-115.
[3] YUAN H W, XU G M, YAO Z J, et al. Imputation of missing data in time series for air pollutants using long short-term memory recurrent neural networks[C]//Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers. Singapore:ACM, 2018:1293-1300.
[4] VAN BUUREN S, GROOTHUIS-OUDSHOORN K. MICE:Multivariate imputation by chained equations in R[J]. Journal of Statistical Software, 2011, 45(3):1-67.
[5] NGUYEN C D, CARLIN J B, LEE K J. Model checking in multiple imputation:An overview and case study[J]. Emerging Themes in Epidemiology, 2017, 14(1):1-12.
[6] ZHANG Z H. Missing data imputation:Focusing on single imputation[J]. Annals of Translational Medicine, 2016, 4(1):9.
[7] KHOSRAVI P, LIANG Y T, CHOI Y J, et al. What to expect of classifiers? Reasoning about logistic regression with missing features[C]//Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19). Macao, China:IJCAI, 2019:2716-2724.
[8] SEFIDIAN A M, DANESHPOUR N. Missing value imputation using a novel grey based fuzzy c-means, mutual information based feature selection, and regression model[J]. Expert Systems with Applications, 2019, 115:68-94.
[9] ZHU X F, ZHANG S C, JIN Z, et al. Missing value estimation for mixed-attribute data sets[J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(1):110-121.
[10] LIU X W, ZHU X Z, LI M M, et al. Multiple kernel k-means with incomplete kernels[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(5):1191-1204.
[11] CHEN X Y, YANG J M, SUN L J. A nonconvex low-rank tensor completion model for spatiotemporal traffic data imputation[J]. Transportation Research Part C:Emerging Technologies, 2020, 117:102673.
[12] ZHU Y M, LI Z, ZHU H Z, et al. A compressive sensing approach to urban traffic estimation with probe vehicles[J]. IEEE Transactions on Mobile Computing, 2013, 12(11):2289-2302.
[13] YU J R, STETTLER M E J, ANGELOUDIS P, et al. Urban network-wide traffic speed estimation with massive ride-sourcing GPS traces[J]. Transportation Research Part C:Emerging Technologies, 2020, 112:136-152.
[14] HUANG J L, KEUNG J W, SARRO F, et al. Cross-validation based K nearest neighbor imputation for software quality datasets:An empirical study[J]. Journal of Systems and Software, 2017, 132:226-252.
[15] AL-HELALI B, CHEN Q, XUE B, et al. A hybrid GP-KNN imputation for symbolic regression with missing values[C]//Australasian Joint Conference on Artificial Intelligence. Cham:Springer, 2018:345-357.
[16] CHOUDHURY S J, PAL N R. Imputation of missing data with neural networks for classification[J]. Knowledge-Based Systems, 2019, 182:104838.
[17] YOON J, JORDON J, VAN DER SCHAAR M. GAIN:Missing data imputation using generative adversarial nets[C]//Proceeding of the 35th International Conference on Machine Learning. Stockholm, Sweden:PMLR, 2018:5689-5698.
[18] MA Q, LEE W C, FU T Y, et al. MIDIA:Exploring denoising autoencoders for missing data imputation[J]. Data Mining and Knowledge Discovery, 2020, 34(6):1859-1897.
[19] LIN W C, TSAI C F, ZHONG J R. Deep learning for missing value imputation of continuous data and the effect of data discretization[J]. Knowledge-Based Systems, 2022, 239:108079.
[20] HE K M, CHEN X L, XIE S N, et al. Masked autoencoders are scalable vision learners[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA:IEEE, 2022:16000-16009.
[21] BADSHA M B, LI R, LIU B X, et al. Imputation of single-cell gene expression with an autoencoder neural network[J]. Quantitative Biology, 2020, 8(1):78-94.
[22] BANSAL P, DESHPANDE P, SARAWAGI S. Missing value imputation on multidimensional time series[J]. Proceedings of the VLDB Endowment, 2021, 14(11):2533-2545.
[23] 孟小峰, 慈祥. 大数据管理:概念、技术与挑战[J]. 计算机研究与发展, 2013, 50(1):146-169. MENG X F, CI X. Big data management:Concept, techniques and challenge[J]. Journal of Computer Research and Development, 2013, 50(1):146-169. (in Chinese)
[24] CHEN M, MAO S, ZHANG Y, et al. Big data storage[M]. Cham:Springer, 2014:33-49.
[25] SHVACHKO K, KUANG H R, RADIA S, et al. The hadoop distributed file system[C]//2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). Incline Village, USA:IEEE, 2010:1-10.
[26] DEAN J, GHEMAWAT S. MapReduce:Simplified data processing on large clusters[C]//Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation. San Francisco, USA:ACM, 2004:137-150.
[27] 黄哲学, 何玉林, 魏丞昊, 等. 大数据随机样本划分模型及相关分析计算技术[J]. 数据采集与处理, 2019, 34(3):373-385. HUANG Z X, HE Y L, WEI C H, et al. Random sample partition data model and related technologies for big data analysis[J]. Journal of Data Acquisition and Processing, 2019, 34(3):373-385. (in Chinese)
[28] STRIKE K, EL-EMAM K, MADHAVJI N. Software cost estimation with incomplete data[J]. IEEE Transactions on Software Engineering, 2001, 27(10):890-908.
[29] ZHANG S C, QIN Z, LING C X, et al. "Missing is useful":Missing values in cost-sensitive decision trees[J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(12):1689-1693.
[30] LITTLE R J A, RUBIN D B. Statistical analysis with missing data[M]. Hoboken:John Wiley & Sons, 2019.
[31] MORRIS T P, WHITE I R, ROYSTON P. Tuning multiple imputation by predictive mean matching and local residual draws[J]. BMC Medical Research Methodology, 2014, 14(1):75.
[32] LI D W, ZHANG H Q, LI T R, et al. Hybrid missing value imputation algorithms using fuzzy C-means and vaguely quantified rough set[J]. IEEE Transactions on Fuzzy Systems, 2022, 30(5):1396-1408.
[33] REZVAN P H, LEE K J, SIMPSON J A. The rise of multiple imputation:A review of the reporting and implementation of the method in medical research[J]. BMC Medical Research Methodology, 2015, 15(1):30.
[34] HUGHES R A, HERON J, STERNE J A C, et al. Accounting for missing data in statistical analyses:Multiple imputation is not always the answer[J]. International Journal of Epidemiology, 2019, 48(4):1294-1304.
[35] ZHANG S C. Nearest neighbor selection for iteratively KNN imputation[J]. Journal of Systems and Software, 2012, 85(11):2541-2552.
[36] KEERIN P, KURUTACH W, BOONGOEN T. Cluster-based KNN missing value imputation for DNA microarray data[C]//2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC). Korea:IEEE, 2012:445-450.
[37] VINCENT P, LAROCHELLE H, BENGIO Y, et al. Extracting and composing robust features with denoising autoencoders[C]//Proceedings of the 25th International Conference on Machine Learning. Helsinki, Finland:ACM, 2008:1096-1103.
[38] GONDARA L, WANG K. MIDA:Multiple imputation using denoising autoencoders[C]//Proceeding of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining. Melbourne, Australia:Springer, 2018:260-272.
[39] SOVILJ D, EIROLA E, MICHE Y, et al. Extreme learning machine for missing data using multiple imputations[J]. Neurocomputing, 2016, 174:220-231.
[40] LU C B, MEI Y. An imputation method for missing data based on an extreme learning machine auto-encoder[J]. IEEE Access, 2018, 6:52930-52935.
[41] GARCÍA J C F, KALENATIC D, BELLO C A L. Missing data imputation in multivariate data by evolutionary algorithms[J]. Computers in Human Behavior, 2011, 27(5):1468-1474.
[42] ZHANG Y, ZHOU B H, CAI X R, et al. Missing value imputation in multivariate time series with end-to-end generative adversarial networks[J]. Information Sciences, 2021, 551:67-82.
[43] DEMPSTER A P, LAIRD N M, RUBIN D B. Maximum likelihood from incomplete data via theEM algorithm[J]. Journal of the Royal Statistical Society:Series B (Methodological), 1977, 39(1):1-22.
[44] RAHMAN M G, ISLAM M Z. Missing value imputation using a fuzzy clustering-based EM approach[J]. Knowledge and Information Systems, 2016, 46(2):389-422.
[45] GOLD M S, BENTLER P M. Treatments of missing data:A Monte Carlo comparison of RBHDI, iterative stochastic regression imputation, and expectation-maximization[J]. Structural Equation Modeling:A Multidisciplinary Journal, 2000, 7(3):319-355.
[46] KURUCZ M, BENCZỨR A A, CSALOGÁNY K. Methods for large scale SVD with missing values[C]//Proceedings of KDD Cup and Workshop. San Jose, USA:ACM, 2007, 12:31-38.
[47] MAZUMDER R, HASTIE T, TIBSHIRANI R. Spectral regularization algorithms for learning large incomplete matrices[J]. The Journal of Machine Learning Research, 2010, 11:2287-2322.
[48] ZHU X S, WANG J Y, SUN B, et al. An efficient ensemble method for missing value imputation in microarray gene expression data[J]. BMC Bioinformatics, 2021, 22(1):188.
[49] PETROZZIELLO A, JORDANOV I, SOMMEREGGER C.Distributed neural networks for missing big data imputation[C]//2018 International Joint Conference on Neural Networks (IJCNN). Rio de Janeiro, Brazil:IEEE, 2018:1-8.
[50] 何玉林, 黄德发, 戴德鑫, 等. 最大均方差异统计量的一般界[J]. 应用数学, 2021, 34(2):284-288. HE Y L, HUANG D F, DAI D X, et al. General bounds for maximum mean discrepancy statistics[J]. Mathematica Applicata, 2021, 34(2):284-288. (in Chinese)
[51] GRETTON A, BORGWARDT K M, RASCH M J, et al. A kernel two-sample test[J]. The Journal of Machine Learning Research, 2012, 13(1):723-773.
[52] TANG F, ISHWARAN H. Random forest missing data algorithms[J]. Statistical Analysis and Data Mining:The ASA Data Science Journal, 2017, 10(6):363-377.
[53] OpenIDEA-Yunan University, Ycimpute[EB/OL]. (2020-08-23)[2022-09-25]. https://github.com/OpenIDEA-YunanUniversity/ycimpute.
[54] RUBINSTEYN A, FELDMAN S. Fancyimpute:An imputation library for python[EB/OL].(2021-10-22)[2022-09-25]. https://github.com/iskandr/fancyimpute.
[55] UC Irvine Machine Leaning Repository. UCI machine learning repository[EB/OL]. (2021-05-02)[2022-09-25]. http://archive.ics.uci.edu/ml.
[56] TRIGUERO I, GONZÁLEZ S, MOYANO J M, et al. KEEL 3.0:An open source software for multi-stage analysis in data mining[J]. International Journal of Computational Intelligence Systems, 2017, 10(1):1238-1249.
[57] WHITE I R, ROYSTON P, WOOD A M. Multiple imputation using chained equations:Issues and guidance for practice[J]. Statistics in Medicine, 2011, 30(4):377-399.
[1] 张婧, 黄德根, 黄锴宇, 刘壮, 孟祥主. 基于λ-主动学习方法的中文微博分词[J]. 清华大学学报(自然科学版), 2018, 58(3): 260-265.
[2] 田文洪, 李国忠, 陈瑜, 黄超杰, 杨吴同. 一种兼顾负载均衡的Hadoop集群动态节能方法[J]. 清华大学学报(自然科学版), 2016, 56(11): 1226-1231.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
版权所有 © 《清华大学学报(自然科学版)》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn