Please wait a minute...
 首页  期刊介绍 期刊订阅 联系我们
 
最新录用  |  预出版  |  当期目录  |  过刊浏览  |  阅读排行  |  下载排行  |  引用排行  |  百年期刊
Journal of Tsinghua University(Science and Technology)    2018, Vol. 58 Issue (12) : 1037-1050     DOI: 10.16511/j.cnki.qhdxxb.2018.22.053
COMPUTER SCIENCE AND TECHNOLOGY |
Survey of structured data cleaning methods
HAO Shuang1,2, LI Guoliang2, FENG Jianhua2, WANG Ning1
1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China;
2. Database Group, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Download: PDF(5925 KB)  
Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks    
Abstract  Data cleaning is the process of detecting and repairing dirty data which is often needed in data analysis and management. This paper classifies and summarizes the traditional and advanced data cleaning techniques and identifies potential directions for further work. This study first formally defines the cleaning problem for structured data and then describes error detection methods for missing data, redundant data, conflicting data and erroneous data. The data cleaning methods are then summarized based on their error elimination method, including constraint-based data cleaning, rule-based data cleaning, statistical data cleaning and human-in-the-loop data cleaning. Some important datasets and noise injection tools are introduced as well. Open research problems and future research directions are also discussed.
Keywords data cleaning      dirty data      error detection      error elimination     
Issue Date: 13 December 2018
Service
E-mail this article
E-mail Alert
RSS
Articles by authors
HAO Shuang
LI Guoliang
FENG Jianhua
WANG Ning
Cite this article:   
HAO Shuang,LI Guoliang,FENG Jianhua, et al. Survey of structured data cleaning methods[J]. Journal of Tsinghua University(Science and Technology), 2018, 58(12): 1037-1050.
URL:  
http://jst.tsinghuajournals.com/EN/10.16511/j.cnki.qhdxxb.2018.22.053     OR     http://jst.tsinghuajournals.com/EN/Y2018/V58/I12/1037
  
  
  
  
  
  
  
  
  
[1] SHALLCROSS S. 2 Reasons why your data is lying to you[R/OL]. (2016-09-15)[2018-06-25]. http://hollistibbetts.sys-con.com/node/1975126.
[2] BUCKLEY J. Causes of dirty data and how to combat them[R/OL]. (2018-07-13)[2018-06-25]. https://www.qubole.com/blog/causes-of-dirty-data-and-how-to-combat-them.
[3] GALHARDAS H, FLORESCU D, SHASHA D, et al. An extensible framework for data cleaning[C]//IEEE 16th International Conference on Data Engineering. San Diego, USA, 2000.
[4] REKATSINAS T, CHU X, ILYAS I F, et al. HoloClean:Holistic data repairs with probabilistic inference[J]. Proceedings of the VLDB Endowment, 2017, 10(11):1190-1201.
[5] MAYFIELD C, NEVILLE J, PRABHAKAR S. ERACER:A database approach for statistical inference and data cleaning[C]//2010 ACM SIGMOD International Conference on Management of Data. Indianapolis, USA, 2010:75-86.
[6] YAKOUT M, BERTI-ÉQUILLE L, ELMAGARMID A K. Don't be SCAREd:Use scalable automatic repairing with maximal likelihood and bounded changes[C]//2013 ACM SIGMOD International Conference on Management of Data. New York, USA, 2013:553-564.
[7] WANG J N, KRASKA T, FRANKLIN M J, et al. CrowdER:Crowdsourcing entity resolution[J]. Proceedings of the VLDB Endowment, 2012, 5(11):1483-1494.
[8] CHAI C, LI G L, LI J, et al. Cost-effective crowdsourced entity resolution:A partial-order approach[C]//2016 International Conference on Management of Data. San Francisco, USA, 2016:969-984.
[9] 郭志懋, 周傲英. 数据质量和数据清洗研究综述[J]. 软件学报, 2002, 13(11):2076-2082. GUO Z M, ZHOU A Y. Research on data quality and data cleaning:A survey[J]. Journal of Software, 2002, 13(11):2076-2082. (in Chinese)
[10] 杨辅祥, 刘云超, 段智华. 数据清理综述[J]. 计算机应用研究, 2002, 19(3):3-5. YANG F X, LIU Y C, DUAN Z H. An overview of data cleaning[J]. Application Research of Computers, 2002, 19(3):3-5. (in Chinese)
[11] 王曰芬, 章成志, 张蓓蓓, 等. 数据清洗研究综述[J]. 现代图书情报技术, 2007, 2(12):50-56. WANG Y F, ZHANG C Z, ZHANG B B, et al. A survey of data cleaning[J]. New Technology of Library and Information Service, 2007, 2(12):50-56. (in Chinese)
[12] 赵一凡, 卞良, 丛昕. 数据清洗方法研究综述[J]. 软件导刊, 2017(12):222-224. ZHAO Y F, BIAN L, CONG X. Survey of data cleaning method in data preprocessing[J]. Software Guide, 2017(12):222-224. (in Chinese)
[13] ELMAGARMID A K, IPEIROTIS P G, VERYKIOS V S. Duplicate record detection:A survey[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1):1-16.
[14] CHU X, ILYAS I F, KRISHNAN S, et al. Data cleaning:Overview and emerging challenges[C]//2016 International Conference on Management of Data. San Francisco, USA, 2016:2201-2206.
[15] ILYAS I F, CHU X. Trends in cleaning relational data:Consistency and deduplication[J]. Foundations and Trends in Databases, 2015, 5(4):281-393.
[16] FAN W F, GEERTS F, JIA X B, et al. Conditional functional dependencies for capturing data inconsistencies[J]. ACM Transactions on Database Systems (TODS), 2008, 33(2):6.
[17] BOHANNON P, FAN W F, GEERTS F, et al. Conditional functional dependencies for data cleaning[C]//IEEE 23rd International Conference on Data Engineering. Istanbul, Turkey, 2007:746-755.
[18] ABITEBOUL S, HULL R, VIANU V. Foundations of databases:The logical level[M]. Boston, USA:Addison-Wesley Longman Publishing, 1995.
[19] KOUDAS N, SAHA A, SRIVASTAVA D, et al. Metric functional dependencies[C]//IEEE 25th International Conference on Data Engineering. Shanghai, China, 2009:1275-1278.
[20] SONG S X, CHEN L. Differential dependencies:Reasoning and discovery[J]. ACM Transactions on Database Systems (TODS), 2011, 36(3):16.
[21] WANG J N, TANG N. Towards dependable data repairing with fixing rules[C]//2014 ACM SIGMOD International Conference on Management of Data. Snowbird, USA, 2014:457-468.
[22] WANG J N, TANG N. Dependable data repairing with fixing rules[J]. Journal of Data and Information Quality (JDIQ), 2017, 8(3-4):16.
[23] FAN W F, LI J Z, MA S, et al. Towards certain fixes with editing rules and master data[J]. Proceedings of the VLDB Endowment, 2010, 3(1-2):173-184.
[24] FAN W F, LI J Z, MA S, et al. Towards certain fixes with editing rules and master data[J]. The VLDB Journal, 2012, 21(2):213-238.
[25] INTERLANDI M, TANG N. Proof positive and negative in data cleaning[C]//IEEE 31st International Conference on Data Engineering. Seoul, Republic of Korea, 2015:18-29.
[26] HAO S, TANG N, LI G L, et al. Cleaning relations using knowledge bases[C]//IEEE 33rd International Conference on Data Engineering. San Diego, USA, 2017:933-944.
[27] HAO S, TANG N, LI G L, et al. Distilling relations using knowledge bases[J]. The VLDB Journal, 2018, 27(4):497-519.
[28] HUA M, PEI J. DiMaC:A disguised missing data cleaning tool[C]//14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, USA, 2008:1077-1080.
[29] QAHTAN A A, ELMAGARMID A, CASTRO FERNANDEZ R, et al. FAHES:A robust disguised missing values detector[C]//24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. London, UK, 2018:2100-2109.
[30] KÖPCKE H, RAHM E. Frameworks for entity matching:A comparison[J]. Data & Knowledge Engineering, 2010, 69(2):197-210.
[31] HE Q L, LI Z H, ZHANG X. Data deduplication techniques[C]//2010 IEEE International Conference on Future Information Technology and Management Engineering (FITME). Changzhou, China, 2010:430-433.
[32] BAXTER R, CHRISTEN P, CHURCHES T. A comparison of fast blocking methods for record linkage[C]//2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation. Washington, DC, USA, 2003, 3:25-27.
[33] TEJADA S, KNOBLOCK C A, MINTON S. Learning object identification rules for information integration[J]. Information Systems, 2001, 26(8):607-633.
[34] ELFEKY M G, VERYKIOS V S, ELMAGARMID A K. TAILOR:A record linkage toolbox[C]//18th International Conference on Data Engineering. San Jose, USA, 2002.
[35] TEJADA S, KNOBLOCK C A, MINTON S. Learning domain-independent string transformation weights for high accuracy object identification[C]//International Conference on Knowledge Discovery and Data Mining. Edmonton, Canada, 2002:350-359.
[36] CHRISTEN P. Febrl:A freely available record linkage system with a graphical user interface[C]//2nd Australasian Workshop on Health Data and Knowledge Management. Wollongong, Australia, 2008:17-25.
[37] MCCALLUM A, NIGAM K, UNGAR L H. Efficient clustering of high-dimensional data sets with application to reference matching[C]//6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston, USA, 2000:169-178.
[38] BAYARDO R J, MA Y M, SRIKANT R. Scaling up all pairs similarity search[C]//16th International Conference on World Wide Web. Banff, Canada, 2007:131-140.
[39] CHAUDHURI S, GANTI V, KAUSHIK R. A primitive operator for similarity joins in data cleaning[C]//22nd International Conference on Data Engineering. Atlanta, USA, 2006.
[40] XIAO C, WANG W, LIN X M, et al. Efficient similarity joins for near-duplicate detection[J]. ACM Transactions on Database Systems (TODS), 2011, 36(3):15.
[41] ARASU A, RÉ C, SUCIU D. Large-scale deduplication with constraints using Dedupalog[C]//2009 IEEE 25th International Conference on Data Engineering. Shanghai, China, 2009:952-963.
[42] HERNÁNDEZ M A, STOLFO S J. The merge/purge problem for large databases[C]//1995 ACM SIGMOD International Conference on Management of Data. San Jose, USA, 1995:127-138.
[43] LEE M L, LING T W, LOW W L. IntelliClean:A knowledge-based intelligent data cleaner[C]//Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston, USA, 2000:290-294.
[44] WANG J N, LI G L, YU J X, et al. Entity matching:How similar is similar[J]. Proceedings of the VLDB Endowment, 2011, 4(10):622-633.
[45] BILENKO M, MOONEY R J. Adaptive duplicate detection using learnable string similarity measures[C]//Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, DC, USA, 2003:39-48.
[46] PAPENBROCK T, EHRLICH J, MARTEN J, et al. Functional dependency discovery:An experimental evaluation of seven algorithms[J]. Proceedings of the VLDB Endowment, 2015, 8(10):1082-1093.
[47] LIU J X, LI J Y, LIU C F, et al. Discover dependencies from data:A review[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(2):251-264.
[48] HUHTALA Y, KÄRKKÄINEN J, PORKKA P, et al. TANE:An efficient algorithm for discovering functional and approximate dependencies[J]. The Computer Journal, 1999, 42(2):100-111.
[49] YAO H, HAMILTON H J. Mining functional dependencies from data[J]. Data Mining and Knowledge Discovery, 2008, 16(2):197-219.
[50] NOVELLI N, CICCHETTI R. FUN:An efficient algorithm for mining functional and embedded dependencies[C]//International Conference on Database Theory. London, UK, 2001:189-203.
[51] WYSS C, GIANNELLA C, ROBERTSON E. FastFDs:A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances extended abstract[C]//International Conference on Data Warehousing and Knowledge Discovery. Munich, Germany, 2001:101-110.
[52] LOPES S, PETIT J M, LAKHAL L. Efficient discovery of functional dependencies and Armstrong relations[C]//International Conference on Extending Database Technology. Konstanz, Germany, 2000:350-364.
[53] FAN W F, GEERTS F, LI J Z, et al. Discovering conditional functional dependencies[J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(5):683-698.
[54] DE MARCHI F, LOPES S, PETIT J M. Efficient algorithms for mining inclusion dependencies[C]//8th International Conference on Extending Database Technology. Prague, Czech Republic, 2002:464-476.
[55] BOHANNON P, FAN W F, FLASTER M, et al. A cost-based model and effective heuristic for repairing constraints by value modification[C]//2005 ACM SIGMOD International Conference on Management of Data. Baltimore, USA, 2005:143-154.
[56] CHIANG F, MILLER R J. A unified model for data and constraint repair[C]//2011 IEEE 27th International Conference on Data Engineering. Hannover, Germany, 2011:446-457.
[57] CHU X, ILYAS I F, PAPOTTI P. Holistic data cleaning:Putting violations into context[C]//2013 IEEE 29th International Conference on Data Engineering (ICDE). Brisbane, Australia, 2013:458-469.
[58] KOLAHI S, LAKSHMANAN L V S. On approximating optimum repairs for functional dependency violations[C]//12th International Conference on Database Theory. St. Petersburg, Russia, 2009:53-62.
[59] HAO S, TANG N, LI G L, et al. A novel cost-based model for data repairing[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(4):727-742.
[60] ARENAS M, BERTOSSI L, CHOMICKI J. Consistent query answers in inconsistent databases[C]//18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Philadelphia, USA, 1999:68-79.
[61] BRAVO L, BERTOSSI L. Logic programs for consistently querying data integration systems[C]//18th International Joint Conference on Artificial Intelligence. Acapulco, Mexico, 2003:10-15.
[62] CALÍ A, LEMBO D, ROSATI R. On the decidability and complexity of query answering over inconsistent and incomplete databases[C]//22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. San Diego, USA, 2003:260-271.
[63] CALI A, LEMBO D, ROSATI R. Query rewriting and answering under constraints in data integration systems[C]//18th International Joint Conference on Artificial Intelligence. Acapulco, Mexico, 2003:16-21.
[64] CHOMICKI J, MARCINKOWSKI J. Minimal-change integrity maintenance using tuple deletions[J]. Information and Computation, 2005, 197(1-2):90-121.
[65] GERTZ M, LIPECK U W. A diagnostic approach for repairing constraint violations in databases[M]//JAJODIA S, LIST W, MCGREGOR G, et al, eds. Integrity and internal control in information systems. ⅡCIS 1997. Boston, USA:Springer, 1997, 1:89-111
[66] GRECO G, GRECO S, ZUMPANO E. A logical framework for querying and repairing inconsistent databases[J]. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(6):1389-1408.
[67] YAKOUT M, ELMAGARMID A K, NEVILLE J, et al. Guided data repair[J]. Proceedings of the VLDB Endowment, 2011, 4(5):279-289.
[68] CHU X, MORCOS J, ILYAS I F, et al. KATARA:A data cleaning system powered by knowledge bases and crowdsourcing[C]//2015 ACM SIGMOD International Conference on Management of Data. Melbourne, Australia, 2015:1247-1261.
[69] COHEN W, RAVIKUMAR P, FIENBERG S. A comparison of string metrics for matching names and records[C]//KDD Workshop on Data Cleaning and Object Consolidation. Washington, DC, USA, 2003:73-78.
[70] JIANG Y, LI G L, FENG J H, et al. String similarity joins:An experimental evaluation[J]. Proceedings of the VLDB Endowment, 2014, 7(8):625-636.
[71] EBAID A, ELMAGARMID A, ILYAS I F, et al. NADEEF:A generalized data cleaning system[J]. Proceedings of the VLDB Endowment, 2013, 6(12):1218-1221.
[72] CONG G, FAN W F, GEERTS F, et al. Improving data quality:Consistency and accuracy[C]//33rd International Conference on Very Large Data Bases. Vienna, Austria, 2007:315-326.
[73] GEERTS F, MECCA G, PAPOTTI P, et al. The LLUNATIC data-cleaning framework[J]. Proceedings of the VLDB Endowment, 2013, 6(9):625-636.
[74] BESKALES G, ILYAS I F, GOLAB L. Sampling the repairs of functional dependency violations under hard constraints[J]. Proceedings of the VLDB Endowment, 2010, 3(1-2):197-207.
[75] KHAYYAT Z, ILYAS I F, JINDAL A, et al. BIGDANSING:A system for big data cleansing[C]//2015 ACM SIGMOD International Conference on Management of Data. Melbourne, Australia, 2015:1215-1230.
[76] BESKALES G, ILYAS I F, GOLAB L, et al. On the relative trust between inconsistent data and inaccurate constraints[C]//2013 IEEE 29th International Conference on Data Engineering. Brisbane, Australia, 2013:541-552.
[77] VOLKOVS M, CHIANG F, SZLICHTA J, et al. Continuous data cleaning[C]//2014 IEEE 30th International Conference on Data Engineering. Chicago, USA, 2014:244-255.
[78] RISSANEN J. Modeling by shortest data description[J]. Automatica, 1978, 14(5):465-471.
[79] COVER T M, THOMAS J A. Elements of information theory[M]. New York, USA:John Wiley & Sons, 1991.
[80] HE J, VELTRI E, SANTORO D, et al. Interactive and deterministic data cleaning[C]//2016 International Conference on Management of Data. San Francisco, USA, 2016:893-907.
[81] AROCENA P C, GLAVIC B, MECCA G, et al. Messing up with BART:Error generation for evaluating data-cleaning algorithms[J]. Proceedings of the VLDB Endowment, 2015, 9(2):36-47.
[82] ANANTHAKRISHNA R, CHAUDHURI S, GANTI V. Eliminating fuzzy duplicates in data warehouses[C]//28th International Conference on Very Large Databases. Hong Kong, China, 2002:586-597.
[83] WANG J N, KRISHNAN S, FRANKLIN M J, et al. A sample-and-clean framework for fast and accurate query processing on dirty data[C]//2014 ACM SIGMOD International Conference on Management of Data. Snowbird, USA, 2014:469-480.
[84] KRISHNAN S, WANG J N, FRANKLIN M J, et al. SampleClean:Fast and reliable analytics on dirty data[C/OL]//Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 2015:59-75[2018-06-19]. https://www.ocf.berkeley.edu/~sanjayk/old/pubs/sc.pdf.
[85] LI G L, WANG J N, ZHENG Y D, et al. Crowdsourced data management:A survey[C]//2017 IEEE 33rd International Conference on Data Engineering. San Diego, USA, 2017:39-40.
[86] PARK H, WIDOM J. CrowdFill:Collecting structured data from the crowd[C]//2014 ACM SIGMOD International Conference on Management of Data. Snowbird, USA, 2014:577-588.
[87] WHANG S E, LOFGREN P, GARCIA-MOLINA H. Question selection for crowd entity resolution[J]. Proceedings of the VLDB Endowment, 2013, 6(6):349-360.
[88] TONG Y X, CAO C C, ZHANG C J, et al. CrowdCleaner:Data cleaning for multi-version data on the web via crowdsourcing[C]//2014 IEEE 30th International Conference on Data Engineering. Chicago, USA, 2014:1182-1185.
[89] MICHALSKI R S. A theory and methodology of inductive learning[M]//MICHALSKI R S, CARBONELL J G, MITCHELL T M. Machine learning. Berlin, Germany:Springer, 1983:83-134.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
Copyright © Journal of Tsinghua University(Science and Technology), All Rights Reserved.
Powered by Beijing Magtech Co. Ltd