Please wait a minute...
 首页  期刊介绍 期刊订阅 联系我们 横山亮次奖 百年刊庆
 
最新录用  |  预出版  |  当期目录  |  过刊浏览  |  阅读排行  |  下载排行  |  引用排行  |  横山亮次奖  |  百年刊庆
清华大学学报(自然科学版)  2018, Vol. 58 Issue (12): 1059-1065    DOI: 10.16511/j.cnki.qhdxxb.2018.21.022
  计算机科学与技术 本期目录 | 过刊浏览 | 高级检索 |
基于动态获取高频率键的MapReduce性能优化算法
李建江1, 滑水亮2, 吴杰3, 张凯1
1. 北京科技大学 计算机科学与技术系, 北京 100083, 中国;
2. 国家电网张家口供电公司信通分公司, 张家口 075000, 中国;
3. 天普大学 计算机与信息科学系, 费城 19122, 美国
Performance optimization algorithm for MapReduce based on obtaining frequent keys
LI Jianjiang1, HUA Shuiliang2, WU Jie3, ZHANG Kai1
1. Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing 100083, China;
2. State Grid Zhangjiakou Power Supply Company & Telecommunication Branch, Zhangjiakou 075000, China;
3. Department of Computer and Information Sciences, Temple University, Philadelphia 19122, USA
全文: PDF(957 KB)  
输出: BibTeX | EndNote (RIS)      
摘要 在云计算技术领域中,MapReduce能够帮助人们快速处理海量数据,因此在学术界以及工业界越来越受到重视。但是MapReduce在处理以文本为中心的应用时,中间结果中数据重复较多。针对该情况,已有的高频率缓冲(frequency buffering,FB)算法提出在环形内存缓冲之前添加哈希表,并将高频率键存储在哈希表中。该算法通过采样来实现,有额外开销并且统计出的高频率键并不一定准确。该文提出一种基于动态获取高频率键的MapReduce性能优化算法,通过在环形内存缓冲之前增加计数Bloom过滤器(counting Bloom filter,CBF)和哈希表,将高频率键动态地存储在哈希表中。该算法获得的高频率键更准确,同时大大减少了数据排序和磁盘I/O的开销。实际测试结果表明:该算法明显提高了作业的执行速度,比原始MapReduce提高17.04%,比FB算法提高9.31%。
服务
把本文推荐给朋友
加入引用管理器
E-mail Alert
RSS
作者相关文章
李建江
滑水亮
吴杰
张凯
关键词 MapReduce高频率键性能优化计数Bloom过滤器    
Abstract:MapReduce is getting much attention in academia and industry for use in cloud computing to quickly deal with huge amounts of data. However, when MapReduce deals with text-centric applications, the algorithm generates is large amount of duplicate data in the intermediate results that increases the run time. A frequency buffering (FB) algorithm was used to add a Hash table before the ring memory to store frequent keys in a Hash table. However, since the algorithm is implemented by sampling, the algorithm may not accurately estimate the overhead and the frequent keys. Therefore, this study added a performance optimization algorithm to MapReduce to obtain the frequent keys by adding a counting Bloom filter (CBF) and a Hash table to dynamically filter the frequent keys before storing them in the ring memory. This algorithm more accurately identifies the frequent keys and greatly reduces the data sorting overhead and the disk I/O. Tests show that this performance optimization algorithm for MapReduce for obtaining the frequent keys significantly improves the execution speed by 17.04% compared to the original MapReduce and 9.31% higher than the frequency buffering algorithm.
Key wordsMapReduce    frequent key    performance optimization    counting Bloom filter
收稿日期: 2018-06-19      出版日期: 2018-12-13
基金资助:国家重点研发计划项目(2017YFB0202104,2017YFB0202003)
通讯作者: 张凯,硕士研究生,E-mail:zhchhz@163.com     E-mail: zhchhz@163.com
引用本文:   
李建江, 滑水亮, 吴杰, 张凯. 基于动态获取高频率键的MapReduce性能优化算法[J]. 清华大学学报(自然科学版), 2018, 58(12): 1059-1065.
LI Jianjiang, HUA Shuiliang, WU Jie, ZHANG Kai. Performance optimization algorithm for MapReduce based on obtaining frequent keys. Journal of Tsinghua University(Science and Technology), 2018, 58(12): 1059-1065.
链接本文:  
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.21.022  或          http://jst.tsinghuajournals.com/CN/Y2018/V58/I12/1059
  图1 本文提出的优化算法
  图2 CBF中元素的增加
  图3 TCBP 算法伪代码
  图4 DCBH 算法伪代码
  表1 实验环境
  图5 流经 CBF和哈希表的数据占比对 本文优化程序运行时间的影响
  图6 3种 MapReduce程序的运行时间对比
[1] HILLS T T, NOGUCHI T, GIBBERT M. Information overload or search-amplified risk set size and order effects on decisions from experience[J]. Psychonomic Bulletin & Review, 2013, 20(5):1023-1031.
[2] ZIKOPOULOS P, EATON C. Understanding big data:Analytics for enterprise class Hadoop and streaming data[M]. New York:McGraw-Hill Osborne Media, 2011.
[3] HERODOTOU H, BABU S. Profiling, what-if analysis, and cost-based optimization of mapreduce programs[J]. Proceedings of the VLDB Endowment, 2011, 4(11):1111-1122.
[4] FLORATOU A, PATEL J M, SHEKITA E J, et al. Column-oriented storage techniques for MapReduce[J]. Proceedings of the VLDB Endowment, 2011, 4(7):419-429.
[5] HE Y, LEE R, HUAI Y, et al. RCFile:A fast and space-efficient data placement structure in MapReduce-based warehouse systems[J]. 2011, 83(1):1199-1208.
[6] MAX, FAN X, LIU J, et al. vLocality:Revisiting data locality for MapReduce in virtualized clouds[J]. IEEE Network, 2017, 31(1):28-35.
[7] POLO J, CARRERA D, BECERRA Y, et al. Performance-driven task co-scheduling for mapreduce environments[C]//2010 IEEE Network Operations and Management Symposium-NOMS 2010. New York:IEEE, 2010:373-380.
[8] CHENG D, RAO J, GUO Y, et al. Improving performance of heterogeneous MapReduce clusters with adaptive task tuning[J]. IEEE Transactions on Parallel & Distributed Systems, 2017, 28(3):774-786.
[9] ZIPF G K. Selected studies of the principle of relative frequency in language[J]. Language, 1933, 9(1):89-92.
[10] RATTANAOPAS K, KAEWKEEREE S. Improving Hadoop MapReduce performance with data compression:A study using wordcount job[C]//International Conference on Electrical Engineering/electronics, Computer, Telecommunications and Information Technology. New York:IEEE, 2017:564-567.
[11] ISSA J. Performance evaluation and estimation model using regression method for Hadoop wordcount[J]. IEEE Access, 2015, 3:2784-2793.
[12] HSIAO C H, CAFARELLA M, NARAYANASAMY S. Reducing MapReduce abstraction costs for text-centric applications[C]//201443rd International Conference on Parallel Processing. New York:IEEE, 2014:40-49.
[13] NYANG D H. Counting bloom filter:U.S. Patent Application 15/021, 133[P]. 2013-10-14.
[14] ROTTENSTREICH O, KANIZO Y, KESLASSY I. The variable increment counting Bloom filter[J]. IEEE/ACM Transactions on Networking, 2014, 22(4):1092-1105.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
版权所有 © 《清华大学学报(自然科学版)》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn