Please wait a minute...
 首页  期刊介绍 期刊订阅 联系我们 横山亮次奖 百年刊庆
 
最新录用  |  预出版  |  当期目录  |  过刊浏览  |  阅读排行  |  下载排行  |  引用排行  |  横山亮次奖  |  百年刊庆
清华大学学报(自然科学版)  2017, Vol. 57 Issue (6): 586-590    DOI: 10.16511/j.cnki.qhdxxb.2017.26.023
  计算机科学与技术 本期目录 | 过刊浏览 | 高级检索 |
适用于海量数据应用的多维Hash表结构
吴泉源, 彭灿, 郑毅, 卜俊丽
国防科技大学 计算机学院, 长沙 410073
Multi-dimensional Hash table structure for massive data applications
WU Quanyuan, PENG Can, ZHENG Yi, BU Junli
School of Computer, National University of Defense Technology, Changsha 410073, China
全文: PDF(1155 KB)  
输出: BibTeX | EndNote (RIS)      
摘要 传统的Hash表通过对目标数据进行Hash计算,可以实现数据的快速存取与检索。为了保持较好的存储性能,需要使整个Hash表保持疏松的状态,从而牺牲掉10%~25%的空间。这对于海量数据存储而言,是一种巨大的空间浪费。该文提出一种多维Hash表结构,通过增加Hash表在逻辑上的维度,大大降低了Hash表的冲突率,实现了在较高的填充率下获得较满意的性能。实验结果表明:在千万的数据量级上,二维Hash表的冲突率比传统Hash表的减小2~4个数量级,总体性能则提升了1个数量级。该文还在原有填充率的基础上,提出失效率的概念,进一步完善和统一了Hash表性能评价指标。
服务
把本文推荐给朋友
加入引用管理器
E-mail Alert
RSS
作者相关文章
吴泉源
彭灿
郑毅
卜俊丽
关键词 多维Hash表海量数据存储失效率    
Abstract:Traditional Hash table can quickly locate the target by calculating the Hash value of the target data to enable fast data access and retrieval. Good storage performance requires that the Hash table maintains a loose state by sacrificing 10%-25% of the space. This is a tremendous waste of space in massive data storage systems. This paper presents a multi-dimensional Hash table structure that by increasing the logical dimension of the Hash table to significantly reduce the collision rate in the Hash table for satisfactory performance with a high filling rate. Tests show that with ten million entries, the collision rate of a two-dimensional Hash table is 2-4 orders of magnitude lower than a traditional Hash table and the overall performance is improved by 1 order of magnitude. In addition, a failure rate concept is proposed to improve Hash table performance evaluations.
Key wordsmulti-dimension    Hash table    massive data storage    failure rate
收稿日期: 2016-06-28      出版日期: 2017-06-15
ZTFLH:  TP311.12  
引用本文:   
吴泉源, 彭灿, 郑毅, 卜俊丽. 适用于海量数据应用的多维Hash表结构[J]. 清华大学学报(自然科学版), 2017, 57(6): 586-590.
WU Quanyuan, PENG Can, ZHENG Yi, BU Junli. Multi-dimensional Hash table structure for massive data applications. Journal of Tsinghua University(Science and Technology), 2017, 57(6): 586-590.
链接本文:  
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2017.26.023  或          http://jst.tsinghuajournals.com/CN/Y2017/V57/I6/586
  图1 一维Hash表与二维Hash表示意图
  表1 Hash表的平均查找长度
  图2 JavaHash表与二维Hash表Hash 函数冲突率对比
  表2 JavaHash表平均存入时间
  表3 二维Hash表平均存入时间
  图3 扩充率为0.75时JavaHash表与二维Hash表的平均存入时间对比
  图4 扩充率为1.0时JavaHash表与二维Hash表的平均存入时间对比
[1] Zwolenski M, Weatherill L. The digital universe rich data and the increasing value of the Internet of things[J]. Australian Journal of Telecommunications & the Digital Economy, 2014, 2(3):471-479.
[2] DeCandia G, Hastorun D, Jampani M, et al. Dynamo:Amazon's highly available key-value store[J]. ACM SIGOPS Operating Systems Review, 2007, 41(6):205-220.
[3] Chang F, Dean J, Ghemawat S, et al. Bigtable:A distributed storage system for structured data[J]. ACM Transactions on Computer Systems (TOCS), 2008, 26(2):4-4.
[4] Lakshman A, Malik P. Cassandra:A decentralized structured storage system[J]. ACM SIGOPS Operating Systems Review, 2010, 44(2):35-40.
[5] Henke C, Schmoll C, Zseby T. Empirical evaluation of Hash functions for multipoint measurements[J]. ACM SIGCOMM Computer Communication Review, 2008, 38(3):39-50.
[6] Christian H, Carsten S, Tanja Z. Empirical evaluation of Hash functions for packetID generation in sampled multipoint measurements[C]//Passive and Active Network Measurement, International Conference 2009. Seoul, Korea:Proceedings, 2009:197-206.
[7] Rivest R L. The MD5 message-digest algorithm[J]. Network Working Group IETF, 1992, 473(10):303-311.
[8] Eastlake Rd D, Jones P. US Secure Hash Algorithm 1(SHA1)[M]. Los Argeles County:RFC Editor, 2001.
[9] Peterson W W, Brown D T. Cyclic codes for error detection[J]. Proceedings of the IRE, 1961, 49(1):228-235.
[10] Enbody R J, Du H C. Dynamic hashing schemes[J]. ACM Computing Surveys (CSUR), 1988, 20(2):850-113.
[11] Larson P A. Dynamic hashing[J]. BIT Numerical Mathematics, 1978, 18(2):184-201.
[12] Singh B, Yadav I, Agarwal S, et al. An efficient word searching algorithm through splitting and hashing the offline text[C]//Advances in Recent Technologies in Communication and Computing. Kottayam, India:IEEE Press, 2009:387-389.
[1] 蔡鸿明, 姜祖海, 姜丽红. 分布式环境下业务模型的数据存储及访问框架[J]. 清华大学学报(自然科学版), 2017, 57(6): 569-574.
[2] 郭少茹, 张虎, 钱揖丽, 李茹, 杨陟卓, 顾兆军, 马淑晖. 面向高考阅读理解的句子语义相关度[J]. 清华大学学报(自然科学版), 2017, 57(6): 575-579,585.
[3] 周树桥, 李铎. 双冗余控制器的失效状态分析及面向高可靠度的设计[J]. 清华大学学报(自然科学版), 2017, 57(4): 399-404.
[4] 林正航, 强茂山, 袁尚南. 大型国际工程承包商核心能力模型评价[J]. 清华大学学报(自然科学版), 2015, 55(12): 1309-1314,1323.
[5] 韩心慧, 肖祥全, 张建宇, 刘丙双, 张缘. 基于社交关系的DHT网络Sybil攻击防御[J]. 清华大学学报(自然科学版), 2014, 54(1): 1-7.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
版权所有 © 《清华大学学报(自然科学版)》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn