Please wait a minute...
 首页  期刊介绍 期刊订阅 联系我们 横山亮次奖 百年刊庆
最新录用  |  预出版  |  当期目录  |  过刊浏览  |  阅读排行  |  下载排行  |  引用排行  |  横山亮次奖  |  百年刊庆
清华大学学报(自然科学版)  2020, Vol. 60 Issue (10): 822-828    DOI: 10.16511/j.cnki.qhdxxb.2020.22.004
  专题:容错计算 本期目录 | 过刊浏览 | 高级检索 |
杨李杨, 陈思远, 张展, 朱和一, 左德承
哈尔滨工业大学 计算机科学与技术学院, 哈尔滨 150001
Time-aware grouping algorithm for distributed stream processing systems
YANG Liyang, CHEN Siyuan, ZHANG Zhan, ZHU Heyi, ZUO Decheng
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
全文: PDF(2424 KB)  
输出: BibTeX | EndNote (RIS)      
摘要 随着对实时数据流处理需求的增加,分布式流处理系统的发展也越来越受到关注。大量的倾斜的数据流以及复杂分布式系统的异构性对当前的分布式流处理系统的分组策略提出了挑战。目前已有的分布式流处理分组策略通常关注并行实例之间元组数量的均衡性,而忽视了系统异构性对分组策略造成的影响。该文提出了一种时间感知分组算法,通过对分布式流处理系统存在的网络异构性和处理能力异构性的分析,综合考虑流处理系统中各下游算子实例的处理时间以及上游算子与下游算子之间的通信时间,并根据键值的频率不同制定不同的路由策略,在较小的开销下使系统达到负载均衡。在Apache Flink分布式流处理系统上进行的实验结果表明:时间感知分组算法比已有的分组算法在系统吞吐量上提高了10%,在平均处理延迟上降低了33%。
E-mail Alert
关键词 流处理分组算法时间感知    
Abstract:The increasing demand for real-time data stream processing is driving the development of distributed stream processing systems. The large amount of skew data streams and the heterogeneity of complex distributed systems pose challenges to the current grouping strategies of distributed stream processing systems. The existing distributed stream processing grouping strategies usually focus on balancing the number of tuples between parallel instances, while ignoring the impact of system heterogeneity on the grouping strategy. This paper presents a time-aware grouping algorithm that analyzes the network heterogeneity and the processing capability in a distributed stream processing system that considers the processing time of each downstream operator instance in the stream processing system. The algorithm also takes into account the communication time between the upstream and downstream operators with various routing strategies formulated according to the frequency of the key, so that the system achieves load balancing with little overhead. Tests on an Apache Flink distributed stream processing system show that the time-aware grouping algorithm increases the throughput by 10% while the average processing latency is reduced by 33% compared to the existing grouping algorithm.
Key wordsstream processing    grouping algorithm    time-aware
收稿日期: 2019-09-15      出版日期: 2020-07-09
杨李杨, 陈思远, 张展, 朱和一, 左德承. 分布式流处理系统的时间感知分组算法[J]. 清华大学学报(自然科学版), 2020, 60(10): 822-828.
YANG Liyang, CHEN Siyuan, ZHANG Zhan, ZHU Heyi, ZUO Decheng. Time-aware grouping algorithm for distributed stream processing systems. Journal of Tsinghua University(Science and Technology), 2020, 60(10): 822-828.
链接本文:  或
[1] TOSHNIWAL A, TANEJA S, SHUKLA A, et al. Storm@Twitter[C]//Özsu M T. Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. New York, USA, 2014:147-156.
[2] CARBONE P, KATSIFODIMOS A, EWEN S, et al. Apache FlinkTM:Stream and batch processing in a single engine[J]. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015, 38(4):28-38.
[3] BABCOCK B, BABU S, DATAR M, et al. Models and issues in data stream systems[C]//Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. New York, USA, 2002:1-16.
[4] KATSIPOULAKIS N R, LABRINIDIS A, CHRYSANTHIS P K. A holistic view of stream partitioning costs[J]. Proceedings of the VLDB Endowment, 2017, 10(11):1286-1297.
[5] FANG J H, CHAO P F, ZHANG R, et al. Integrating workload balancing and fault tolerance in distributed stream processing system[J]. World Wide Web, 2019:1-26. DOI:10.1007/s11280-018-0656-0.
[6] GEDIK B. Partitioning functions for stateful data parallelism in stream processing[J]. The VLDB Journal, 2014, 23(4):517-539.
[7] NASIR M A U, MORALES G D F, GARCIA-SORIANO D, et al. The power of both choices:Practical load balancing for distributed stream processing engines[C]//2015 IEEE 31st International Conference on Data Engineering. Seoul, South Korea, 2015:137-148.
[8] RIVETTI N, QUERZONI L, ANCEAUME E, et al. Efficient key grouping for near-optimal load balancing in stream processing systems[C]//Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems. New York, USA, 2015:80-91.
[9] CHEN F, WU S, JIN H. Network-aware grouping in distributed stream processing systems[C]//Proceedings of the 18th International Conference on Algorithms and Architectures for Parallel Processing. Guangzhou, China, 2018:3-18.
[10] PACACI A, ÖZSU M T. Distribution-aware stream partitioning for distributed stream processing systems[C]//Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond. New York, USA, 2018:1-10.
[11] NASIR M A U, HORII H, SERAFINI M, et al. Load balancing for skewed streams on heterogeneous cluster[Z]. arXiv preprint. arXiv:1705.09073, 2017.
[12] METWALLY A, AGRAWAL D, EL ABBADI A. Efficient computation of frequent and top-k elements in data streams[C]//Proceedings of the 10th International Conference on Database Theory. Edinburgh, UK, 2005:398-412.
[13] URDANETA G, PIERRE G, VAN STEEN M. Wikipedia workload analysis for decentralized hosting[J]. Computer Networks, 2009, 53(11):1830-1845.
No related articles found!
Full text



版权所有 © 《清华大学学报(自然科学版)》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持