Abstract:The increasing demand for real-time data stream processing is driving the development of distributed stream processing systems. The large amount of skew data streams and the heterogeneity of complex distributed systems pose challenges to the current grouping strategies of distributed stream processing systems. The existing distributed stream processing grouping strategies usually focus on balancing the number of tuples between parallel instances, while ignoring the impact of system heterogeneity on the grouping strategy. This paper presents a time-aware grouping algorithm that analyzes the network heterogeneity and the processing capability in a distributed stream processing system that considers the processing time of each downstream operator instance in the stream processing system. The algorithm also takes into account the communication time between the upstream and downstream operators with various routing strategies formulated according to the frequency of the key, so that the system achieves load balancing with little overhead. Tests on an Apache Flink distributed stream processing system show that the time-aware grouping algorithm increases the throughput by 10% while the average processing latency is reduced by 33% compared to the existing grouping algorithm.
[1] TOSHNIWAL A, TANEJA S, SHUKLA A, et al. Storm@Twitter[C]//Özsu M T. Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. New York, USA, 2014:147-156. [2] CARBONE P, KATSIFODIMOS A, EWEN S, et al. Apache FlinkTM:Stream and batch processing in a single engine[J]. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015, 38(4):28-38. [3] BABCOCK B, BABU S, DATAR M, et al. Models and issues in data stream systems[C]//Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. New York, USA, 2002:1-16. [4] KATSIPOULAKIS N R, LABRINIDIS A, CHRYSANTHIS P K. A holistic view of stream partitioning costs[J]. Proceedings of the VLDB Endowment, 2017, 10(11):1286-1297. [5] FANG J H, CHAO P F, ZHANG R, et al. Integrating workload balancing and fault tolerance in distributed stream processing system[J]. World Wide Web, 2019:1-26. DOI:10.1007/s11280-018-0656-0. [6] GEDIK B. Partitioning functions for stateful data parallelism in stream processing[J]. The VLDB Journal, 2014, 23(4):517-539. [7] NASIR M A U, MORALES G D F, GARCIA-SORIANO D, et al. The power of both choices:Practical load balancing for distributed stream processing engines[C]//2015 IEEE 31st International Conference on Data Engineering. Seoul, South Korea, 2015:137-148. [8] RIVETTI N, QUERZONI L, ANCEAUME E, et al. Efficient key grouping for near-optimal load balancing in stream processing systems[C]//Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems. New York, USA, 2015:80-91. [9] CHEN F, WU S, JIN H. Network-aware grouping in distributed stream processing systems[C]//Proceedings of the 18th International Conference on Algorithms and Architectures for Parallel Processing. Guangzhou, China, 2018:3-18. [10] PACACI A, ÖZSU M T. Distribution-aware stream partitioning for distributed stream processing systems[C]//Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond. New York, USA, 2018:1-10. [11] NASIR M A U, HORII H, SERAFINI M, et al. Load balancing for skewed streams on heterogeneous cluster[Z]. arXiv preprint. arXiv:1705.09073, 2017. [12] METWALLY A, AGRAWAL D, EL ABBADI A. Efficient computation of frequent and top-k elements in data streams[C]//Proceedings of the 10th International Conference on Database Theory. Edinburgh, UK, 2005:398-412. [13] URDANETA G, PIERRE G, VAN STEEN M. Wikipedia workload analysis for decentralized hosting[J]. Computer Networks, 2009, 53(11):1830-1845.