As-Stream: An intelligent operator parallelization strategy for fluctuating data streams
LI Wei1,2, LI Chenglong3, YANG Jiahai3
1. Information Technology Center, Tsinghua University, Beijing 100084, China; 2. China University of Geosciences, Beijing 100083, China; 3. Institute for Network Science and Cyberspace, Tsinghua University, Beijing 100084, China
Abstract: Many studies have used online resource management to optimize stream computing for fluctuating data streams, but they have not optimized operator parallelism at the streaming application level. In Apache Storm, for example, an operator's parallelism cannot be adjusted dynamically once it has been set. This paper presents As-Stream, an intelligent operator parallelization strategy for fluctuating data streams that significantly improves the performance of the stream computing platform. The method tunes parameters in real time using unsupervised learning and self-adaptive analysis in an elastic intelligent monitoring module. As-Stream includes algorithms for parallel bottleneck identification, parameter plan generation, parameter migration conversion, and parameter migration scheduling. The system was implemented on Apache Storm and tested extensively in a real distributed stream computing environment. The results show that As-Stream significantly outperforms the existing default scheduling strategies: with sufficient resources, the average throughput is increased 2.4-fold, and with limited resources, the average latency is reduced by 44%.
LI Wei, LI Chenglong, YANG Jiahai. As-Stream: An intelligent operator parallelization strategy for fluctuating data streams. Journal of Tsinghua University (Science and Technology), 2022, 62(12): 1851-1863.