Paper Title

Evaluation of Distributed Data Processing Frameworks in Hybrid Clouds

Authors

Faheem Ullah, Shagun Dhingra, Xiaoyu Xia, M. Ali Babar

Abstract

Distributed data processing frameworks (e.g., Hadoop, Spark, and Flink) are widely used to distribute data among the computing nodes of a cloud. Recently, there have been increasing efforts to evaluate the performance of distributed data processing frameworks hosted in private and public clouds. However, there is a paucity of research on evaluating the performance of these frameworks hosted in a hybrid cloud, an emerging cloud model that integrates private and public clouds to get the best of both worlds. Therefore, in this paper, we evaluate the performance of Hadoop, Spark, and Flink in a hybrid cloud in terms of execution time, resource utilization, horizontal scalability, vertical scalability, and cost. For this study, our hybrid cloud consists of OpenStack (private cloud) and MS Azure (public cloud). We use both batch and iterative workloads for the evaluation. Our results show that in a hybrid cloud (i) the execution time increases as more nodes are borrowed by the private cloud from the public cloud, (ii) Flink outperforms Spark, which in turn outperforms Hadoop in terms of execution time, (iii) Hadoop transfers the largest amount of data among the nodes during workload execution while Spark transfers the least, (iv) all three frameworks scale better horizontally than vertically, and (v) Spark is the least expensive in terms of monetary cost for data processing while Hadoop is the most expensive.
