Paper Title
Lotaru: Locally Estimating Runtimes of Scientific Workflow Tasks in Heterogeneous Clusters
Paper Authors
Paper Abstract
Many scientific workflow scheduling algorithms need to know task runtimes a priori to schedule efficiently. In heterogeneous cluster infrastructures, this problem is aggravated because these runtimes are required for each task-node pair. Using historical data is often not feasible because logs are typically not retained indefinitely and both workloads and infrastructure change. In contrast, online methods, which predict task runtimes on specific nodes while the workflow is running, have to cope with the lack of example runs, especially during start-up. In this paper, we present Lotaru, a novel online method for locally estimating task runtimes in scientific workflows on heterogeneous clusters. Lotaru first profiles all nodes of a cluster with a set of short-running and uniform microbenchmarks. Next, it runs the workflow to be scheduled on the user's local machine with drastically reduced data to determine important task characteristics. Based on these measurements, Lotaru learns a Bayesian linear regression model to predict a task's runtime given the input size and finally adjusts the predicted runtime specifically for each task-node pair in the cluster based on the microbenchmark results. Due to its Bayesian approach, Lotaru can also compute robust uncertainty estimates and provide them as an input for advanced scheduling methods. Our evaluation with five real-world scientific workflows and different datasets shows that Lotaru significantly outperforms the baselines in terms of prediction errors for homogeneous and heterogeneous clusters.
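The pipeline described in the abstract (fit a Bayesian linear regression from input size to runtime on local runs, then adjust the prediction per node using microbenchmark results) can be illustrated with a minimal sketch. This is not Lotaru's implementation. The function names, the fixed prior and noise precisions `alpha` and `beta`, the sample measurements, and the simple score-ratio adjustment are all assumptions made for illustration; the abstract does not specify how the adjustment is actually computed.

```python
import numpy as np

def fit_bayesian_linear(sizes, runtimes, alpha=1e-6, beta=1.0):
    """Posterior over weights w = (intercept, slope).

    Assumed model (not taken from the paper):
      runtime ~ N(w0 + w1 * size, 1/beta), prior w ~ N(0, I/alpha).
    Returns the posterior mean m and covariance S.
    """
    x = np.asarray(sizes, dtype=float)
    y = np.asarray(runtimes, dtype=float)
    X = np.column_stack([np.ones_like(x), x])      # design matrix [1, size]
    S = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
    m = beta * S @ X.T @ y
    return m, S

def predict(m, S, size, beta=1.0):
    """Predictive mean and standard deviation of the local runtime."""
    phi = np.array([1.0, float(size)])
    mean = float(phi @ m)
    var = 1.0 / beta + float(phi @ S @ phi)        # noise + weight uncertainty
    return mean, float(np.sqrt(var))

def adjust_for_node(mean, std, local_score, node_score):
    """Hypothetical adjustment: scale by the ratio of benchmark scores.

    Assumes higher score = faster machine, so a node with twice the
    local score is predicted to need half the time.
    """
    factor = local_score / node_score
    return mean * factor, std * factor

if __name__ == "__main__":
    # Made-up local measurements on downsampled inputs (GB, seconds).
    sizes = [1.0, 2.0, 3.0, 4.0]
    runtimes = [5.0, 8.0, 11.0, 14.0]
    m, S = fit_bayesian_linear(sizes, runtimes)
    local_mean, local_std = predict(m, S, size=10.0)
    node_mean, node_std = adjust_for_node(local_mean, local_std,
                                          local_score=100.0, node_score=200.0)
    print(f"local: {local_mean:.2f} s +/- {local_std:.2f}")
    print(f"node:  {node_mean:.2f} s +/- {node_std:.2f}")
```

The predictive standard deviation is the kind of uncertainty estimate that the abstract says can be passed to a scheduler. A fuller treatment would also infer the noise precision rather than fixing `beta`.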