Paper Title

PipeSim: Trace-driven Simulation of Large-Scale AI Operations Platforms

Authors

Thomas Rausch, Waldemar Hummer, Vinod Muthusamy

Abstract

Operationalizing AI has become a major endeavor in both research and industry. Automated, operationalized pipelines that manage the AI application lifecycle will form a significant part of tomorrow's infrastructure workloads. To optimize operations of production-grade AI workflow platforms we can leverage existing scheduling approaches, yet it is challenging to fine-tune operational strategies that achieve application-specific cost-benefit tradeoffs while catering to the specific domain characteristics of machine learning (ML) models, such as accuracy, robustness, or fairness. We present a trace-driven simulation-based experimentation and analytics environment that allows researchers and engineers to devise and evaluate such operational strategies for large-scale AI workflow systems. Analytics data from a production-grade AI platform developed at IBM are used to build a comprehensive simulation model. Our simulation model describes the interaction between pipelines and system infrastructure, and how pipeline tasks affect different ML model metrics. We implement the model in a standalone, stochastic, discrete event simulator, and provide a toolkit for running experiments. Synthetic traces are made available for ad-hoc exploration as well as statistical analysis of experiments to test and examine pipeline scheduling, cluster resource allocation, and similar operational mechanisms.
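
To make the phrase "stochastic, discrete event simulator" concrete, the following is a minimal sketch of the general technique, not PipeSim's actual implementation. It assumes pipelines arrive with exponentially distributed inter-arrival times, each occupies one slot of a fixed-capacity cluster for an exponentially distributed duration, and waiting pipelines are started in FIFO order. The function name `simulate` and all parameter values are hypothetical and chosen only for illustration.

```python
import heapq
import random
from collections import deque


def simulate(num_pipelines=200, capacity=2, mean_interarrival=1.0,
             mean_duration=1.5, seed=42):
    """Simulate pipelines competing for a fixed number of cluster slots.

    All parameters are illustrative assumptions, not values from PipeSim.
    Returns a list of (arrival, start, finish) tuples, one per pipeline.
    """
    rng = random.Random(seed)
    events = []  # min-heap of (time, sequence, kind, pipeline_id)
    seq = 0      # unique tie-breaker so heap entries always compare

    # Pre-generate stochastic arrivals (exponential inter-arrival times).
    t = 0.0
    arrival = {}
    for pid in range(num_pipelines):
        t += rng.expovariate(1.0 / mean_interarrival)
        arrival[pid] = t
        heapq.heappush(events, (t, seq, "arrive", pid))
        seq += 1

    waiting = deque()  # FIFO queue of pipelines waiting for a slot
    free = capacity
    start, finish = {}, {}

    # Main loop: always process the earliest pending event next.
    while events:
        now, _, kind, pid = heapq.heappop(events)
        if kind == "arrive":
            waiting.append(pid)
        else:  # "finish": the pipeline releases its slot
            finish[pid] = now
            free += 1
        # Scheduling policy (FIFO): start queued pipelines while slots are free.
        while free > 0 and waiting:
            nxt = waiting.popleft()
            free -= 1
            start[nxt] = now
            duration = rng.expovariate(1.0 / mean_duration)
            heapq.heappush(events, (now + duration, seq, "finish", nxt))
            seq += 1

    return [(arrival[p], start[p], finish[p]) for p in range(num_pipelines)]


trace = simulate()
mean_wait = sum(s - a for a, s, _ in trace) / len(trace)
print(f"mean wait: {mean_wait:.2f}")
```

Swapping the FIFO loop for a different policy, or changing `capacity`, and comparing the resulting synthetic traces is the kind of experiment the abstract describes, here reduced to a toy scale.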
