PipeSim: Trace-driven Simulation of Large-Scale AI Operations Platforms

Paper Title

PipeSim: Trace-driven Simulation of Large-Scale AI Operations Platforms

Authors

Thomas Rausch, Waldemar Hummer, Vinod Muthusamy

Abstract

Operationalizing AI has become a major endeavor in both research and industry. Automated, operationalized pipelines that manage the AI application lifecycle will form a significant part of tomorrow's infrastructure workloads. To optimize operations of production-grade AI workflow platforms we can leverage existing scheduling approaches, yet it is challenging to fine-tune operational strategies that achieve application-specific cost-benefit tradeoffs while catering to the specific domain characteristics of machine learning (ML) models, such as accuracy, robustness, or fairness. We present a trace-driven simulation-based experimentation and analytics environment that allows researchers and engineers to devise and evaluate such operational strategies for large-scale AI workflow systems. Analytics data from a production-grade AI platform developed at IBM are used to build a comprehensive simulation model. Our simulation model describes the interaction between pipelines and system infrastructure, and how pipeline tasks affect different ML model metrics. We implement the model in a standalone, stochastic, discrete event simulator, and provide a toolkit for running experiments. Synthetic traces are made available for ad-hoc exploration as well as statistical analysis of experiments to test and examine pipeline scheduling, cluster resource allocation, and similar operational mechanisms.
