The MIT Supercloud Workload Classification Challenge

Paper Title

The MIT Supercloud Workload Classification Challenge

Authors

Tang, Benny J., Chen, Qiqi, Weiss, Matthew L., Frey, Nathan, McDonald, Joseph, Bestor, David, Yee, Charles, Arcand, William, Byun, Chansup, Edelman, Daniel, Hubbell, Matthew, Jones, Michael, Kepner, Jeremy, Klein, Anna, Michaleas, Adam, Michaleas, Peter, Milechin, Lauren, Mullen, Julia, Prout, Andrew, Reuther, Albert, Rosa, Antonio, Bowne, Andrew, McEvoy, Lindsey, Li, Baolin, Tiwari, Devesh, Gadepally, Vijay, Samsi, Siddharth

Abstract

High-Performance Computing (HPC) centers and cloud providers support an increasingly diverse set of applications on heterogeneous hardware. As Artificial Intelligence (AI) and Machine Learning (ML) workloads have become an increasingly large share of compute workloads, new approaches to optimized resource usage, allocation, and deployment of new AI frameworks are needed. By identifying compute workloads and their utilization characteristics, HPC systems may be able to better match available resources with application demand. By leveraging datacenter instrumentation, it may be possible to develop AI-based approaches that can identify workloads and provide feedback to researchers and datacenter operators for improving operational efficiency. To enable this research, we released the MIT Supercloud Dataset, which provides detailed monitoring logs from the MIT Supercloud cluster. This dataset includes CPU and GPU usage by jobs, memory usage, and file system logs. In this paper, we present a workload classification challenge based on this dataset. We introduce a labelled dataset that can be used to develop new approaches to workload classification and present initial results based on existing approaches. The goal of this challenge is to foster algorithmic innovations in the analysis of compute workloads that can achieve higher accuracy than existing methods. Data and code will be made publicly available via the Datacenter Challenge website: https://dcc.mit.edu.
