Paper Title
Profiling and Improving the PyTorch Dataloader for high-latency Storage: A Technical Report
Paper Authors
Paper Abstract
A growing number of Machine Learning frameworks have recently made Deep Learning accessible to a wider audience of engineers, scientists, and practitioners by allowing straightforward use of complex neural network architectures and algorithms. However, since deep learning is rapidly evolving, not only through theoretical advancements but also with respect to hardware and software engineering, ML frameworks often lose backward compatibility and introduce technical debt that can lead to bottlenecks and sub-optimal resource utilization. Moreover, the focus is in most cases not on deep learning engineering, but rather on new models and theoretical advancements. In this work, however, we focus on engineering, specifically on the data loading pipeline in the PyTorch framework. We designed a series of benchmarks that outline performance issues of certain steps in the data loading process. Our findings show that for classification tasks that involve loading many files, such as images, the training wall-time can be significantly improved. With our new, modified ConcurrentDataloader we achieve improvements in GPU utilization and reduce batch loading time by up to 12X. This allows the use of cloud-based, S3-like object storage for datasets, with training times comparable to storing the datasets on local drives.
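The core idea behind the concurrent dataloader, fetching the items of a batch in parallel so that per-item storage latency overlaps instead of accumulating, can be illustrated with a minimal sketch. This is not the paper's actual implementation; it uses a plain `ThreadPoolExecutor` and a `time.sleep` stand-in (`load_item`) for a high-latency (e.g. S3-like) storage read, both of which are assumptions for illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_item(i):
    """Simulate fetching one sample from high-latency (e.g. S3-like) storage."""
    time.sleep(0.05)  # stand-in for a per-item network round-trip
    return i

batch_indices = list(range(16))

# Baseline: items of a batch are fetched one after another, so the
# per-item latency adds up across the whole batch.
t0 = time.perf_counter()
batch_seq = [load_item(i) for i in batch_indices]
seq_time = time.perf_counter() - t0

# Concurrent variant: fetch all items of the batch in parallel threads,
# so the round-trip latencies overlap rather than accumulate.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=16) as pool:
    batch_conc = list(pool.map(load_item, batch_indices))
conc_time = time.perf_counter() - t0

print(batch_seq == batch_conc)  # the batch contents are identical
print(conc_time < seq_time)     # only the batch loading time changes
```

With I/O-bound loads like this, the concurrent fetch finishes in roughly the time of a single round-trip rather than sixteen, which is the effect the abstract's reported batch-loading speedup relies on.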