Paper Title
A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs
Authors
Abstract
GPUs are playing an increasingly important role in general-purpose computing. Many algorithms require synchronization at different levels of granularity within a single GPU. Additionally, the emergence of dense GPU nodes calls for multi-GPU synchronization. Nvidia's latest CUDA provides a variety of synchronization methods, but until now there has been no full understanding of their characteristics. This work explores important undocumented features and provides an in-depth analysis of the performance considerations and pitfalls of the state-of-the-art synchronization methods for Nvidia GPUs. The analysis would be useful when making design choices for applications, libraries, and frameworks running in single- and/or multi-GPU environments. We provide a case study of the commonly used reduction operator to illustrate how the knowledge gained from our analysis can be applied. We also describe our micro-benchmarks and measurement methods.