Paper Title
Improving the Efficiency of OpenCL Kernels through Pipes
Paper Authors
Paper Abstract
In an effort to lower the barrier to the adoption of FPGAs by a broader community, major FPGA vendors now offer compiler toolchains for OpenCL code. While these toolchains allow existing code to be ported to FPGAs, ensuring performance portability across devices (i.e., CPUs, GPUs and FPGAs) is not a trivial task. This is in part due to the different hardware characteristics of these devices, including the nature of the hardware parallelism and the memory bandwidth they offer. In particular, global memory accesses are known to be one of the main performance bottlenecks for OpenCL kernels deployed on FPGAs. In this paper, we investigate the use of pipes to improve the memory bandwidth utilization and performance of OpenCL kernels running on FPGAs. This is done by separating the global memory accesses from the computation, enabling better use of the load units required to access global memory. We perform experiments on a set of broadly used benchmark applications with various compute and memory access patterns. Our experiments, conducted on an Intel Arria GX board, show that the proposed method is effective in improving the memory bandwidth utilization of most kernels, particularly those exhibiting irregular memory access patterns. This, in turn, leads to performance improvements that are significant in some cases.
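The separation of global memory accesses from computation described in the abstract can be illustrated with a small software model. The sketch below is not the paper's code: it is a single-threaded C simulation in which a hypothetical memory stage gathers values from a "global" array through an irregular index list and pushes them into a bounded FIFO standing in for a pipe, while a hypothetical compute stage pops values and accumulates their squares. All names (`pipe_t`, `PIPE_DEPTH`, `memory_stage_step`, `compute_stage_step`, `gather_sum_squares`) and the pipe depth are assumptions made for illustration. On an FPGA the two stages would presumably be separate OpenCL kernels running concurrently and connected by an OpenCL pipe (using built-ins such as `read_pipe` and `write_pipe`), which this model only approximates by interleaving them.

```c
#include <assert.h>
#include <stddef.h>

#define PIPE_DEPTH 8 /* assumed FIFO depth, chosen only for illustration */

/* Bounded FIFO standing in for an OpenCL pipe (hypothetical model). */
typedef struct {
    int data[PIPE_DEPTH];
    size_t head, tail, count;
} pipe_t;

/* Returns 0 on success, -1 if the pipe is full. */
static int pipe_write(pipe_t *p, int v) {
    if (p->count == PIPE_DEPTH) return -1;
    p->data[p->tail] = v;
    p->tail = (p->tail + 1) % PIPE_DEPTH;
    p->count++;
    return 0;
}

/* Returns 0 on success, -1 if the pipe is empty. */
static int pipe_read(pipe_t *p, int *v) {
    if (p->count == 0) return -1;
    *v = p->data[p->head];
    p->head = (p->head + 1) % PIPE_DEPTH;
    p->count--;
    return 0;
}

/* Memory stage: performs one irregular global load and forwards it.
 * Returns 1 if it made progress, 0 if it is done or the pipe is full. */
static int memory_stage_step(const int *global_mem, const int *idx, size_t n,
                             size_t *next, pipe_t *p) {
    if (*next >= n) return 0;
    if (pipe_write(p, global_mem[idx[*next]]) != 0) return 0; /* stall */
    (*next)++;
    return 1;
}

/* Compute stage: never touches global memory, only the pipe.
 * Returns 1 if it consumed a value, 0 if the pipe was empty. */
static int compute_stage_step(pipe_t *p, long *acc) {
    int v;
    if (pipe_read(p, &v) != 0) return 0;
    *acc += (long)v * v;
    return 1;
}

/* Drives both stages; the memory stage runs ahead until the pipe fills. */
long gather_sum_squares(const int *global_mem, const int *idx, size_t n) {
    assert(n == 0 || (global_mem != NULL && idx != NULL));
    pipe_t p = {0};
    size_t next = 0, consumed = 0;
    long acc = 0;
    while (consumed < n) {
        while (memory_stage_step(global_mem, idx, n, &next, &p)) {
            /* keep issuing loads while the pipe has room */
        }
        consumed += (size_t)compute_stage_step(&p, &acc);
    }
    return acc;
}
```

In this model the compute stage sees only a stream of values, so the irregular access pattern is confined to the memory stage. Whether that yields a speedup on real hardware depends on the device and toolchain, which the simulation does not capture.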