论文标题

简化多面体模型中的依赖减少

Simplifying Dependent Reductions in the Polyhedral Model

论文作者

Yang, Cambridge, Atkinson, Eric, Carbin, Michael

论文摘要

在许多数值计算中,使用协会和交换运算符的降低 - 对一组价值的积累是一种常见的计算,包括科学计算,机器学习,计算机视觉和财务分析。 现代多面体汇编技术使得可以优化诸如前缀总和之类的降低,其中减少输出的每个组件都可能与还原的另一个组件共享计算。因此,优化编译器可以识别多个组件之间共享的计算,并生成仅计算共享计算一次的代码。 但是,这些技术不支持降低,即当用多面体模型的语言措辞时 - 跨越多个依赖性语句。在这种情况下,现有方法可能会生成不正确的代码,从而违反了原始,未取代程序的数据依赖性。 在这项工作中,我们将依赖减少的优化确定为整数双线性程序。我们提出了一种启发式优化算法,该算法使用程序的仿射顺序时间表来确定如何简化降低,但仍能保留程序的依赖性。 我们证明该算法为有关概率推理算法的文献中的一组基准程序提供了最佳复杂性,该算法的性能非常依赖于简化这些降低。这11个程序中10个计划中的10个复杂性至少由输入数据的尺寸中的因素改善,这些因素在典型的真实应用程序输入的$ 10^4 $至$ 10^6 $之间。我们还通过在墙上锁定时间显示加速度来确认改进的重要性,从$ 1.1 \ text {x} $到$ 10^6 \ text {x} $不等。

A Reduction -- an accumulation over a set of values, using an associative and commutative operator -- is a common computation in many numerical computations, including scientific computations, machine learning, computer vision, and financial analytics. Contemporary polyhedral-based compilation techniques make it possible to optimize reductions, such as prefix sums, in which each component of the reduction's output potentially shares computation with another component in the reduction. Therefore an optimizing compiler can identify the computation shared between multiple components and generate code that computes the shared computation only once. These techniques, however, do not support reductions that -- when phrased in the language of the polyhedral model -- span multiple dependent statements. In such cases, existing approaches can generate incorrect code that violates the data dependences of the original, unoptimized program. In this work, we identify and formalize the optimization of dependent reductions as an integer bilinear program. We present a heuristic optimization algorithm that uses an affine sequential schedule of the program to determine how to simplfy reductions yet still preserve the program's dependences. We demonstrate that the algorithm provides optimal complexity for a set of benchmark programs from the literature on probabilistic inference algorithms, whose performance critically relies on simplifying these reductions. The complexities for 10 of the 11 programs improve siginifcantly by factors at least of the sizes of the input data, which are in the range of $10^4$ to $10^6$ for typical real application inputs. We also confirm the significance of the improvement by showing speedups in wall-clock time that range from $1.1\text{x}$ to over $10^6\text{x}$.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源