Paper Title

Characterizing Intrinsic Compositionality in Transformers with Tree Projections

Authors

Shikhar Murty, Pratyusha Sharma, Jacob Andreas, Christopher D. Manning

Abstract

When trained on language data, do transformers learn some arbitrary computation that utilizes the full capacity of the architecture or do they learn a simpler, tree-like computation, hypothesized to underlie compositional meaning systems like human languages? There is an apparent tension between compositional accounts of human language understanding, which are based on a restricted bottom-up computational process, and the enormous success of neural models like transformers, which can route information arbitrarily between different parts of their input. One possibility is that these models, while extremely flexible in principle, in practice learn to interpret language hierarchically, ultimately building sentence representations close to those predictable by a bottom-up, tree-structured model. To evaluate this possibility, we describe an unsupervised and parameter-free method to functionally project the behavior of any transformer into the space of tree-structured networks. Given an input sentence, we produce a binary tree that approximates the transformer's representation-building process and a score that captures how "tree-like" the transformer's behavior is on the input. While calculation of this score does not require training any additional models, it provably upper-bounds the fit between a transformer and any tree-structured approximation. Using this method, we show that transformers for three different tasks become more tree-like over the course of training, in some cases unsupervisedly recovering the same trees as supervised parsers. These trees, in turn, are predictive of model behavior, with more tree-like models generalizing better on tests of compositional generalization.
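To make the idea concrete, here is a minimal, hypothetical sketch of a tree projection in the spirit of the abstract, not the authors' released implementation. It scores each span by how far its representation inside the full sentence drifts from its representation when the span is encoded alone, then uses a CKY-style dynamic program to find the binary tree with the smallest total drift. The choice of bert-base-uncased, mean-pooled span vectors, Euclidean distance, and the helper names encode, span_distortion, and tree_projection are all illustrative assumptions; the paper's actual procedure differs in its details.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative sketch only: approximate a transformer's representation-
# building process with the binary tree whose spans are least "distorted"
# by outside context. Requires only forward passes, no extra training.

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def encode(words):
    """Mean-pool subword hidden states into one vector per word."""
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    hidden = model(**enc).last_hidden_state[0]
    vecs = []
    for w in range(len(words)):
        idx = [t for t, wid in enumerate(enc.word_ids()) if wid == w]
        vecs.append(hidden[idx].mean(dim=0))
    return torch.stack(vecs)  # shape: (len(words), hidden_dim)

@torch.no_grad()
def span_distortion(words):
    """d[i, j]: distance between span words[i:j] encoded alone vs. in context."""
    n = len(words)
    ctx = encode(words)
    d = {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            alone = encode(words[i:j])
            d[i, j] = torch.dist(ctx[i:j].mean(0), alone.mean(0)).item()
    return d

def tree_projection(words):
    """CKY-style DP: binary tree minimizing total span distortion."""
    n = len(words)
    d = span_distortion(words)
    best, split = {}, {}
    for i in range(n):
        best[i, i + 1] = 0.0  # single words incur no distortion
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            k = min(range(i + 1, j), key=lambda s: best[i, s] + best[s, j])
            best[i, j] = d[i, j] + best[i, k] + best[k, j]
            split[i, j] = k

    def build(i, j):
        if j - i == 1:
            return words[i]
        k = split[i, j]
        return (build(i, k), build(k, j))

    return build(0, n), best[0, n]

tree, score = tree_projection("the cat sat on the mat".split())
print(tree, score)
```

Under this reading, a lower total score means the model's span representations are better explained by bottom-up composition, i.e. the model is more "tree-like" in the sense the abstract describes.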
