Paper Title
Deep Clustering of Text Representations for Supervision-free Probing of Syntax
Paper Authors
Paper Abstract
We explore deep clustering of text representations for unsupervised model interpretation and induction of syntax. As these representations are high-dimensional, out-of-the-box methods like KMeans do not work well. Thus, our approach jointly transforms the representations into a lower-dimensional, cluster-friendly space and clusters them. We consider two notions of syntax in this work: Part of Speech Induction (POSI) and Constituency Labelling (CoLab). Interestingly, we find that Multilingual BERT (mBERT) contains a surprising amount of syntactic knowledge of English, possibly even as much as English BERT (EBERT). Our model can be used as a supervision-free probe, which is arguably a less-biased way of probing. We find that unsupervised probes benefit from higher layers more than supervised probes do. We further note that our unsupervised probe utilizes EBERT and mBERT representations differently, especially for POSI. We validate the efficacy of our probe by demonstrating its capabilities as an unsupervised syntax induction technique. Our probe works well for both syntactic formalisms by simply adapting the input representations. We report competitive performance of our probe on 45-tag English POSI, state-of-the-art performance on 12-tag POSI across 10 languages, and competitive results on CoLab. We also perform zero-shot syntax induction on resource-impoverished languages and report strong results.
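To make the core idea concrete, below is a minimal sketch of joint dimensionality reduction and clustering in the style of Deep Embedded Clustering (DEC): an autoencoder maps high-dimensional token representations into a low-dimensional latent space while a soft-assignment clustering loss is optimized jointly. This is an illustrative approximation under assumed choices, not the authors' exact model; the class name `DeepCluster`, the latent size, the layer widths, and the DEC-style losses are assumptions, and the random input vectors stand in for EBERT/mBERT token representations.

```python
# Minimal DEC-style sketch: jointly learn a low-dimensional latent space
# and cluster assignments. Illustrative only; hyperparameters and the
# architecture are assumptions, not the paper's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepCluster(nn.Module):
    def __init__(self, input_dim=768, latent_dim=32, n_clusters=45):
        super().__init__()
        # Encoder: maps high-dim representations to a cluster-friendly space.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: a reconstruction loss keeps the latent space informative.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )
        # Learnable cluster centroids living in the latent space.
        self.centroids = nn.Parameter(torch.randn(n_clusters, latent_dim))

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        # Student-t soft assignment of each point to each centroid (as in DEC).
        dist_sq = torch.cdist(z, self.centroids).pow(2)
        q = (1.0 + dist_sq).reciprocal()
        q = q / q.sum(dim=1, keepdim=True)
        return x_hat, q

def target_distribution(q):
    # Sharpened targets that emphasize confident assignments (DEC's P).
    p = q.pow(2) / q.sum(dim=0)
    return p / p.sum(dim=1, keepdim=True)

# Toy usage: random vectors stand in for BERT token representations.
x = torch.randn(512, 768)
model = DeepCluster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100):
    x_hat, q = model(x)
    loss = F.mse_loss(x_hat, x) + F.kl_div(
        q.log(), target_distribution(q).detach(), reduction="batchmean"
    )
    opt.zero_grad()
    loss.backward()
    opt.step()

# Induced labels (e.g., POS-like clusters) are the argmax soft assignment.
labels = q.argmax(dim=1)
```

The reason KMeans alone struggles here, as the abstract notes, is that distances in the raw 768-dimensional BERT space are dominated by directions irrelevant to syntax; jointly learning the projection lets the clustering objective shape a space where syntactic categories separate.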