Paper Title

Quantifying Synthesis and Fusion and their Impact on Machine Translation

Authors

Arturo Oncevay, Duygu Ataman, Niels van Berkel, Barry Haddow, Alexandra Birch, Johannes Bjerva

Abstract

Theoretical work in morphological typology offers the possibility of measuring morphological diversity on a continuous scale. However, literature in Natural Language Processing (NLP) typically labels a whole language with a strict type of morphology, e.g. fusional or agglutinative. In this work, we propose to reduce the rigidity of such claims by quantifying morphological typology at the word and segment level. We consider the approach of Payne (2017), which classifies morphology using two indices: synthesis (e.g. analytic to polysynthetic) and fusion (agglutinative to fusional). For computing synthesis, we test unsupervised and supervised morphological segmentation methods for English, German and Turkish, whereas for fusion, we propose a semi-automatic method using Spanish as a case study. Then, we analyse the relationship between machine translation quality and the degree of synthesis and fusion at the word level (nouns and verbs for English-Turkish, and verbs for English-Spanish) and at the segment level (the previous language pairs plus English-German, in both directions). We complement the word-level analysis with human evaluation, and overall, we observe a consistent impact of both indices on machine translation quality.
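As a rough illustration of what a word-level and segment-level synthesis score could look like, the sketch below counts morphemes per word in text that has already been segmented. It assumes the common reading of a synthesis index as morphemes per word; whether the paper computes it exactly this way is an assumption, and the function names, the `+` separator, and the hand-segmented example words are all hypothetical.

```python
from statistics import mean


def synthesis_per_word(segmented_word: str, sep: str = "+") -> int:
    """Count the morphemes in one pre-segmented word, e.g. 'ev+ler+de' -> 3.

    The '+' separator is an assumption about the segmenter's output format.
    """
    return len([m for m in segmented_word.split(sep) if m])


def synthesis_index(segmented_words: list[str], sep: str = "+") -> float:
    """Average morphemes per word over a list of pre-segmented words."""
    if not segmented_words:
        raise ValueError("need at least one word")
    return float(mean(synthesis_per_word(w, sep) for w in segmented_words))


# Hand-segmented toy examples (illustrative only, not data from the paper).
english = ["the", "house+s", "walk+ed"]
turkish = ["ev+ler+de", "gel+iyor+um"]

english_score = synthesis_index(english)
turkish_score = synthesis_index(turkish)
```

Under this reading, a higher score indicates a more synthetic word or segment, so the Turkish toy list scores higher than the English one. A real pipeline would obtain the segmentations from a supervised or unsupervised morphological segmenter rather than by hand.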
