Paper Title
A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation
Paper Authors
Paper Abstract
In this paper, we introduce a novel methodology to efficiently construct a corpus for question answering over structured data. To this end, we introduce an intermediate representation called Operation Trees (OT), which is based on the logical query plan in a database. This representation allows us to invert the annotation process without losing flexibility in the types of queries that we generate. Furthermore, it allows for fine-grained alignment of query tokens to OT operations. In our method, we randomly generate OTs from a context-free grammar. Annotators then write a natural language question that expresses each OT. Finally, the annotators assign the tokens of the question to the OT operations. We apply the method to create a new corpus, OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus for evaluating natural language interfaces to databases. We compare OTTA to Spider and LC-QuaD 2.0 and show that our methodology more than triples the annotation speed while maintaining the complexity of the queries. Finally, we train a state-of-the-art semantic parsing model on our data and show that our corpus is a challenging dataset and that the token alignment can be leveraged to significantly improve performance.
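The generation step described in the abstract (sampling OTs at random from a context-free grammar) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the grammar `GRAMMAR`, its operation names (`Projection`, `Count`, `Selection`, `Join`, `TableScan`), and the helpers `generate`, `operations`, and `depth` are hypothetical and are not taken from the paper, whose actual grammar and operation set are not given in this abstract.

```python
import random

# Hypothetical toy grammar (an assumption, not the paper's grammar).
# Each nonterminal maps to rules of the form (operation, child nonterminals).
GRAMMAR = {
    "Query": [("Projection", ["Rows"]), ("Count", ["Rows"])],
    "Rows": [
        ("TableScan", []),
        ("Selection", ["Rows"]),
        ("Join", ["Rows", "Rows"]),
    ],
}


def generate(symbol, rng, max_depth):
    """Randomly expand `symbol` into an operation tree (nested dicts)."""
    rules = GRAMMAR[symbol]
    if max_depth <= 1:
        # Out of depth budget: prefer rules without children so that
        # the recursion is guaranteed to stop.
        leaf_rules = [rule for rule in rules if not rule[1]]
        rules = leaf_rules or rules
    op, child_symbols = rng.choice(rules)
    children = [generate(c, rng, max_depth - 1) for c in child_symbols]
    return {"op": op, "children": children}


def operations(tree):
    """Return the operations of a tree in pre-order."""
    ops = [tree["op"]]
    for child in tree["children"]:
        ops.extend(operations(child))
    return ops


def depth(tree):
    """Return the number of nodes on the longest root-to-leaf path."""
    return 1 + max((depth(c) for c in tree["children"]), default=0)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sample is reproducible
    tree = generate("Query", rng, max_depth=4)
    print(operations(tree))
```

In a workflow like the one the abstract describes, an annotator would be shown such a tree, write a question for it, and then link tokens of the question to individual operations. The sketch covers only the random sampling step.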