A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese

Paper Title

A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese

Authors

Nguyen, Anh Tuan; Dao, Mai Hoang; Nguyen, Dat Quoc

Abstract

Semantic parsing is an important NLP task. However, Vietnamese is a low-resource language in this research area. In this paper, we present the first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese. We extend and evaluate two strong semantic parsing baselines EditSQL (Zhang et al., 2019) and IRNet (Guo et al., 2019) on our dataset. We compare the two baselines with key configurations and find that: automatic Vietnamese word segmentation improves the parsing results of both baselines; the normalized pointwise mutual information (NPMI) score (Bouma, 2009) is useful for schema linking; latent syntactic features extracted from a neural dependency parser for Vietnamese also improve the results; and the monolingual language model PhoBERT for Vietnamese (Nguyen and Nguyen, 2020) helps produce higher performances than the recent best multilingual language model XLM-R (Conneau et al., 2020).
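The abstract names the NPMI score (Bouma, 2009) as useful for schema linking but does not say how it is computed. The following is a minimal sketch of the standard NPMI definition, npmi(x, y) = pmi(x, y) / -log p(x, y), estimated from co-occurrence counts. The function name, its parameters, and the count-based estimation are assumptions made for illustration; the abstract does not describe how the authors gather counts or apply the score to questions and schema items.

```python
import math


def npmi(count_xy, count_x, count_y, total):
    """Normalized pointwise mutual information, in the range [-1, 1].

    count_xy: number of observations where x and y co-occur (hypothetical input)
    count_x, count_y: number of observations containing x, respectively y
    total: total number of observations
    """
    if count_xy == 0:
        # Limit value when x and y never co-occur.
        return -1.0
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    if p_xy == 1.0:
        # x and y occur together in every observation; avoid dividing by zero.
        return 1.0
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)
```

Under this definition, a score near 1 means the two items almost always appear together, a score near 0 means they are roughly independent, and -1 means they never co-occur. In a schema-linking setting one might, for example, score a question word against a column name and keep pairs above a chosen threshold, though that usage is an assumption here rather than something the abstract states.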
