Paper title
A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese
Paper authors
Paper abstract
Semantic parsing is an important NLP task. However, Vietnamese is a low-resource language in this research area. In this paper, we present the first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese. We extend and evaluate two strong semantic parsing baselines, EditSQL (Zhang et al., 2019) and IRNet (Guo et al., 2019), on our dataset. We compare the two baselines under key configurations and find that: automatic Vietnamese word segmentation improves the parsing results of both baselines; the normalized pointwise mutual information (NPMI) score (Bouma, 2009) is useful for schema linking; latent syntactic features extracted from a neural dependency parser for Vietnamese also improve the results; and the monolingual language model PhoBERT for Vietnamese (Nguyen and Nguyen, 2020) helps produce higher performance than the recent best multilingual language model XLM-R (Conneau et al., 2020).
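For readers unfamiliar with the NPMI score mentioned in the abstract, the following is a minimal sketch of the standard definition, NPMI(x, y) = PMI(x, y) / (-log p(x, y)), computed from co-occurrence counts. The abstract does not say how the score is applied to schema linking, so the function name `npmi`, its count-based parameters, and the handling of the boundary cases are illustrative assumptions, not the authors' implementation.

```python
import math


def npmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Normalized pointwise mutual information from raw counts.

    count_xy: number of observations in which x and y co-occur
    count_x, count_y: number of observations containing x (resp. y)
    total: total number of observations
    The result lies in [-1, 1]: -1 means x and y never co-occur,
    0 means they are independent, 1 means they always co-occur.
    """
    if count_xy == 0:
        # Limit value when the pair never co-occurs (assumed convention).
        return -1.0
    p_xy = count_xy / total
    if p_xy == 1.0:
        # Both items appear in every observation; avoid dividing by log(1) = 0.
        return 1.0
    p_x = count_x / total
    p_y = count_y / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)
```

In a schema-linking setting, x could be a question word and y a column or table name, with counts gathered over training pairs; that pairing is a hypothetical illustration only.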