Paper Title

ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications

Authors

Zuluaga-Gomez, Juan; Veselý, Karel; Szöke, Igor; Blatt, Alexander; Motlicek, Petr; Kocour, Martin; Rigault, Mickael; Choukri, Khalid; Prasad, Amrutha; Sarfjoo, Seyyed Saeed; Nigmatulina, Iuliia; Cevenini, Claudia; Kolčárek, Pavel; Tart, Allan; Černocký, Jan; Klakow, Dietrich

Abstract

Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried out between an air traffic controller (ATCO) and pilots via very-high-frequency radio channels. In order to incorporate these novel technologies into ATC (a low-resource domain), large-scale annotated datasets are required to develop data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research in the challenging ATC field, which has lagged behind due to a lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotation of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets. 1) The ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, a signal-to-noise ratio estimate and an English language detection score per sample. Both are available for purchase through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. 3) The ATCO2-test-set-1h corpus is a one-hour subset of the original test set corpus, which we offer for free at https://www.atco2.org/data. We expect the ATCO2 corpus to foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.
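The per-sample metadata that the abstract lists for the ATCO2-PL-set (automatic transcript, signal-to-noise ratio estimate, English language detection score) lends itself to filtering the pseudo-labeled data before training. The following is a minimal sketch of that idea, not the corpus's actual file format or the authors' selection procedure: the class name, field names, score range, thresholds, and example utterances are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PseudoLabeledSample:
    """One utterance with metadata of the kind the abstract describes.

    All field names are hypothetical; the released corpus may use a
    different schema.
    """
    audio_id: str
    transcript: str       # automatic transcript from an in-domain recognizer
    snr_db: float         # signal-to-noise ratio estimate, assumed to be in dB
    english_score: float  # English detection score, assumed to lie in [0, 1]


def select_for_training(samples, min_snr_db=10.0, min_english_score=0.5):
    """Keep samples that pass both thresholds (threshold values are illustrative)."""
    return [
        s for s in samples
        if s.snr_db >= min_snr_db and s.english_score >= min_english_score
    ]


# Made-up utterances, used only to exercise the filter.
samples = [
    PseudoLabeledSample("utt-001", "descend flight level eight zero", 18.5, 0.97),
    PseudoLabeledSample("utt-002", "dobry den", 21.0, 0.12),   # low English score
    PseudoLabeledSample("utt-003", "climb flight level one two zero", 3.2, 0.91),  # noisy
]

kept = select_for_training(samples)
print([s.audio_id for s in kept])
```

With these made-up values, only `utt-001` passes both thresholds. In practice the cut-offs would need tuning against a manually transcribed set such as the ATCO2-test-set.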
