Paper title
Multilingual Offensive Language Identification with Cross-lingual Embeddings
Paper authors
Paper abstract
Offensive content is pervasive in social media and a cause for concern to companies and government organizations. Several recent studies have investigated methods to detect the various forms of such content (e.g. hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partly because most available annotated datasets contain English data. In this paper, we take advantage of the available English data by applying cross-lingual contextual word embeddings and transfer learning to make predictions in languages with fewer resources. We project predictions on comparable data in Bengali, Hindi, and Spanish, and we report results of 0.8415 F1 macro for Bengali, 0.8568 F1 macro for Hindi, and 0.7513 F1 macro for Spanish. Finally, we show that our approach compares favorably to the best systems submitted to recent shared tasks on these three languages, confirming the robustness of cross-lingual contextual embeddings and transfer learning for this task.