Mapping Languages: The Corpus of Global Language Use

Paper Title

Mapping Languages: The Corpus of Global Language Use

Author

Dunn, Jonathan

Abstract

This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used (e.g., English, Arabic, Russian) together with consistently collected data for each variety. Second, the paper evaluates a language identification model that supports more local languages with smaller sample sizes than alternative off-the-shelf models. Improved language identification is essential for moving beyond majority languages. Given the focus on language mapping, the paper analyzes how well this digital language data represents actual populations by (i) systematically comparing the corpus with demographic ground-truth data and (ii) triangulating the corpus with an alternate Twitter-based dataset. In total, the corpus contains 423 billion words representing 148 languages (with over 1 million words from each language) and 158 countries (again with over 1 million words from each country), all distilled from Common Crawl web data. The main contribution of this paper, in addition to describing this publicly-available corpus, is to provide a comprehensive analysis of the relationship between two sources of digital data (the web and Twitter) as well as their connection to underlying populations.
