Paper title
Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media
Paper authors
Abstract
Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models is selected based on its subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.