Time-Aware Datasets are Adaptive Knowledgebases for the New Normal

Paper Title

Time-Aware Datasets are Adaptive Knowledgebases for the New Normal

Authors

Abhijit Suprem, Sanjyot Vaidya, Joao Eduardo Ferreira, Calton Pu

Abstract

Recent advances in text classification and knowledge capture in language models have relied on availability of large-scale text datasets. However, language models are trained on static snapshots of knowledge and are limited when that knowledge evolves. This is especially critical for misinformation detection, where new types of misinformation continuously appear, replacing old campaigns. We propose time-aware misinformation datasets to capture time-critical phenomena. In this paper, we first present evidence of evolving misinformation and show that incorporating even simple time-awareness significantly improves classifier accuracy. Second, we present COVID-TAD, a large-scale COVID-19 misinformation dataset spanning 25 months. It is the first large-scale misinformation dataset that contains multiple snapshots of a datastream and is orders of magnitude bigger than related misinformation datasets. We describe the collection and labeling process, as well as preliminary experiments.
