Paper Title
EPIC30M: An Epidemics Corpus Of Over 30 Million Relevant Tweets
Authors
Abstract
Since the start of COVID-19, several relevant corpora from various sources, each containing millions of data points, have been presented in the literature. While these corpora are valuable in supporting many analyses of this specific pandemic, researchers require additional benchmark corpora that cover other epidemics to facilitate cross-epidemic pattern recognition and trend analysis tasks. During our other efforts on COVID-19 related work, we found very few disease-related corpora in the literature that are sizable and rich enough to support such cross-epidemic analysis tasks. In this paper, we present EPIC30M, a large-scale epidemic corpus that contains 30 million micro-blog posts, i.e., tweets crawled from Twitter, from 2006 to 2020. EPIC30M contains a subset of 26.2 million tweets related to three general diseases, namely Ebola, Cholera and Swine Flu, and another subset of 4.7 million tweets related to six global epidemic outbreaks: 2009 H1N1 Swine Flu, 2010 Haiti Cholera, 2012 Middle-East Respiratory Syndrome (MERS), 2013 West African Ebola, 2016 Yemen Cholera and 2018 Kivu Ebola. Furthermore, we explore and discuss the properties of the corpus, with statistics of key terms and hashtags and a trend analysis for each subset. Finally, we demonstrate the value and impact that EPIC30M could create through a discussion of multiple use cases in cross-epidemic research topics that have attracted growing interest in recent years. These use cases span multiple research areas, such as epidemiological modeling, pattern recognition, natural language understanding and economic modeling.