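Long-term word frequency dynamics derived from Twitter are corrupted: A bespoke approach to detecting and removing pathologies in ensembles of time series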

Paper title

Long-term word frequency dynamics derived from Twitter are corrupted: A bespoke approach to detecting and removing pathologies in ensembles of time series

Authors

Dodds, P. S., Minot, J. R., Arnold, M. V., Alshaabi, T., Adams, J. L., Dewhurst, D. R., Reagan, A. J., Danforth, C. M.

Abstract

Maintaining the integrity of long-term data collection is an essential scientific practice. As a field evolves, so too will that field's measurement instruments and data storage systems, as they are invented, improved upon, and made obsolete. For data streams generated by opaque sociotechnical systems that may have episodic and unknown internal rule changes, detecting and accounting for shifts in historical datasets requires vigilance and creative analysis. Here, we show that around 10% of day-scale word usage frequency time series for Twitter, collected in real time for a set of roughly 10,000 frequently used words for over 10 years, come from tweets with, in effect, corrupted language labels. We describe how we uncovered problematic signals while comparing word usage over varying time frames. We locate time points where Twitter switched on or off different kinds of language identification algorithms, and where data formats may have changed. We then show how we create a statistic for identifying and removing words with pathological time series. While our resulting process for removing 'bad' time series from ensembles of time series is particular, the approach leading to its construction may be generalizable.
