Boosting Entity Mention Detection for Targetted Twitter Streams with Global Contextual Embeddings

Paper Title

Boosting Entity Mention Detection for Targetted Twitter Streams with Global Contextual Embeddings

Authors

Satadisha Saha Bhowmick, Eduard C. Dragut, Weiyi Meng

Abstract

Microblogging sites, like Twitter, have emerged as ubiquitous sources of information. Two important tasks related to the automatic extraction and analysis of information in Microblogs are Entity Mention Detection (EMD) and Entity Detection (ED). The state-of-the-art EMD systems aim to model the non-literary nature of microblog text by training upon offline static datasets. They extract a combination of surface-level features -- orthographic, lexical, and semantic -- from individual messages for noisy text modeling and entity extraction. But given the constantly evolving nature of microblog streams, detecting all entity mentions from such varying yet limited context of short messages remains a difficult problem. To this end, we propose a framework named EMD Globalizer, better suited for the execution of EMD learners on microblog streams. It deviates from the processing of isolated microblog messages by existing EMD systems, where learned knowledge from the immediate context of a message is used to suggest entities. After an initial extraction of entity candidates by an EMD system, the proposed framework leverages occurrence mining to find additional candidate mentions that are missed during this first detection. Aggregating the local contextual representations of these mentions, a global embedding is drawn from the collective context of an entity candidate within a stream. The global embeddings are then utilized to separate entities within the candidates from false positives. All mentions of said entities from the stream are produced in the framework's final outputs. Our experiments show that EMD Globalizer can enhance the effectiveness of all existing EMD systems that we tested (on average by 25.61%) with a small additional computational overhead.
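
The pipeline described in the abstract (initial extraction, occurrence mining, aggregation into a global embedding, then filtering of false positives) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy capitalization-based detector, the one-dimensional "embedding", the use of a plain mean as the aggregation step, and the threshold classifier are all assumptions made purely for illustration.

```python
from collections import defaultdict


def find_mentions(tokens, candidate):
    """Start indices where `candidate` (a tuple of lowercase tokens) occurs."""
    n = len(candidate)
    lowered = [t.lower() for t in tokens]
    return [i for i in range(len(lowered) - n + 1)
            if tuple(lowered[i:i + n]) == candidate]


def mean_vector(vectors):
    """Element-wise mean; averaging is an assumed aggregation choice."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def globalize(stream, local_emd, local_embed, is_entity):
    """Hypothetical wrapper around a message-level EMD system."""
    # Step 1: initial candidate extraction, one message at a time.
    candidates = set()
    for tokens in stream:
        for start, end in local_emd(tokens):
            candidates.add(tuple(t.lower() for t in tokens[start:end]))

    # Step 2: occurrence mining over the whole stream, which recovers
    # mentions the local system missed (for example, lowercase variants).
    mentions = defaultdict(list)
    for msg_id, tokens in enumerate(stream):
        for cand in candidates:
            for start in find_mentions(tokens, cand):
                mentions[cand].append((msg_id, start, start + len(cand)))

    # Steps 3 and 4: aggregate local representations into one global
    # embedding per candidate, then keep candidates judged to be entities.
    output = {}
    for cand, spans in mentions.items():
        local_vectors = [local_embed(stream[m], s, e) for m, s, e in spans]
        if is_entity(mean_vector(local_vectors)):
            output[cand] = spans
    return output


# Toy stand-ins (assumptions, not real EMD components).
def toy_emd(tokens):
    """Flag capitalized tokens that are not message-initial."""
    return [(i, i + 1) for i, t in enumerate(tokens) if i > 0 and t[0].isupper()]


def toy_embed(tokens, start, end):
    """One-dimensional 'embedding': 1.0 if the mention is capitalized."""
    return [1.0 if tokens[start][0].isupper() else 0.0]


def toy_is_entity(global_vec):
    """Threshold placeholder standing in for a trained classifier."""
    return global_vec[0] >= 0.5


STREAM = [
    ["Watching", "Beyonce", "tonight"],
    ["beyonce", "was", "amazing"],
    ["so", "Happy", "today"],
    ["happy", "to", "be", "here"],
    ["feeling", "happy", "again"],
    ["Beyonce", "slays"],
]

result = globalize(STREAM, toy_emd, toy_embed, toy_is_entity)
```

In this toy stream the local detector proposes both "beyonce" and "happy" as candidates. Occurrence mining then collects every mention of each, and the averaged representation keeps "beyonce" (capitalized in two of three mentions) while discarding "happy" (capitalized in one of three) as a false positive.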
