Paper Title

Principal Phrase Mining

Authors

Ellie Small, Javier Cabrera

Abstract

Extracting frequent words from a collection of texts is commonly performed in many subjects. However, as useful as it is to obtain a collection of commonly occurring words from texts, there is a need for more specific information to be obtained from texts in the form of most commonly occurring phrases. Despite this need, extracting frequent phrases is not commonly done due to inherent complications, the most significant being double-counting. Double-counting occurs when words or phrases are counted when they appear inside longer phrases that themselves are also counted, resulting in a selection of mostly meaningless phrases that are frequent only because they occur inside frequent super phrases. Several papers have been written on phrase mining that describe solutions to this issue; however, they either require a list of so-called quality phrases to be available to the extracting process, or they require human interaction to identify those quality phrases during the process. We present here a method that eliminates double-counting via a unique rectification process that does not require lists of quality phrases. In the context of a set of texts, we define a principal phrase as a phrase that does not cross punctuation marks, does not start with a stop word (with the exception of the stop words "not" and "no"), does not end with a stop word, is frequent within those texts without being double-counted, and is meaningful to the user. Our method identifies such principal phrases independently without human input, and enables their extraction from any texts within a reasonable amount of time.
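The structural part of the definition above, and the double-counting problem it is meant to address, can be illustrated with a minimal Python sketch. This is not the authors' rectification process, which the abstract does not describe. The stop-word list, the punctuation rule, the `max_len` limit, and the function names `segments`, `is_candidate`, and `naive_counts` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; a real one would be much larger.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "to", "and", "not", "no"}
# "not" and "no" are the stop words the abstract allows at the start of a phrase.
ALLOWED_FIRST = {"not", "no"}


def segments(text):
    """Split text at punctuation so that no phrase can cross a punctuation mark."""
    for chunk in re.split(r"[^\w\s]+", text.lower()):
        words = chunk.split()
        if words:
            yield words


def is_candidate(words):
    """Apply the start/end stop-word rules from the abstract's definition."""
    first, last = words[0], words[-1]
    if first in STOP_WORDS and first not in ALLOWED_FIRST:
        return False
    return last not in STOP_WORDS


def naive_counts(texts, max_len=4):
    """Count every candidate n-gram, including those nested in longer ones."""
    counts = Counter()
    for text in texts:
        for words in segments(text):
            for n in range(1, max_len + 1):
                for i in range(len(words) - n + 1):
                    gram = words[i:i + n]
                    if is_candidate(gram):
                        counts[" ".join(gram)] += 1
    return counts


texts = [
    "The New York Times is a newspaper.",
    "She reads the New York Times, not the tabloids.",
    "New York Times reporters write daily.",
]
counts = naive_counts(texts)
# All three phrases get the same count, although "york times" and "new york"
# only ever occur inside "new york times": this is double-counting.
print(counts["new york times"], counts["york times"], counts["new york"])
```

In this toy corpus the naive counter cannot distinguish the meaningful phrase from its fragments, which is the situation the abstract says a rectification step has to resolve.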
