Semantic Sensitive TF-IDF to Determine Word Relevance in Documents

Paper Title

Semantic Sensitive TF-IDF to Determine Word Relevance in Documents

Authors

Jalilifard, Amir; Caridá, Vinicius F.; Mansano, Alex F.; Cristo, Rogers S.; da Fonseca, Felipe Penhorate C.

Abstract

Keyword extraction has received increasing attention as a research topic that can lead to advances in diverse applications such as document context categorization, text indexing, and document classification. In this paper we propose STF-IDF, a novel semantic method based on TF-IDF, for scoring word importance in the informal documents of a corpus. A set of nearly four million documents was collected from health-care social media and used to train a semantic model and learn word embeddings. The features of the resulting semantic space were then used to rearrange the original TF-IDF scores through an iterative solution, so as to improve the algorithm's moderate performance on informal text. In a test on 200 randomly chosen documents, the proposed method reduced the mean error rate of TF-IDF by roughly 50%, reaching a mean error of 13.7% as opposed to 27.2% for the original TF-IDF.
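
For orientation, the baseline that STF-IDF starts from can be sketched as follows. This is a minimal illustration of standard TF-IDF only, assuming whitespace-tokenized documents and an unsmoothed logarithmic IDF; it is not the authors' STF-IDF, whose embedding-based iterative rescoring is not specified in the abstract. The function name `tf_idf` and the three sample posts are hypothetical. The last line checks the reported error reduction from the two figures given in the abstract.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document (plain TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # Relative term frequency times unsmoothed inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical informal health-care posts, made up for illustration.
docs = [
    "my doctor changed my insulin dose".split(),
    "my insulin pump is new".split(),
    "my knee hurts after running".split(),
]
scores = tf_idf(docs)

# Relative reduction implied by the reported mean errors (27.2% vs 13.7%).
reduction = (27.2 - 13.7) / 27.2
```

A term that occurs in every document, such as "my" here, receives a score of zero, while rarer terms score higher. The relative reduction works out to just under one half, consistent with the roughly 50% stated in the abstract.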

Paper Title