Improving Bi-LSTM Performance for Indonesian Sentiment Analysis Using Paragraph Vector

Paper title

Improving Bi-LSTM Performance for Indonesian Sentiment Analysis Using Paragraph Vector

Authors

Purwarianti, Ayu; Crisdayanti, Ida Ayu Putu Ari

Abstract

Bidirectional Long Short-Term Memory networks (Bi-LSTM) have shown promising performance in sentiment classification tasks. A Bi-LSTM processes its input as a sequence of information. Because of this behavior, its sentiment predictions are influenced by word order, and the first or last phrases of a text tend to have stronger features than other phrases. In Indonesian sentiment analysis, however, the phrases that express the sentiment of a document might not appear in its first or last part, which can lead to incorrect sentiment classification. To this end, we propose using an existing document representation method called paragraph vector as additional input features for Bi-LSTM. This vector provides the context of the whole document at each step of sequence processing. The paragraph vector is simply concatenated to each word vector of the document. This representation also helps to differentiate ambiguous Indonesian words. Bi-LSTM and paragraph vector were previously used as separate methods. Combining the two methods has shown a significant performance improvement of the Indonesian sentiment analysis model. Several case studies on the test data showed that the proposed method can handle the sentiment phrase position problem encountered by Bi-LSTM.
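
The concatenation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `attach_paragraph_vector`, the vector sizes, and all numeric values are assumptions chosen for illustration, and plain Python lists stand in for the tensors a real Bi-LSTM would consume. The only point shown is that the same document-level vector is appended to every word vector, so each time step carries document context.

```python
from typing import List, Sequence


def attach_paragraph_vector(
    word_vectors: Sequence[Sequence[float]],
    paragraph_vector: Sequence[float],
) -> List[List[float]]:
    """Append the same document-level vector to every word vector.

    Hypothetical helper: each output row is [word vector | paragraph vector],
    which is the per-time-step input a Bi-LSTM would then read.
    """
    return [list(word) + list(paragraph_vector) for word in word_vectors]


# Toy document with 3 tokens and 4-dimensional word vectors (made-up values).
word_vectors = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
# Toy 2-dimensional paragraph vector for the whole document (made-up values).
paragraph_vector = [9.0, 8.0]

bilstm_inputs = attach_paragraph_vector(word_vectors, paragraph_vector)

# Sequence length is unchanged; feature size grows from 4 to 4 + 2.
print(len(bilstm_inputs), len(bilstm_inputs[0]))
```

Because every time step sees the same paragraph vector, a sentiment phrase in the middle of the document can still influence the features at the first and last positions, which is the position problem the abstract refers to.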
