Authorship Attribution in Bangla literature using Character-level CNN

Paper title

Authorship Attribution in Bangla literature using Character-level CNN

Authors

Aisha Khatun, Anisur Rahman, Md. Saiful Islam, Marium-E-Jannat

Abstract

Characters are the smallest unit of text from which stylometric signals can be extracted to determine the author of a text. In this paper, we investigate the effectiveness of character-level signals in authorship attribution of Bangla literature and show that the results are promising but improvable. The proposed model is much more time- and memory-efficient than its word-level counterparts, but its accuracy is 2-5% lower than that of the best-performing word-level models. We compare it against various word-based models and show that the proposed model performs increasingly better with larger datasets. We also analyze the effect of pre-training character embeddings of the diverse Bangla character set on authorship attribution; pre-training improves performance by up to 10%. We used 2 datasets covering 6 to 14 authors, balanced them before training, and compared the results.
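To make the character-level input and the dataset balancing mentioned in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline: the function names (`build_char_vocab`, `encode`, `balance`), the reserved padding and unknown indices, the fixed `max_len`, and the choice to balance by downsampling to the smallest class are all assumptions made for illustration. The resulting integer sequences are the kind of input an embedding layer followed by a 1D CNN would consume.

```python
import random
from collections import defaultdict

PAD, UNK = 0, 1  # reserved indices (illustrative choice, not from the paper)


def build_char_vocab(texts):
    """Map every distinct character (Unicode code point) to an index >= 2."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(text, vocab, max_len):
    """Turn a text into a fixed-length list of character indices.

    Longer texts are truncated, shorter ones are right-padded with PAD,
    and characters missing from the vocabulary map to UNK.
    """
    ids = [vocab.get(ch, UNK) for ch in text[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def balance(samples, seed=0):
    """Downsample every author to the size of the smallest class.

    `samples` is a list of (text, author) pairs. Downsampling is an
    assumption; the abstract only says the datasets were balanced.
    """
    by_author = defaultdict(list)
    for text, author in samples:
        by_author[author].append(text)
    n = min(len(texts) for texts in by_author.values())
    rng = random.Random(seed)
    balanced = []
    for author in sorted(by_author):
        for text in rng.sample(by_author[author], n):
            balanced.append((text, author))
    return balanced


if __name__ == "__main__":
    # Toy data with hypothetical author labels.
    data = [("আমি বাংলায় লিখি", "A"), ("তুমি কোথায়", "A"), ("সে আসবে", "B")]
    balanced = balance(data)
    vocab = build_char_vocab(text for text, _ in balanced)
    for text, author in balanced:
        print(author, encode(text, vocab, max_len=12))
```

Note that iterating over a Python string yields Unicode code points, so Bangla vowel signs and conjuncts are split into separate symbols here; whether the paper treats them this way or as combined units is not stated in the abstract.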
