Paper Title
Learning Directly from Grammar Compressed Text
Paper Authors
Paper Abstract
Neural networks trained on large amounts of text data have been successfully applied to a variety of tasks. Although massive text collections are usually compressed with techniques such as grammar compression, almost all previous machine learning methods take already-decompressed sequence data as their input. In this paper, we propose a method to apply neural sequence models directly to text data compressed with grammar compression algorithms, without decompressing it. To encode the unique symbols that appear in the compression rules, we introduce composer modules that incrementally encode the symbols into vector representations. Through experiments on real datasets, we empirically show that the proposed model achieves both memory and computational efficiency while maintaining moderate performance.
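
The following is a minimal sketch of the idea summarized in the abstract, not the authors' implementation. It assumes a Re-Pair-style grammar in which each rule rewrites a new symbol into a pair of existing symbols, and the names Composer and GrammarEncoder are hypothetical: a small composer network builds the vector of each rule symbol from the vectors of its two children, so a downstream sequence model can consume the compressed symbol sequence without decompression.

# Illustrative sketch (PyTorch); assumes a Re-Pair-style grammar X -> A B.
import torch
import torch.nn as nn

class Composer(nn.Module):
    """Combine the embeddings of a rule's two child symbols into one vector."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, left, right):
        return torch.tanh(self.proj(torch.cat([left, right], dim=-1)))

class GrammarEncoder(nn.Module):
    """Incrementally build vectors for grammar symbols, then embed the
    compressed sequence with them (no decompression of the text)."""
    def __init__(self, num_terminals, dim):
        super().__init__()
        self.terminal_emb = nn.Embedding(num_terminals, dim)
        self.composer = Composer(dim)

    def forward(self, rules, compressed_seq):
        # rules: list of (left_id, right_id) pairs in bottom-up order;
        # rule i defines symbol id num_terminals + i.
        table = list(self.terminal_emb.weight)        # terminal symbol vectors
        for left_id, right_id in rules:               # encode rule symbols incrementally
            table.append(self.composer(table[left_id], table[right_id]))
        table = torch.stack(table)
        # Look up the vectors of the compressed sequence's symbols; these can be
        # fed to any neural sequence model (e.g., an LSTM or Transformer).
        return table[compressed_seq]

# Hypothetical usage: terminals 0..2, rule 3 -> (0, 1), rule 4 -> (3, 2).
# enc = GrammarEncoder(num_terminals=3, dim=16)
# vecs = enc(rules=[(0, 1), (3, 2)], compressed_seq=torch.tensor([4, 0, 3]))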