Paper Title
Improving Tokenisation by Alternative Treatment of Spaces
Paper Authors
Paper Abstract
Tokenisation is the first step in almost all NLP tasks, and state-of-the-art transformer-based language models all use subword tokenisation algorithms to process input text. Existing algorithms have problems, often producing tokenisations of limited linguistic validity, and representing equivalent strings differently depending on their position within a word. We hypothesise that these problems hinder the ability of transformer-based models to handle complex words, and suggest that these problems are a result of allowing tokens to include spaces. We thus experiment with an alternative tokenisation approach where spaces are always treated as individual tokens. Specifically, we apply this modification to the BPE and Unigram algorithms. We find that our modified algorithms lead to improved performance on downstream NLP tasks that involve handling complex words, whilst having no detrimental effect on performance in general natural language understanding tasks. Intrinsically, we find our modified algorithms give more morphologically correct tokenisations, in particular when handling prefixes. Given the results of our experiments, we advocate for always treating spaces as individual tokens as an improved tokenisation method.
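The core modification described above happens at the pre-tokenisation stage: instead of letting a space attach to the following word (as in GPT-2-style BPE, where "redo" and " redo" map to different tokens), every space becomes its own token, so a string like "redo" is represented identically whether or not it starts a word. The following is a minimal illustrative sketch of such a pre-tokeniser, not the authors' actual implementation; the function name and interface are hypothetical, and a real BPE or Unigram model would then merge subwords only within the non-space chunks.

```python
def pretokenise_spaces_as_tokens(text):
    """Split text so that each space is emitted as its own token.

    Subword algorithms (BPE/Unigram) would subsequently operate only
    within the word chunks, so equivalent strings receive the same
    tokenisation regardless of their position within the sentence.
    """
    tokens = []
    current = ""
    for ch in text:
        if ch == " ":
            if current:
                tokens.append(current)
                current = ""
            tokens.append(" ")  # space is always a standalone token
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


# "redo" is the same unit word-initially and sentence-internally:
print(pretokenise_spaces_as_tokens("redo undo"))
# ['redo', ' ', 'undo']
```

Under this scheme a word-internal substring such as a prefix is always seen in the same form by the subword learner, which is consistent with the abstract's finding that the modified algorithms handle prefixes more morphologically correctly.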