Paper Title
Self-Supervised Pretraining of Graph Neural Network for the Retrieval of Related Mathematical Expressions in Scientific Articles
Paper Authors
Paper Abstract
Given the increase in publications, searching for relevant papers becomes tedious. In particular, search across disciplines or schools of thought is not supported. This is mainly due to retrieval with keyword queries: technical terms differ across sciences and over time. Relevant articles might be better identified by their mathematical problem descriptions. Just looking at the equations in a paper already hints at whether the paper is relevant. Hence, we propose a new machine-learning-based approach for the retrieval of mathematical expressions. We design an unsupervised representation learning task that combines embedding learning with self-supervised learning. Using graph convolutional neural networks, we embed mathematical expressions into a low-dimensional vector space that allows efficient nearest-neighbor queries. To train our models, we collected a huge dataset of over 29 million mathematical expressions from over 900,000 publications on arXiv.org. The math is converted into an XML format, which we view as graph data. Our empirical evaluation, involving a new dataset of manually annotated search queries, shows the benefits of using embedding models for mathematical retrieval. This work was originally published at KDD 2020.
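To make the pipeline described in the abstract concrete, here is a minimal sketch (not the authors' implementation) of its two data-handling steps: parsing a MathML expression into the node/edge graph form that a graph neural network would consume, and running a cosine-similarity nearest-neighbor query over expression embeddings. The `mathml_to_graph` helper and the toy embedding vectors are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
import numpy as np

def mathml_to_graph(mathml: str):
    """Parse a MathML string into (node_labels, edges): the graph view
    of an expression that a graph neural network could take as input.
    (Illustrative helper, not the paper's actual preprocessing.)"""
    root = ET.fromstring(mathml)
    labels, edges = [], []

    def visit(elem, parent_idx):
        idx = len(labels)
        tag = elem.tag.split('}')[-1]  # strip XML namespace if present
        # Node label: element tag plus any literal text (e.g. an identifier).
        labels.append((tag, (elem.text or '').strip()))
        if parent_idx is not None:
            edges.append((parent_idx, idx))
        for child in elem:
            visit(child, idx)

    visit(root, None)
    return labels, edges

# a^2 + b^2 in (simplified) Presentation MathML
mathml = ("<mrow><msup><mi>a</mi><mn>2</mn></msup>"
          "<mo>+</mo>"
          "<msup><mi>b</mi><mn>2</mn></msup></mrow>")
labels, edges = mathml_to_graph(mathml)
print(len(labels), len(edges))  # 8 nodes, 7 edges for this expression tree

# Toy nearest-neighbor query over hypothetical 2-d expression embeddings;
# a trained model would produce such vectors for millions of expressions.
embeddings = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
query = np.array([0.95, 0.05])
sims = embeddings @ query / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query))
print(int(np.argmax(sims)))  # index 0: the most similar expression
```

In practice the nearest-neighbor step would run over millions of vectors, so an approximate index rather than a brute-force dot product would be used; the principle is the same.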