Unsupervised Semantic Hashing with Pairwise Reconstruction

Paper Title

Unsupervised Semantic Hashing with Pairwise Reconstruction

Authors

Casper Hansen, Christian Hansen, Jakob Grue Simonsen, Stephen Alstrup, Christina Lioma

Abstract

Semantic Hashing is a popular family of methods for efficient similarity search in large-scale datasets. In Semantic Hashing, documents are encoded as short binary vectors (i.e., hash codes), such that semantic similarity can be efficiently computed using the Hamming distance. Recent state-of-the-art approaches have utilized weak supervision to train better performing hashing models. Inspired by this, we present Semantic Hashing with Pairwise Reconstruction (PairRec), which is a discrete variational autoencoder based hashing model. PairRec first encodes weakly supervised training pairs (a query document and a semantically similar document) into two hash codes, and then learns to reconstruct the same query document from both of these hash codes (i.e., pairwise reconstruction). This pairwise reconstruction enables our model to encode local neighbourhood structures within the hash code directly through the decoder. We experimentally compare PairRec to traditional and state-of-the-art approaches, and obtain significant performance improvements in the task of document similarity search.
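
The abstract notes that similarity between hash codes can be computed efficiently with the Hamming distance. The following is a minimal sketch of that lookup step only, not the PairRec model itself. The 8-bit codes, the document ids, and the function names `hamming_distance` and `nearest_documents` are hypothetical and chosen purely for illustration; in practice the codes would come from a trained encoder.

```python
from typing import Dict, List


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hash codes differ."""
    # XOR leaves a 1 wherever the two codes disagree; count those bits.
    return bin(a ^ b).count("1")


def nearest_documents(query_code: int, doc_codes: Dict[str, int], k: int = 2) -> List[str]:
    """Return the ids of the k documents closest to the query in Hamming space."""
    # Ties are broken by document id so the ranking is deterministic.
    ranked = sorted(
        doc_codes,
        key=lambda doc_id: (hamming_distance(query_code, doc_codes[doc_id]), doc_id),
    )
    return ranked[:k]


# Hypothetical 8-bit hash codes (assumed values, not taken from the paper).
doc_codes = {
    "doc_a": 0b10110010,
    "doc_b": 0b10110110,
    "doc_c": 0b01001101,
}
query_code = 0b10110011

print(nearest_documents(query_code, doc_codes))
```

Storing each code as an integer keeps the comparison to one XOR and one bit count, which is why short binary codes suit large-scale similarity search.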
