Paper title
Embedding-based Zero-shot Retrieval through Query Generation
Paper authors
Paper abstract
Passage retrieval addresses the problem of locating relevant passages, usually from a large corpus, given a query. In practice, lexical term-matching algorithms like BM25 are popular choices for retrieval owing to their efficiency. However, term-based matching algorithms often miss relevant passages that have no lexical overlap with the query, and they cannot be fine-tuned to downstream datasets. In this work, we consider the embedding-based two-tower architecture as our neural retrieval model. Since labeled data can be scarce and because neural retrieval models require vast amounts of data to train, we propose a novel method for generating synthetic training data for retrieval. Our system produces remarkable results, significantly outperforming BM25 on 5 out of 6 datasets tested, by an average of 2.45 points for Recall@1. In some cases, our model trained on synthetic data can even outperform the same model trained on real data.
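
To make the abstract's terms concrete, the following is a minimal sketch of how a two-tower retriever ranks passages and how Recall@1 could be computed. It is not the paper's implementation. It assumes that queries and passages have already been embedded by their respective towers, that relevance is scored by a dot product, and that each query has exactly one gold passage. The function names `dot`, `top_k`, and `recall_at_k`, and the toy vectors, are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors (assumed relevance score)."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec: Vector, passage_vecs: List[Vector], k: int) -> List[int]:
    """Return the indices of the k highest-scoring passages for one query."""
    scores = [dot(query_vec, p) for p in passage_vecs]
    ranked = sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def recall_at_k(
    query_vecs: List[Vector],
    passage_vecs: List[Vector],
    gold: List[int],
    k: int = 1,
) -> float:
    """Fraction of queries whose single gold passage appears in the top k."""
    hits = sum(
        1 for q, g in zip(query_vecs, gold) if g in top_k(q, passage_vecs, k)
    )
    return hits / len(query_vecs)


# Toy embeddings standing in for the outputs of the passage and query towers.
passages = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9], [0.2, 1.0]]
gold = [0, 1, 2]  # index of the relevant passage for each query

print(recall_at_k(queries, passages, gold, k=1))
print(recall_at_k(queries, passages, gold, k=2))
```

In this toy setup the third query scores passage 1 above its gold passage 2, so it counts as a miss at k=1 and a hit at k=2. A real system would replace the exhaustive scoring loop with an approximate nearest-neighbor index over the passage embeddings.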