Paper Title

Text and Code Embeddings by Contrastive Pre-Training

Paper Authors

Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov, Joanne Jang, Peter Welinder, Lilian Weng

Paper Abstract

Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models. On linear-probe classification accuracy averaging over 7 tasks, our best unsupervised model achieves a relative improvement of 4% and 1.8% over previous best unsupervised and supervised text embedding models respectively. The same text embeddings when evaluated on large-scale semantic search attains a relative improvement of 23.4%, 14.7%, and 10.6% over previous best unsupervised methods on MSMARCO, Natural Questions and TriviaQA benchmarks, respectively. Similarly to text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over prior best work on code search.
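To make the core idea concrete, below is a minimal sketch of a symmetric contrastive objective with in-batch negatives, the kind of pre-training loss the abstract refers to for paired inputs such as neighboring text spans or (text, code) pairs. It is an illustrative assumption, not the paper's released code; the function name, tensor shapes, and temperature value are placeholders.

```python
# Minimal sketch of contrastive pre-training with in-batch negatives.
# Names and hyperparameters here are illustrative assumptions, not the
# authors' implementation.
import torch
import torch.nn.functional as F

def contrastive_loss(embed_a: torch.Tensor,
                     embed_b: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """embed_a, embed_b: [batch, dim] embeddings of paired inputs,
    e.g. (query, passage) or (docstring, code) pairs."""
    # L2-normalize so the dot product becomes cosine similarity.
    a = F.normalize(embed_a, dim=-1)
    b = F.normalize(embed_b, dim=-1)
    # Similarity of every item in A against every item in B; the diagonal
    # holds the true pairs, and all off-diagonal entries act as negatives.
    logits = a @ b.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: match A -> B and B -> A.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

The resulting frozen embeddings can then be evaluated as the abstract describes, for example by training a linear classifier ("linear probe") on top of them, or by ranking corpus embeddings against a query embedding for semantic search.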
