Paper Title
ABNIRML: Analyzing the Behavior of Neural IR Models
Paper Authors
Paper Abstract
Pretrained contextualized language models such as BERT and T5 have established a new state-of-the-art for ad-hoc search. However, it is not yet well understood why these methods are so effective, what makes some variants more effective than others, and what pitfalls they may have. We present a new comprehensive framework for Analyzing the Behavior of Neural IR ModeLs (ABNIRML), which includes new types of diagnostic probes that allow us to test several characteristics -- such as writing style, factuality, and sensitivity to paraphrasing and word order -- that are not addressed by previous techniques. To demonstrate the value of the framework, we conduct an extensive empirical study that yields insights into the factors contributing to the neural models' gains and identifies potential unintended biases the models exhibit. Some of our results confirm conventional wisdom, for example that recent neural ranking models rely less on exact term overlap with the query and instead leverage richer linguistic information, as evidenced by their higher sensitivity to word and sentence order. Other results are more surprising, such as that some models (e.g., T5 and ColBERT) are biased towards factually correct (rather than merely relevant) texts. Further, some characteristics vary even across models built on the same base language model, and others can emerge from random variation during training.
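The abstract does not spell out how a diagnostic probe is scored, so the following is a minimal sketch of the general idea rather than the paper's implementation: a word-order probe perturbs each document by shuffling its words and measures how much a ranker's relevance score drops. The `Scorer` type, `word_order_probe`, and the toy `overlap_scorer` are hypothetical names introduced here for illustration.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interface: any function mapping (query, document) -> relevance score.
Scorer = Callable[[str, str], float]

def shuffle_words(text: str, seed: int = 0) -> str:
    """Return the document with its words randomly reordered."""
    words = text.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)

def word_order_probe(score: Scorer, pairs: List[Tuple[str, str]]) -> float:
    """Mean score drop when word order is destroyed.

    Larger positive values indicate greater sensitivity to word order;
    values near zero indicate order-invariant (bag-of-words-like) behavior.
    """
    deltas = [
        score(query, doc) - score(query, shuffle_words(doc))
        for query, doc in pairs
    ]
    return sum(deltas) / len(deltas)

if __name__ == "__main__":
    # Toy order-invariant scorer: counts query-term occurrences in the document,
    # so shuffling the document should not change its score at all.
    def overlap_scorer(query: str, doc: str) -> float:
        q_terms = set(query.lower().split())
        return sum(1.0 for w in doc.lower().split() if w in q_terms)

    pairs = [("neural ranking models",
              "neural models rank documents for ad-hoc search")]
    print(word_order_probe(overlap_scorer, pairs))  # -> 0.0
```

Because the toy `overlap_scorer` is a bag-of-words function, the probe returns 0.0 for it; a contextualized ranker that attends to word order, such as a BERT- or T5-based cross-encoder, would be expected to show a positive mean drop under the same probe.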