Paper Title
UniASM: Binary Code Similarity Detection without Fine-tuning
Paper Authors
Paper Abstract
Binary code similarity detection (BCSD) is widely used in binary analysis tasks such as vulnerability search, malware detection, clone detection, and patch analysis. Recent studies have shown that learning-based binary code embedding models outperform traditional feature-based approaches. However, previous studies have not delved deeply into the key factors that affect model performance. In this paper, we design extensive ablation studies to explore these influencing factors, and the experimental results provide many new insights. We innovate in both code representation and model selection: we propose a novel rich-semantic function representation technique to ensure the model captures the intricate nuances of binary code, and we introduce the first UniLM-based binary code embedding model, named UniASM, which includes two newly designed training tasks to learn representations of binary functions. The experimental results show that UniASM outperforms the state-of-the-art (SOTA) approaches on the evaluation datasets. Compared with the best baseline, the average Recall@1 scores on cross-compiler, cross-optimization-level, and cross-obfuscation tasks improve by 12.7%, 8.5%, and 22.3%, respectively. Moreover, in the real-world task of known-vulnerability search, UniASM outperforms all current baselines.
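The abstract reports Recall@1, the standard retrieval metric for embedding-based BCSD: a query function's embedding is compared against a pool of candidate embeddings, and the query counts as a hit only if its ground-truth counterpart is the single most similar candidate. The sketch below illustrates this metric with cosine similarity over synthetic vectors; the function names and the toy data are illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def recall_at_1(query_embs, pool_embs, ground_truth):
    """Recall@1: fraction of queries whose nearest pool embedding
    (by cosine similarity) is the ground-truth counterpart."""
    # L2-normalize so the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = q @ p.T                      # shape: (num_queries, pool_size)
    top1 = sims.argmax(axis=1)          # index of the most similar pool entry
    return float(np.mean(top1 == np.asarray(ground_truth)))

# Toy example: 3 query functions, a pool of 4 candidates; query i should
# retrieve pool entry [2, 0, 3] (e.g. the same function, different compiler).
rng = np.random.default_rng(0)
pool = rng.normal(size=(4, 8))
queries = pool[[2, 0, 3]] + 0.05 * rng.normal(size=(3, 8))  # slightly perturbed copies
print(recall_at_1(queries, pool, [2, 0, 3]))  # → 1.0
```

With such a small perturbation the nearest neighbor is always the original vector, so Recall@1 is 1.0; in the paper's cross-compiler and cross-obfuscation settings the embeddings of matching functions diverge far more, which is what makes the metric discriminative.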