Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems

Paper Title

Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems

Authors

Zeyan Li, Nengwen Zhao, Mingjie Li, Xianglin Lu, Lixin Wang, Dongdong Chang, Xiaohui Nie, Li Cao, Wenzhi Zhang, Kaixin Sui, Yanhua Wang, Xu Du, Guoqiang Duan, Dan Pei

Abstract

Fault localization is challenging in an online service system due to its monitoring data's large volume and variety and complex dependencies across or within its components (e.g., services or databases). Furthermore, engineers require fault localization solutions to be actionable and interpretable, which existing research approaches cannot satisfy. Therefore, the common industry practice is that, for a specific online service system, its experienced engineers focus on localization for recurring failures based on the knowledge accumulated about the system and historical failures. Although the above common practice is actionable and interpretable, it is largely manual, thus slow and sometimes inaccurate. In this paper, we aim to automate this practice through machine learning. That is, we propose an actionable and interpretable fault localization approach, DejaVu, for recurring failures in online service systems. For a specific online service system, DejaVu takes historical failures and dependencies in the system as input and trains a localization model offline; for an incoming failure, the trained model online recommends where the failure occurs (i.e., the faulty components) and which kind of failure occurs (i.e., the indicative group of metrics) (thus actionable), which are further interpreted by both global and local interpretation methods (thus interpretable). Based on the evaluation on 601 failures from three production systems and one open-source benchmark, in less than one second, DejaVu can on average rank the ground truths at 1.66-th to 5.03-th among a long candidate list, outperforming baselines by at least 51.51%.
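The abstract reports results as the average position of the ground truth in a ranked candidate list (1.66-th to 5.03-th). As a rough illustration of how such a number can be computed, the Python sketch below averages the 1-based rank of the first correct candidate over several failures. The function names, the example component names, and the penalty rank for a missed ground truth are all assumptions made for illustration; the sketch is not taken from the paper and may differ from the exact metric the authors use.

```python
from statistics import mean


def ground_truth_rank(ranked_candidates, ground_truths):
    """Return the 1-based position of the first ground-truth item in the ranking.

    Assumption: if no ground truth appears, the rank is one past the end of the list.
    """
    for position, candidate in enumerate(ranked_candidates, start=1):
        if candidate in ground_truths:
            return position
    return len(ranked_candidates) + 1


def average_rank(failures):
    """Average the ground-truth rank over (ranked_candidates, ground_truths) pairs."""
    return mean(ground_truth_rank(ranked, truths) for ranked, truths in failures)


# Hypothetical failures: each pairs a model's ranked candidates with the true faulty set.
example_failures = [
    (["db-1", "svc-a", "svc-b"], {"svc-a"}),  # ground truth ranked 2nd
    (["svc-b", "db-1", "svc-a"], {"svc-b"}),  # ground truth ranked 1st
]

print(average_rank(example_failures))  # prints 1.5
```

A lower value means engineers need to inspect fewer candidates before reaching the actual faulty component.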
