Paper Title

Characterizing and Mitigating Anti-patterns of Alerts in Industrial Cloud Systems

Authors

Tianyi Yang, Jiacheng Shen, Yuxin Su, Xiaoxue Ren, Yongqiang Yang, Michael R. Lyu

Abstract

Alerts are crucial for requesting prompt human intervention upon cloud anomalies. The quality of alerts significantly affects cloud reliability and the cloud provider's business revenue. In practice, we observe on-call engineers being hindered from quickly locating and fixing faulty cloud services by the abundance of misleading, non-informative, and non-actionable alerts. We call the ineffectiveness of alerts "anti-patterns of alerts". To better understand these anti-patterns and provide actionable measures to mitigate them, we conduct the first empirical study on the practices of mitigating anti-patterns of alerts in an industrial cloud system. We study the alert strategies and the alert processing procedure at Huawei Cloud, a leading cloud provider. Our study combines a quantitative analysis of millions of alerts over two years with a survey of eighteen experienced engineers. As a result, we summarize four individual anti-patterns and two collective anti-patterns of alerts. We also summarize four current reactions to mitigate the anti-patterns of alerts, along with general preventative guidelines for the configuration of alert strategies. Lastly, we propose to explore the automatic evaluation of the Quality of Alerts (QoA), including the indicativeness, precision, and handleability of alerts, as a future research direction that assists in the automatic detection of alerts' anti-patterns. The findings of our study are valuable for optimizing cloud monitoring systems and improving the reliability of cloud services.
