Paper Title

DiagNet: towards a generic, Internet-scale root cause analysis solution

Authors

Loïck Bonniot, Christoph Neumann, François Taïani

Abstract

Diagnosing problems in Internet-scale services remains particularly difficult and costly for both content providers and ISPs. Because the Internet is decentralized, the cause of such problems might lie anywhere between an end-user's device and the service datacenters. Further, the set of possible problems and causes is not known in advance, making it impossible in practice to train a classifier with all combinations of problems, causes and locations. In this paper, we explore how different machine learning techniques can be used for Internet-scale root cause analysis using measurements taken from end-user devices. We show how to build generic models that (i) are agnostic to the underlying network topology, (ii) do not require defining the full set of possible causes during training, and (iii) can be quickly adapted to diagnose new services. Our solution, DiagNet, adapts concepts from image processing research to handle network and system metrics. We evaluate DiagNet with a multi-cloud deployment of online services with injected faults and emulated clients with automated browsers. We demonstrate promising root cause analysis capabilities, with a recall of 73.9%, including causes introduced only at inference time.
