Paper Title
Seagull: An Infrastructure for Load Prediction and Optimized Resource Allocation
Authors
Abstract
Microsoft Azure is dedicated to guaranteeing high quality of service to its customers, in particular during periods of high customer activity, while controlling cost. We employ a Data Science (DS) driven solution to predict user load and leverage these predictions to optimize resource allocation. To this end, we built the Seagull infrastructure, which processes per-server telemetry, validates the data, and trains and deploys ML models. The models are used to predict customer load per server (24h into the future) and to optimize service operations. Seagull continually re-evaluates the accuracy of predictions, falls back to previously known good models, and triggers alerts as appropriate. We deployed this infrastructure in production for PostgreSQL and MySQL servers across all Azure regions, and applied it to the problem of scheduling server backups during low-load time. This minimizes interference with user-induced load and improves customer experience.
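The abstract does not say how a backup slot is chosen from a forecast, so the following is only a minimal sketch of one plausible approach: given a per-server load forecast for the next 24 hours, pick the contiguous window with the lowest mean predicted load. The function name `lowest_load_window`, the hourly granularity, the two-hour backup duration, and the forecast values are all assumptions made up for illustration; none of them come from the paper.

```python
from typing import Sequence, Tuple


def lowest_load_window(forecast: Sequence[float], window: int) -> Tuple[int, float]:
    """Return (start_index, mean_load) of the contiguous window of length
    `window` with the lowest mean predicted load. Ties go to the earliest window."""
    if window <= 0 or window > len(forecast):
        raise ValueError("window must be between 1 and len(forecast)")
    # Sliding-window sum: add the entering sample, drop the leaving one.
    current = sum(forecast[:window])
    best_start, best_sum = 0, current
    for start in range(1, len(forecast) - window + 1):
        current += forecast[start + window - 1] - forecast[start - 1]
        if current < best_sum:
            best_start, best_sum = start, current
    return best_start, best_sum / window


# Hypothetical hourly forecast (percent CPU) for one server, next 24 hours.
# These numbers are invented for the example, not measured data.
hourly_forecast = [35, 30, 22, 12, 10, 11, 18, 40, 55, 60, 62, 65,
                   63, 61, 58, 57, 59, 64, 66, 60, 52, 45, 40, 38]

# Assume the backup needs two consecutive hours.
start_hour, mean_load = lowest_load_window(hourly_forecast, window=2)
print(f"schedule backup at hour {start_hour}, mean predicted load {mean_load}")
```

With this made-up forecast the sketch selects the window starting at hour 4. A production system would also need to decide what to do when the forecast is judged unreliable, which the abstract covers only at the level of falling back to previously known good models.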