Paper Title
Journey of Migrating Millions of Queries on The Cloud
Paper Authors
Paper Abstract
Treasure Data processes millions of distributed SQL queries every day on the cloud. Upgrading the query engine service at this scale is challenging because we need to migrate all of our customers' production queries to a new version while preserving the correctness and performance of their data processing pipelines. To ensure the quality of the query engines, we use our query logs to build customer-specific benchmarks and replay these queries against real customer data in a secure pre-production environment. To simulate millions of queries, we need to minimize the test query sets effectively and report the simulation results clearly, so that we can proactively find incompatible changes and performance regressions in the new version. This paper describes the overall design of our system and shares various challenges in maintaining the quality of the query engine service on the cloud.