Approximating Aggregated SQL Queries With LSTM Networks

Paper title

Approximating Aggregated SQL Queries With LSTM Networks

Authors

Nir Regev, Lior Rokach, Asaf Shabtai

Abstract

Despite continuous investments in data technologies, the latency of querying data still poses a significant challenge. Modern analytic solutions require near real-time responsiveness, both to make them interactive and to support automated processing. Current technologies (Hadoop, Spark, Dataflow) scan the dataset to execute queries. They focus on providing scalable data storage to maximize task execution speed. We argue that these solutions fail to offer an adequate level of interactivity since they depend on continual access to data. In this paper we present a method for query approximation, also known as approximate query processing (AQP), that reduces the need to scan data during inference (query calculation), thus enabling a rapid query processing tool. We use an LSTM network to learn the relationship between queries and their results, and to provide a rapid inference layer for predicting query results. Our method (referred to as "Hunch") produces a lightweight LSTM network which provides high query throughput. We evaluated our method using twelve datasets and compared it to state-of-the-art AQP engines (VerdictDB, BlinkDB) from query latency, model weight, and accuracy perspectives. The results show that our method predicted queries' results with a normalized root mean squared error (NRMSE) ranging from approximately 1% to 4%, which in the majority of our datasets was better than the compared benchmarks. Moreover, our method was able to predict up to 120,000 queries in a second (streamed together), with a single-query latency of no more than 2 ms.
