Paper Title

Deploying a sharded MongoDB cluster as a queued job on a shared HPC architecture

Authors

Aaron Saxton, Stephen Squaire

Abstract

Data stores are the foundation on which data science, in all its variations, is built. They provide a queryable interface to structured and unstructured data. Data science often starts by leveraging these query features to perform initial data preparation. However, most data stores are designed to run continuously, servicing disparate user requests with little or no downtime. Many HPC architectures process user requests through a job queue scheduler and maintain a shared filesystem to store a job's persistent data. We deploy a MongoDB sharded cluster with a run script that is designed to run a data science workload concurrently. As our test piece, we run data ingest and data queries to measure performance with different configurations on the Blue Waters supercomputer.
