Paper Title

Towards Schema Inference for Data Lakes

Authors

Nour Alhammad, Alex Bogatu, Norman W. Paton

Abstract

A data lake is a repository of data with potential for future analysis. However, both discovering what data is in a data lake and exploring related data sets can take significant effort, as a data lake can contain an intimidating amount of heterogeneous data. In this paper, we propose the use of schema inference to support the interpretation of the data in the data lake. If a data lake is to support a schema-on-read paradigm, understanding the existing schema of relevant portions of the data lake seems like a prerequisite. In this paper, we make use of approximate indexes that can be used for data discovery to inform the inference of a schema for a data lake, consisting of entity types and the relationships between them. The specific approach identifies candidate entity types by clustering similar data sets from the data lake, and then relationships between data sets in different clusters are used to inform the identification of relationships between the entity types. The approach is evaluated using real-world data repositories, to identify where the proposal is effective, and to inform the identification of areas for further work.
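The two steps named in the abstract (cluster similar data sets into candidate entity types, then lift links between data sets to relationships between those types) can be illustrated with a small sketch. This is not the authors' implementation: it assumes each data set is described only by its attribute names, uses exact Jaccard similarity in place of the approximate indexes the paper relies on, and takes the dataset-level links as given input. The names `cluster_datasets`, `infer_relationships`, the threshold of 0.5, and the sample data are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Exact Jaccard similarity between two collections of attribute names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_datasets(schemas, threshold=0.5):
    """Group data sets into candidate entity types.

    Two data sets end up in the same cluster if they are connected by a
    chain of pairs whose similarity meets the threshold (union-find).
    """
    parent = {name: name for name in schemas}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(schemas, 2):
        if jaccard(schemas[a], schemas[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in schemas:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in clusters.values())


def infer_relationships(clusters, links):
    """Lift dataset-level links to relationships between clusters."""
    cluster_of = {name: i for i, members in enumerate(clusters) for name in members}
    relationships = set()
    for a, b in links:
        ca, cb = cluster_of[a], cluster_of[b]
        if ca != cb:  # links inside one cluster do not relate two entity types
            relationships.add((min(ca, cb), max(ca, cb)))
    return sorted(relationships)


# Hypothetical data lake: attribute names per data set.
schemas = {
    "customers_2020": ["customer_id", "name", "city"],
    "customers_2021": ["customer_id", "name", "city", "email"],
    "orders_q1": ["order_id", "customer_id", "total"],
    "orders_q2": ["order_id", "customer_id", "total"],
}
# Hypothetical dataset-level links, e.g. found through joinable columns.
links = [
    ("orders_q1", "customers_2020"),
    ("orders_q2", "customers_2021"),
    ("orders_q1", "orders_q2"),
]

clusters = cluster_datasets(schemas, threshold=0.5)
relationships = infer_relationships(clusters, links)
print(clusters)
print(relationships)
```

In this toy input the two customer tables and the two order tables each form one candidate entity type, and the cross-cluster links collapse into a single relationship between them. A real data lake would need approximate similarity search, since comparing every pair of data sets exactly does not scale.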
