Paper title
Amalur: Data Integration Meets Machine Learning
Paper authors
Paper abstract
The data needed for machine learning (ML) model training can reside in separate sites, often termed data silos. For data-intensive ML applications, data silos pose a major challenge: integrating and transforming the data demands considerable manual work and computational resources. Under data privacy and security constraints, data often cannot leave the local sites, so a model has to be trained in a decentralized manner. In this work, we present a vision of how to bridge traditional data integration (DI) techniques with the requirements of modern machine learning. We explore the possibilities of using metadata obtained from data integration processes to improve the effectiveness and efficiency of ML models. We analyze two common use cases over data silos: feature augmentation and federated learning. Bringing data integration and machine learning together, we highlight new research opportunities in systems, representations, factorized learning, and federated learning.