Paper Title
Learning Over Dirty Data Without Cleaning
Paper Authors
Abstract
Real-world datasets are dirty and contain many errors. Examples of these issues are violations of integrity constraints, duplicates, and inconsistencies in representing data values and entities. Learning over dirty databases may result in inaccurate models. Users have to spend a great deal of time and effort to repair data errors and create a clean database for learning. Moreover, as the information required to repair these errors is not often available, there may be numerous possible clean versions for a dirty database. We propose DLearn, a novel relational learning system that learns directly over dirty databases effectively and efficiently without any preprocessing. DLearn leverages database constraints to learn accurate relational models over inconsistent and heterogeneous data. Its learned models represent patterns over all possible clean instances of the data in a usable form. Our empirical study indicates that DLearn learns accurate models over large real-world databases efficiently.
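The abstract's claim that a dirty database may have numerous possible clean versions can be illustrated with a small sketch. This is not DLearn's algorithm. It assumes a hypothetical two-column table, a hypothetical functional dependency (title determines year), and a simple repair strategy that keeps exactly one of the observed years per title.

```python
from itertools import product

# Hypothetical dirty table: the assumed functional dependency
# title -> year is violated by the two "Star Wars" rows.
dirty_rows = [
    ("Star Wars", 1977),
    ("Star Wars", 1978),
    ("Alien", 1979),
]


def possible_repairs(rows):
    """Enumerate every clean instance obtained by keeping one year per title."""
    candidates = {}
    for title, year in rows:
        candidates.setdefault(title, set()).add(year)

    titles = sorted(candidates)
    # One list of candidate years per title, in a deterministic order.
    choices = [sorted(candidates[t]) for t in titles]

    # Each combination of choices is one possible clean instance.
    return [dict(zip(titles, combo)) for combo in product(*choices)]


repairs = possible_repairs(dirty_rows)
for instance in repairs:
    print(instance)
```

The number of clean instances is the product of the number of candidate values per conflicting key, so it grows quickly with the number of violations. This is why a model that captures patterns across all clean instances, rather than one arbitrarily chosen repair, is attractive.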