Simulating Name-like Vectors for Testing Large-scale Entity Resolution

Paper title

Simulating Name-like Vectors for Testing Large-scale Entity Resolution

Authors

Herath, Samudra; Roughan, Matthew; Glonek, Gary

Abstract

Accurate and efficient entity resolution (ER) has been a problem in data analysis and data mining projects for decades. In our work, we are interested in developing ER methods to handle big data. Good public datasets are scarce in this area and usually small. Simulation is one technique for generating test datasets. Existing simulation tools suffer from complexity, poor scalability, and the limitations of resampling. We address these problems by introducing a better way of simulating test data for big data ER. Our proposed simulation model is simple, inexpensive and fast. We avoid detail-level simulation of records by using a simple vector representation. In this paper, we discuss how to simulate simple vectors that approximate the properties of names (commonly used as identification keys).
