Paper Title

A new hashing based nearest neighbors selection technique for big datasets

Authors

Tchaye-Kondi, Jude, Zhai, Yanlong, Zhu, Liehuang

Abstract

KNN has the reputation of being one of the simplest yet effective supervised learning algorithms, used for either classification or regression. KNN's prediction efficiency depends heavily on the size of its training data: as the training data grows, KNN suffers from slow decision making, since it must search for nearest neighbors within the entire dataset for every decision. This paper proposes a new technique that enables the selection of nearest neighbors directly in the neighborhood of a given observation. The proposed approach divides the data space into subcells of a virtual grid built on top of the data space. The mapping between data points and subcells is performed using hashing. To select the nearest neighbors of a given observation, we first identify the cell the observation belongs to by hashing, and then look for nearest neighbors starting from that central cell and the cells around it, layer by layer. In our experimental performance analysis on publicly available datasets, our algorithm outperforms the original KNN in time efficiency with prediction quality as good as that of KNN, and it also offers competitive performance with solutions such as KDtree.
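The hashing and layer-by-layer cell expansion described in the abstract can be sketched in Python. This is a minimal illustration assuming 2D points and a uniform grid; the function names (`build_grid`, `knn_grid`) and the `cell_size` parameter are illustrative assumptions, not the paper's actual implementation:

```python
import math
from collections import defaultdict

def build_grid(points, cell_size):
    """Hash each 2D point to the index of the grid subcell it falls into.
    The integer cell index acts as the hash key, so locating a point's
    cell is O(1)."""
    grid = defaultdict(list)
    for x, y in points:
        grid[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return grid

def knn_grid(grid, cell_size, query, k):
    """Gather candidates from the query's own cell, then expand outward
    ring by ring until at least k candidates have been collected."""
    cx, cy = int(query[0] // cell_size), int(query[1] // cell_size)
    # Farthest occupied ring; bounds the expansion when data is sparse.
    max_layer = max((max(abs(gx - cx), abs(gy - cy)) for gx, gy in grid),
                    default=0)
    candidates, layer = [], 0
    while len(candidates) < k and layer <= max_layer:
        for dx in range(-layer, layer + 1):
            for dy in range(-layer, layer + 1):
                if max(abs(dx), abs(dy)) == layer:  # cells on the new ring only
                    candidates.extend(grid.get((cx + dx, cy + dy), []))
        layer += 1
    candidates.sort(key=lambda p: math.dist(p, query))
    return candidates[:k]
```

Note that this sketch stops expanding as soon as k candidates are found, which can miss a marginally closer point sitting just across a cell boundary in the next ring; a production implementation would check one extra layer (or compare against the ring's boundary distance) before returning.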
