Paper Title
On Efficient Low Distortion Ultrametric Embedding
Paper Authors
Paper Abstract
A classic problem in unsupervised learning and data analysis is to find simple and easy-to-visualize representations of the data that preserve its essential properties. A widely used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric. The most popular algorithms for this task are the classic linkage algorithms (single, average, or complete). However, on a data set of $n$ points in $\Omega(\log n)$ dimensions, these methods exhibit a quite prohibitive running time of $\Theta(n^2)$. In this paper, we provide a new algorithm which takes as input a set of points $P$ in $\mathbb{R}^d$, and for every $c \ge 1$, runs in time $n^{1+\frac{\rho}{c^2}}$ (for some universal constant $\rho > 1$) to output an ultrametric $\Delta$ such that for any two points $u, v$ in $P$, the distance $\Delta(u,v)$ is within a multiplicative factor of $5c$ of the distance between $u$ and $v$ in the "best" ultrametric representation of $P$. Here, the best ultrametric is the ultrametric $\tilde{\Delta}$ that minimizes the maximum distance distortion with respect to the $\ell_2$ distance, namely the one that minimizes $\max_{u,v \in P} \frac{\tilde{\Delta}(u,v)}{\|u-v\|_2}$. We complement the above result by showing that, under popular complexity-theoretic assumptions, for every constant $\varepsilon > 0$, no algorithm with running time $n^{2-\varepsilon}$ can distinguish between inputs in the $\ell_\infty$ metric that admit an isometric ultrametric embedding and those that incur a distortion of $\frac{3}{2}$. Finally, we present an empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
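To make the distortion objective concrete, the sketch below builds the classic $\Theta(n^2)$ single-linkage ultrametric with SciPy and measures its maximum multiplicative distortion against the $\ell_2$ distances. This is a minimal illustration under our own assumptions (SciPy's linkage routines, synthetic Gaussian data), not the paper's $n^{1+\rho/c^2}$ algorithm.

```python
# Minimal sketch: the Theta(n^2) single-linkage baseline from the abstract,
# used here only to illustrate the max-distortion objective. This is NOT the
# paper's subquadratic algorithm; data and names below are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
P = rng.standard_normal((200, 16))   # n = 200 points in R^16

d2 = pdist(P)                        # condensed pairwise l2 distances
Z = linkage(d2, method="single")     # classic Theta(n^2) linkage baseline
ultra = cophenet(Z)                  # cophenetic (ultrametric) distances

# Single-linkage merge heights never exceed the direct l2 distance, so
# d2 / ultra >= 1 for every pair, with equality at the first merged pair.
# Scaling the ultrametric by c = max(d2 / ultra) therefore yields an
# ultrametric that dominates the l2 metric, with maximum multiplicative
# distortion max(c * ultra / d2) = c.
c = np.max(d2 / ultra)
dominating_ultra = c * ultra
distortion = np.max(dominating_ultra / d2)   # equals c
print(f"distortion of the scaled single-linkage ultrametric: {distortion:.3f}")
```

The abstract's contribution, in these terms, is to trade exactness for speed: rather than paying $\Theta(n^2)$ for the linkage construction above, the new algorithm runs in time $n^{1+\rho/c^2}$ and outputs an ultrametric whose distances are within a factor $5c$ of the best achievable ones.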