Paper Title
On Efficient Low Distortion Ultrametric Embedding
Paper Authors
Paper Abstract
A classic problem in unsupervised learning and data analysis is to find simple and easy-to-visualize representations of the data that preserve its essential properties. A widely used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric. The most popular algorithms for this task are the classic linkage algorithms (single, average, or complete). However, on a data set of $n$ points in $\Omega(\log n)$ dimensions, these methods exhibit a quite prohibitive running time of $\Theta(n^2)$. In this paper, we provide a new algorithm which takes as input a set of points $P$ in $\mathbb{R}^d$, and for every $c \ge 1$, runs in time $n^{1+\frac{\rho}{c^2}}$ (for some universal constant $\rho > 1$) to output an ultrametric $\Delta$ such that for any two points $u, v$ in $P$, the distance $\Delta(u,v)$ is within a multiplicative factor of $5c$ of the distance between $u$ and $v$ in the "best" ultrametric representation of $P$. Here, the best ultrametric is the ultrametric $\tilde{\Delta}$ that minimizes the maximum distance distortion with respect to the $\ell_2$ distance, namely the one that minimizes $\max_{u,v \in P} \frac{\tilde{\Delta}(u,v)}{\|u-v\|_2}$. We complement the above result by showing that, under popular complexity-theoretic assumptions, for every constant $\varepsilon > 0$, no algorithm with running time $n^{2-\varepsilon}$ can distinguish between inputs in the $\ell_\infty$ metric that admit an isometric ultrametric embedding and those that incur a distortion of $\frac{3}{2}$. Finally, we present an empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
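To make the distortion objective concrete, the sketch below builds the classic $\Theta(n^2)$ single-linkage ultrametric with SciPy and measures its maximum multiplicative distortion against the $\ell_2$ distances. This is a minimal illustration under our own assumptions (SciPy's linkage routines, synthetic Gaussian data), not the paper's $n^{1+\rho/c^2}$ algorithm.

```python
# Minimal sketch: the Theta(n^2) single-linkage baseline from the abstract,
# used here only to illustrate the max-distortion objective. This is NOT the
# paper's subquadratic algorithm; data and names below are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
P = rng.standard_normal((200, 16))   # n = 200 points in R^16

d2 = pdist(P)                        # condensed pairwise l2 distances
Z = linkage(d2, method="single")     # classic Theta(n^2) linkage baseline
ultra = cophenet(Z)                  # cophenetic (ultrametric) distances

# Single-linkage merge heights never exceed the direct l2 distance, so
# d2 / ultra >= 1 for every pair, with equality at the first merged pair.
# Scaling the ultrametric by c = max(d2 / ultra) therefore yields an
# ultrametric that dominates the l2 metric, with maximum multiplicative
# distortion max(c * ultra / d2) = c.
c = np.max(d2 / ultra)
dominating_ultra = c * ultra
distortion = np.max(dominating_ultra / d2)   # equals c
print(f"distortion of the scaled single-linkage ultrametric: {distortion:.3f}")
```

The abstract's contribution, in these terms, is to trade exactness for speed: rather than paying $\Theta(n^2)$ for the linkage construction above, the new algorithm runs in time $n^{1+\rho/c^2}$ and outputs an ultrametric whose distances are within a factor $5c$ of the best achievable ones.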