Paper Title
Coresets for Clustering in Euclidean Spaces: Importance Sampling is Nearly Optimal
Paper Authors
Paper Abstract
Given a collection of $n$ points in $\mathbb{R}^d$, the goal of the $(k,z)$-clustering problem is to find a subset of $k$ "centers" that minimizes the sum of the $z$-th powers of the Euclidean distance of each point to its closest center. Special cases of the $(k,z)$-clustering problem include the $k$-median and $k$-means problems. Our main result is a unified two-stage importance sampling framework that constructs an $\varepsilon$-coreset for the $(k,z)$-clustering problem. Compared to the results for $(k,z)$-clustering in [Feldman and Langberg, STOC 2011], our framework saves an $\varepsilon^2 d$ factor in the coreset size. Compared to the results for $(k,z)$-clustering in [Sohler and Woodruff, FOCS 2018], our framework saves a $\operatorname{poly}(k)$ factor in the coreset size and avoids the $\exp(k/\varepsilon)$ term in the construction time. Specifically, our coreset for $k$-median ($z=1$) has size $\tilde{O}(\varepsilon^{-4} k)$, which saves a $k$ factor in the coreset size compared to the result in [Sohler and Woodruff, FOCS 2018]. Our algorithmic results rely on a new dimensionality reduction technique that connects two well-known shape-fitting problems, subspace approximation and clustering, and may be of independent interest. We also provide a size lower bound of $\Omega\left(k \cdot \min\left\{2^{z/20}, d\right\}\right)$ for a $0.01$-coreset for $(k,z)$-clustering, whose linear dependence on $k$ and exponential dependence on $z$ match our algorithmic results.
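To make the objective and the importance-sampling idea concrete, here is a minimal Python sketch. The function and variable names (`kz_cost`, `sensitivity_sample`) are my own, and the sampling rule, sampling each point with probability proportional to its contribution to the cost of a reference solution, is a naive stand-in for the paper's carefully bounded two-stage framework, not the actual algorithm:

```python
import math
import random

def kz_cost(points, centers, z):
    """(k,z)-clustering objective: the sum, over all points, of the z-th
    power of the Euclidean distance to the closest center."""
    return sum(min(math.dist(p, c) for c in centers) ** z for p in points)

def weighted_kz_cost(weighted_points, centers, z):
    """The same objective evaluated on a weighted point set (a coreset):
    each point's contribution is scaled by its weight."""
    return sum(w * min(math.dist(p, c) for c in centers) ** z
               for p, w in weighted_points)

def sensitivity_sample(points, centers, z, m, rng):
    """Toy importance-sampling coreset: sample m points with probability
    proportional to each point's contribution to the cost of a reference
    solution `centers`, and reweight each sample by the inverse of its
    expected sampling count so the cost estimator is unbiased.
    (Illustrative only; the paper uses bounded sensitivities instead of
    this naive rule to get guarantees for *all* candidate centers.)"""
    contrib = [min(math.dist(p, c) for c in centers) ** z for p in points]
    total = sum(contrib)
    probs = [c / total for c in contrib]
    idx = rng.choices(range(len(points)), weights=probs, k=m)
    return [(points[i], 1.0 / (m * probs[i])) for i in idx]
```

By construction, the weighted cost of the sample evaluated at the reference centers equals the true cost exactly; the hard part, which the paper's framework addresses, is guaranteeing a $(1 \pm \varepsilon)$ approximation simultaneously for every possible set of $k$ centers.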