Paper Title
Coresets for Clustering in Euclidean Spaces: Importance Sampling is Nearly Optimal
Paper Authors
Paper Abstract
Given a collection of $n$ points in $\mathbb{R}^d$, the goal of the $(k,z)$-clustering problem is to find a subset of $k$ "centers" that minimizes the sum of the $z$-th powers of the Euclidean distance of each point to its closest center. Special cases of the $(k,z)$-clustering problem include the $k$-median and $k$-means problems. Our main result is a unified two-stage importance sampling framework that constructs an $\varepsilon$-coreset for the $(k,z)$-clustering problem. Compared to the results for $(k,z)$-clustering in [Feldman and Langberg, STOC 2011], our framework saves an $\varepsilon^2 d$ factor in the coreset size. Compared to the results for $(k,z)$-clustering in [Sohler and Woodruff, FOCS 2018], our framework saves a $\operatorname{poly}(k)$ factor in the coreset size and avoids the $\exp(k/\varepsilon)$ term in the construction time. Specifically, our coreset for $k$-median ($z=1$) has size $\tilde{O}(\varepsilon^{-4} k)$, which saves a $k$ factor in the coreset size compared to the result in [Sohler and Woodruff, FOCS 2018]. Our algorithmic results rely on a new dimensionality reduction technique that connects two well-known shape-fitting problems, subspace approximation and clustering, and may be of independent interest. We also provide a size lower bound of $\Omega\left(k \cdot \min\left\{2^{z/20}, d\right\}\right)$ for a $0.01$-coreset for $(k,z)$-clustering, whose linear dependence on $k$ and exponential dependence on $z$ match our algorithmic results.
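To make the objective and the importance-sampling idea concrete, here is a minimal Python sketch. The function and variable names (`kz_cost`, `sensitivity_sample`) are my own, and the sampling rule, sampling each point with probability proportional to its contribution to the cost of a reference solution, is a naive stand-in for the paper's carefully bounded two-stage framework, not the actual algorithm:

```python
import math
import random

def kz_cost(points, centers, z):
    """(k,z)-clustering objective: the sum, over all points, of the z-th
    power of the Euclidean distance to the closest center."""
    return sum(min(math.dist(p, c) for c in centers) ** z for p in points)

def weighted_kz_cost(weighted_points, centers, z):
    """The same objective evaluated on a weighted point set (a coreset):
    each point's contribution is scaled by its weight."""
    return sum(w * min(math.dist(p, c) for c in centers) ** z
               for p, w in weighted_points)

def sensitivity_sample(points, centers, z, m, rng):
    """Toy importance-sampling coreset: sample m points with probability
    proportional to each point's contribution to the cost of a reference
    solution `centers`, and reweight each sample by the inverse of its
    expected sampling count so the cost estimator is unbiased.
    (Illustrative only; the paper uses bounded sensitivities instead of
    this naive rule to get guarantees for *all* candidate centers.)"""
    contrib = [min(math.dist(p, c) for c in centers) ** z for p in points]
    total = sum(contrib)
    probs = [c / total for c in contrib]
    idx = rng.choices(range(len(points)), weights=probs, k=m)
    return [(points[i], 1.0 / (m * probs[i])) for i in idx]
```

By construction, the weighted cost of the sample evaluated at the reference centers equals the true cost exactly; the hard part, which the paper's framework addresses, is guaranteeing a $(1 \pm \varepsilon)$ approximation simultaneously for every possible set of $k$ centers.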