算法通过素描：将数据转换为高斯随机设计

论文标题

算法通过素描：将数据转换为高斯随机设计

Algorithmic Gaussianization through Sketching: Converting Data into Sub-gaussian Random Designs

论文作者

Dereziński, Michał

论文摘要

算法高斯化是一种现象，当使用随机素描或采样方法产生较小的大型数据集的表示形式时，可能会出现的现象：对于某些任务，已经观察到这些素描的表示形式表现出许多可靠的性能特征，这些性能是在数据样本中出现的，当数据样本来自次级次级随机设计时，这是一个强大的统计统计模型的数据分布。但是，仅研究了这种现象的特定任务和指标，或者是依靠计算昂贵的方法。我们通过为平均值提供用于高斯数据分布的算法框架来解决这一问题，并证明可以有效构建与亚高斯随机设计几乎无法区分的数据草图（在总变化距离方面）。特别是，依靠最近引入的素描技术称为杠杆评分（少）嵌入，我们表明，一个人可以构造$ n \ times d $ d $ sketch a $ n \ times d $ d $ matrix $ a $ a $，其中$ n \ ll n $，其中几乎可以与sub-gaussian Design $ o $ o（\ nn n nnd nnd ns nd ns nd ns of a Sup-ussian discessian coptiansian） $ \ text {nnz}（a）$是$ a $中的非零条目的数量。结果，可以直接适用于我们的草图框架，可直接适应于高斯设计（例如，对于最小二乘和套索回归，低级别估计，低级别近似等）的估计量（例如，对于最小二乘和套索回归，相协方差估计，低级别的近似等），可直接适应我们的草图框架。我们通过素描最小二乘的新近似保证和其他示例来说明这一点。

Algorithmic Gaussianization is a phenomenon that can arise when using randomized sketching or sampling methods to produce smaller representations of large datasets: For certain tasks, these sketched representations have been observed to exhibit many robust performance characteristics that are known to occur when a data sample comes from a sub-gaussian random design, which is a powerful statistical model of data distributions. However, this phenomenon has only been studied for specific tasks and metrics, or by relying on computationally expensive methods. We address this by providing an algorithmic framework for gaussianizing data distributions via averaging, proving that it is possible to efficiently construct data sketches that are nearly indistinguishable (in terms of total variation distance) from sub-gaussian random designs. In particular, relying on a recently introduced sketching technique called Leverage Score Sparsified (LESS) embeddings, we show that one can construct an $n\times d$ sketch of an $N\times d$ matrix $A$, where $n\ll N$, that is nearly indistinguishable from a sub-gaussian design, in time $O(\text{nnz}(A)\log N + nd^2)$, where $\text{nnz}(A)$ is the number of non-zero entries in $A$. As a consequence, strong statistical guarantees and precise asymptotics available for the estimators produced from sub-gaussian designs (e.g., for least squares and Lasso regression, covariance estimation, low-rank approximation, etc.) can be straightforwardly adapted to our sketching framework. We illustrate this with a new approximation guarantee for sketched least squares, among other examples.

下载PDF全文

下载文献需遵守相关版权规定

论文标题