Paper title
Understanding collections of related datasets using dependent MMD coresets
Authors
Abstract
Understanding how two datasets differ can help us determine whether one dataset under-represents certain sub-populations, and can provide insight into how well models will generalize across datasets. Representative points selected by a maximum mean discrepancy (MMD) coreset can provide interpretable summaries of a single dataset, but are not easily compared across datasets. In this paper we introduce dependent MMD coresets, a data summarization method for collections of datasets that facilitates comparison of distributions. We show that dependent MMD coresets are useful for understanding multiple related datasets and for understanding model generalization between such datasets.
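To make the terms in the abstract concrete, the following is a minimal sketch of a standard single-dataset MMD coreset. It is not the dependent MMD coreset method introduced in the paper. It assumes a Gaussian (RBF) kernel, uniform weights on the selected points, and a simple greedy selection rule. The function names `rbf_kernel`, `mmd_squared`, and `greedy_mmd_coreset`, and the `bandwidth` parameter, are illustrative choices and do not come from the paper.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    return (
        rbf_kernel(x, x, bandwidth).mean()
        - 2.0 * rbf_kernel(x, y, bandwidth).mean()
        + rbf_kernel(y, y, bandwidth).mean()
    )


def greedy_mmd_coreset(x, size, bandwidth=1.0):
    """Greedily pick `size` distinct rows of x (uniformly weighted) so that
    the squared MMD between the picked rows and all of x stays small.
    Returns the indices of the picked rows."""
    k = rbf_kernel(x, x, bandwidth)
    target = k.mean(axis=1)           # mean similarity of each point to the data
    diag = np.diag(k)
    available = np.ones(len(x), dtype=bool)
    col_sum = np.zeros(len(x))        # sum of k[:, j] over chosen j
    within = 0.0                      # sum of k[i, j] over chosen pairs (i, j)
    cross = 0.0                       # sum of target[i] over chosen i
    chosen = []
    for _ in range(size):
        m = len(chosen) + 1
        # Value of both sums if each candidate were added next.
        new_within = within + 2.0 * col_sum + diag
        new_cross = cross + target
        # Squared MMD up to a constant that does not depend on the candidate.
        objective = new_within / m**2 - 2.0 * new_cross / m
        objective[~available] = np.inf
        c = int(np.argmin(objective))
        chosen.append(c)
        available[c] = False
        within, cross = new_within[c], new_cross[c]
        col_sum += k[:, c]
    return np.array(chosen)
```

Under these assumptions, `idx = greedy_mmd_coreset(x, 10)` returns ten row indices of `x`, and `mmd_squared(x[idx], x)` measures how well those points summarize the full dataset. Comparing summaries across several datasets, which is what the paper addresses, requires more than running this routine on each dataset separately.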