Paper Title
Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation
Paper Authors
Paper Abstract
Automated machine learning (AutoML) can produce complex model ensembles by stacking, bagging, and boosting many individual models like trees, deep networks, and nearest neighbor estimators. While highly accurate, the resulting predictors are large, slow, and opaque as compared to their constituents. To improve the deployment of AutoML on tabular data, we propose FAST-DAD to distill arbitrarily complex ensemble predictors into individual models like boosted trees, random forests, and deep networks. At the heart of our approach is a data augmentation strategy based on Gibbs sampling from a self-attention pseudolikelihood estimator. Across 30 datasets spanning regression and binary/multiclass classification tasks, FAST-DAD distillation produces significantly better individual models than one obtains through standard training on the original data. Our individual distilled models are over 10x faster and more accurate than ensemble predictors produced by AutoML tools like H2O/AutoSklearn.
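The recipe the abstract describes (teacher ensemble, Gibbs-sampled synthetic covariates, student trained on teacher labels) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the self-attention pseudolikelihood network is stood in for by simple per-feature linear conditionals, the AutoML ensemble by a scikit-learn random forest, and the distilled student by a single decision tree; for classification the paper matches the teacher's predicted probabilities, while this sketch uses hard teacher labels for brevity.

```python
# Hedged sketch of augmented distillation in the spirit of FAST-DAD.
# All model choices below are stand-ins, not the paper's components.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1) Teacher: a complex ensemble (stand-in for an AutoML-produced ensemble).
teacher = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# 2) Per-feature conditionals p(x_i | x_-i): here plain linear regressions
#    with Gaussian residual noise; the paper instead learns all conditionals
#    jointly with a self-attention pseudolikelihood estimator.
conditionals = []
for i in range(X_tr.shape[1]):
    others = np.delete(X_tr, i, axis=1)
    model = LinearRegression().fit(others, X_tr[:, i])
    sigma = np.std(X_tr[:, i] - model.predict(others))
    conditionals.append((model, sigma))

# 3) Gibbs sampling: start from real rows and resample one feature at a
#    time from its conditional, producing synthetic covariate vectors.
def gibbs_augment(X_seed, sweeps=3):
    X_aug = X_seed.copy()
    for _ in range(sweeps):
        for i, (model, sigma) in enumerate(conditionals):
            others = np.delete(X_aug, i, axis=1)
            X_aug[:, i] = model.predict(others) + rng.normal(0, sigma, len(X_aug))
    return X_aug

X_synth = gibbs_augment(X_tr[rng.integers(0, len(X_tr), 4 * len(X_tr))])

# 4) Distill: label real + synthetic points with the teacher and fit a
#    single simple model on the union.
X_all = np.vstack([X_tr, X_synth])
y_all = teacher.predict(X_all)
student = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X_all, y_all)

print("teacher accuracy:", accuracy_score(y_te, teacher.predict(X_te)))
print("student accuracy:", accuracy_score(y_te, student.predict(X_te)))
```

The synthetic rows matter because they let the student query the teacher in regions the small training set never covers, which is the gap augmented distillation is designed to close.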