Title
On a scalable entropic breaching of the overfitting barrier in machine learning
Authors
Abstract
Overfitting and the treatment of "small data" are among the most challenging problems in machine learning (ML), arising when a relatively small data statistics size $T$ is not enough to provide a robust ML fit for a relatively large data feature dimension $D$. A massively parallel ML analysis of generic classification problems for different $D$ and $T$ demonstrates the existence of statistically significant linear overfitting barriers for common ML methods. For example, these results reveal that for a robust classification of bioinformatics-motivated generic problems with the Long Short-Term Memory (LSTM) deep learning classifier, one needs, in the best case, a statistics size $T$ at least 13.8 times larger than the feature dimension $D$. It is shown that this overfitting barrier can be breached at a $10^{-12}$ fraction of the computational cost by means of the entropy-optimal Scalable Probabilistic Approximations algorithm (eSPA), which performs a joint solution of the entropy-optimal Bayesian network inference and feature space segmentation problems. Applied to experimental single-cell RNA sequencing data, eSPA exhibits a 30-fold classification performance boost compared to standard bioinformatics tools, and a 7-fold boost compared to the deep learning LSTM classifier.
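The $T/D$ barrier the abstract describes can be made concrete with a small experiment: train a classifier on data of fixed feature dimension $D$ while growing the statistics size $T$, and watch the gap between training and test accuracy close. The sketch below is a minimal illustration assuming synthetic two-class Gaussian data and a plain logistic-regression classifier; the data model, the `0.3` signal strength, and the choice of classifier are illustrative assumptions, not the paper's massively parallel protocol or its LSTM setup.

```python
# Illustrative T/D sweep: for a fixed feature dimension D, increase the
# statistics size T and compare training vs. held-out accuracy. Small
# T/D ratios overfit (high train, low test accuracy); large ratios close
# the gap. All parameters here are assumptions for demonstration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
D = 50                                   # feature dimension
for ratio in (1, 2, 5, 10, 20):          # candidate T/D ratios
    T = ratio * D                        # statistics (sample) size
    # two Gaussian classes with a weak mean shift in every feature
    X = rng.normal(size=(T, D))
    y = rng.integers(0, 2, size=T)
    X += 0.3 * y[:, None]                # class signal added to features
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"T/D = {ratio:3d}: train acc = {clf.score(X_tr, y_tr):.2f}, "
          f"test acc = {clf.score(X_te, y_te):.2f}")
```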
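The abstract's one-line description of eSPA, a joint solution of feature space segmentation and Bayesian label inference, can be caricatured by a toy alternating scheme: assign each point to a discrete box so as to balance a discretization error against a classification log-likelihood, then refit both the box centroids and the per-box label probabilities. The sketch below is a heavily simplified stand-in (no entropic regularization term, fixed box count `K`, an assumed trade-off weight `eps`); it is not the published eSPA algorithm.

```python
# Schematic only: toy alternation between feature-space segmentation
# (K boxes, K-means style) and per-box label probabilities (frequency
# estimates). Illustrates the "joint segmentation + probabilistic
# classification" idea from the abstract; NOT the eSPA implementation.
import numpy as np

def toy_joint_segmentation(X, y, K=4, eps=0.5, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    T, D = X.shape
    C = X[rng.choice(T, K, replace=False)]   # box centroids, shape (K, D)
    P = np.full(K, 0.5)                      # P(y = 1 | box k)
    for _ in range(iters):
        # discretization cost: squared distance to each centroid, (T, K)
        dist = ((X[:, None, :] - C[None]) ** 2).sum(-1)
        # classification cost: negative log-likelihood of label per box
        nll = -(y[:, None] * np.log(P + 1e-12)
                + (1 - y[:, None]) * np.log(1 - P + 1e-12))
        # joint assignment trades off both costs via the weight eps
        gamma = np.argmin(dist + eps * nll, axis=1)
        for k in range(K):                   # refit both models
            if (gamma == k).any():
                C[k] = X[gamma == k].mean(0)
                P[k] = y[gamma == k].mean()
    return C, P, gamma
```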