论文标题
基于塑形的多个稳定学习的理论和算法
Theory and Algorithms for Shapelet-based Multiple-Instance Learning
论文作者
论文摘要
我们提出了一种新的公式学习(MIL)的新表述,其中一个数据单元由一组称为袋子的实例组成。目的是根据与“塑形”(或图案)的相似性找到一个好的袋子分类器,其中带有塑形的袋子的相似性是袋子中实例的最大相似之处。在以前的工作中,某些培训实例被选为塑形,没有理论上的理由。在我们的表述中,我们使用所有可能的形状,从而无限地形状,从而产生更丰富的分类器。我们表明,该公式是可进行的,也就是说,可以通过线性编程提升(LPBoost)降低有限(实际上是多项式)大小的凸(DC)程序的差异。我们的理论结果还为以前一些作品的启发式法提供了理由。所提出的算法的时间复杂性高度取决于训练样本中所有实例集的大小。要应用于包含大量实例的数据,我们还提出了该算法的启发式选项,而不会丢失理论保证。我们的实证研究表明,我们的算法统一地针对时间序列分类和各种MIL任务的塑形学习任务,其精度与现有方法相当。此外,我们表明提出的启发式方法使我们能够通过合理的计算时间实现结果。
We propose a new formulation of Multiple-Instance Learning (MIL), in which a unit of data consists of a set of instances called a bag. The goal is to find a good classifier of bags based on the similarity with a "shapelet" (or pattern), where the similarity of a bag with a shapelet is the maximum similarity of instances in the bag. In previous work, some of the training instances are chosen as shapelets with no theoretical justification. In our formulation, we use all possible, and thus infinitely many shapelets, resulting in a richer class of classifiers. We show that the formulation is tractable, that is, it can be reduced through Linear Programming Boosting (LPBoost) to Difference of Convex (DC) programs of finite (actually polynomial) size. Our theoretical result also gives justification to the heuristics of some of the previous work. The time complexity of the proposed algorithm highly depends on the size of the set of all instances in the training sample. To apply to the data containing a large number of instances, we also propose a heuristic option of the algorithm without the loss of the theoretical guarantee. Our empirical study demonstrates that our algorithm uniformly works for Shapelet Learning tasks on time-series classification and various MIL tasks with comparable accuracy to the existing methods. Moreover, we show that the proposed heuristics allow us to achieve the result with reasonable computational time.