Paper title
Statistical detection of format dialects using the weighted Dowker complex
Paper authors
Paper abstract
This paper provides an experimentally validated, probabilistic model of file behavior when consumed by a set of pre-existing parsers. File behavior is measured by way of a standardized set of Boolean "messages" produced as the files are read. By thresholding the posterior probability that a file exhibiting a particular set of messages is from a particular dialect, our model yields a practical classification algorithm for two dialects. We demonstrate that this thresholding algorithm for two dialects can be bootstrapped from a training set consisting primarily of one dialect. Both the (parametric) theoretical and the (non-parametric) empirical distributions of file behaviors for one dialect yield good classification performance, and outperform classification based on simply counting messages. Our theoretical framework relies on statistical independence of messages within each dialect. Violations of this assumption are detectable and allow a format analyst to identify "boundaries" between dialects. A format analyst can therefore greatly reduce the number of files they need to consider when crafting new criteria for dialect detection, since they need only consider the files that exhibit ambiguous message patterns.
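The thresholding step described in the abstract can be illustrated with a minimal sketch. It assumes that each message is an independent Bernoulli variable within each dialect, which is the independence assumption the abstract states. The function names, the per-message probabilities, the prior, and the threshold are all hypothetical values chosen for illustration; they are not taken from the paper, and the sketch does not use the weighted Dowker complex itself.

```python
import math

def log_likelihood(messages, probs):
    """Log-probability of a Boolean message vector, assuming each message
    fires independently with the given per-message probability.
    Probabilities must lie strictly between 0 and 1."""
    total = 0.0
    for fired, p in zip(messages, probs):
        total += math.log(p if fired else 1.0 - p)
    return total

def posterior_dialect_a(messages, probs_a, probs_b, prior_a=0.5):
    """Posterior probability that a file comes from dialect A (Bayes' rule)."""
    log_a = math.log(prior_a) + log_likelihood(messages, probs_a)
    log_b = math.log(1.0 - prior_a) + log_likelihood(messages, probs_b)
    # Subtract the larger term before exponentiating to avoid underflow.
    top = max(log_a, log_b)
    weight_a = math.exp(log_a - top)
    weight_b = math.exp(log_b - top)
    return weight_a / (weight_a + weight_b)

def classify(messages, probs_a, probs_b, prior_a=0.5, threshold=0.5):
    """Label a file 'A' when its posterior for dialect A meets the threshold."""
    posterior = posterior_dialect_a(messages, probs_a, probs_b, prior_a)
    return "A" if posterior >= threshold else "B"

# Hypothetical per-message firing probabilities for three parser messages.
PROBS_A = [0.05, 0.10, 0.80]
PROBS_B = [0.60, 0.70, 0.20]

file_1 = [False, False, True]   # only the third message fired
file_2 = [True, True, False]    # the first two messages fired

if __name__ == "__main__":
    for name, msgs in [("file_1", file_1), ("file_2", file_2)]:
        p = posterior_dialect_a(msgs, PROBS_A, PROBS_B)
        print(name, round(p, 4), classify(msgs, PROBS_A, PROBS_B))
```

In this sketch a file whose posterior sits near the threshold is one that shows an ambiguous message pattern, which is the kind of file the abstract suggests a format analyst should inspect by hand.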