论文标题
彩色字符串中的图案发现
Pattern Discovery in Colored Strings
论文作者
论文摘要
在本文中,我们考虑了识别有色字符串感兴趣模式的问题。彩色字符串是一个字符串,每个位置都分配了一组有限的颜色之一。我们的任务是找到始终发生的彩色字符串的子字符串,然后在相同的距离处具有相同的颜色。该问题是由于嵌入式系统验证中的应用,特别是断言挖掘的动机。那里的目标是通过对其仿真轨迹的分析自动找到嵌入式系统的属性。 我们表明,在我们的设置中,感兴趣的模式数量由$ \ Mathcal {o}(n^2)$限制,其中$ n $是字符串的长度。我们引入了一种基线算法,以$ \ Mathcal {o}(n^2)$时间运行,该算法标识了字符串中所有颜色的所有感兴趣的感兴趣模式,以满足某些最小值条件。对于仅对一种与一种颜色相关的模式感兴趣的情况,我们还提供了第二种算法,该算法在最坏的情况下以$ \ Mathcal {o}(n^2 \ log n)$时间运行,但在实践中比基线算法更快。两种解决方案都使用后缀树,第二个算法也使用适当定义的优先级队列,这使我们能够减少计算数量。我们对合成和现实世界数据集进行了对所提出方法的实验评估,发现第二算法的表现优于所有模拟数据上的第一个算法,而在现实世界数据上,性能在略有放缓(在数据集的一半上)和最新因素的速度较小(在数据集的一半上)而变化。
In this paper, we consider the problem of identifying patterns of interest in colored strings. A colored string is a string where each position is assigned one of a finite set of colors. Our task is to find substrings of the colored string that always occur followed by the same color at the same distance. The problem is motivated by applications in embedded systems verification, in particular, assertion mining. The goal there is to automatically find properties of the embedded system from the analysis of its simulation traces. We show that, in our setting, the number of patterns of interest is upper-bounded by $\mathcal{O}(n^2)$, where $n$ is the length of the string. We introduce a baseline algorithm, running in $\mathcal{O}(n^2)$ time, which identifies all patterns of interest satisfying certain minimality conditions, for all colors in the string. For the case where one is interested in patterns related to one color only, we also provide a second algorithm which runs in $\mathcal{O}(n^2\log n)$ time in the worst case but is faster than the baseline algorithm in practice. Both solutions use suffix trees, and the second algorithm also uses an appropriately defined priority queue, which allows us to reduce the number of computations. We performed an experimental evaluation of the proposed approaches over both synthetic and real-world datasets, and found that the second algorithm outperforms the first algorithm on all simulated data, while on the real-world data, the performance varies between a slight slowdown (on half of the datasets) and a speedup by a factor of up to 11.