Paper title
Practical KMP/BM Style Pattern-Matching on Indeterminate Strings
Paper authors
Paper abstract
In this paper we describe two simple, fast, space-efficient algorithms for finding all matches of an indeterminate pattern $p = p[1..m]$ in an indeterminate string $x = x[1..n]$, where both $p$ and $x$ are defined on a "small" ordered alphabet $Σ$ (say, $σ = |Σ| \le 9$). Both algorithms depend on a preprocessing phase that replaces $Σ$ by an integer alphabet $Σ_I$ of size $σ_I = σ$ and that (reversibly, in time linear in string length) maps $x$ and $p$ into equivalent regular strings $y$ and $q$, respectively, on $Σ_I$, whose maximum (indeterminate) letter can be expressed in a 32-bit word (for $σ \le 4$, thus for DNA sequences, an 8-bit representation suffices). We first describe an efficient version, KMP Indet, of the venerable Knuth-Morris-Pratt algorithm to find all occurrences of $q$ in $y$ (that is, of $p$ in $x$), which, whenever necessary, uses the prefix array rather than the border array to control shifts of the transformed pattern $q$ along the transformed string $y$. We go on to describe a similarly efficient version, BM Indet, of the Boyer-Moore algorithm that turns out to execute significantly faster than KMP Indet over a wide range of test cases. A noteworthy feature is that both algorithms require very little additional space: $Θ(m)$ words. We conjecture that a similar approach may yield practical and efficient indeterminate equivalents to other well-known pattern-matching algorithms, in particular the several variants of Boyer-Moore.
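The abstract does not spell out the integer encoding, so the following is only a minimal sketch under a stated assumption: each letter of $Σ$ is assigned a distinct small prime, and an indeterminate letter (a set of letters) is encoded as the product of its primes. This assumption is at least consistent with the stated bounds, since the product of the first nine primes fits in 32 bits and $2 \cdot 3 \cdot 5 \cdot 7 = 210$ fits in 8 bits; it is also reversible by factoring. Two encoded letters then match when they share a prime factor. The function names (`make_code_table`, `encode`, `letters_match`, `naive_find_all`) are hypothetical, and the search shown is a plain quadratic scan for illustration, not KMP Indet or BM Indet.

```python
from math import gcd

# Assumption: the first nine primes, one per letter, enough for sigma <= 9.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23]


def make_code_table(alphabet):
    """Assign the k-th prime to the k-th letter of an ordered alphabet."""
    if len(alphabet) > len(PRIMES):
        raise ValueError("this sketch supports at most 9 letters")
    return {letter: PRIMES[k] for k, letter in enumerate(alphabet)}


def encode(indeterminate, table):
    """Map each position (a non-empty collection of letters) to the
    product of the primes of its distinct letters."""
    encoded = []
    for letters in indeterminate:
        code = 1
        for letter in set(letters):
            code *= table[letter]
        encoded.append(code)
    return encoded


def letters_match(a, b):
    """Two encoded letters match if their letter sets intersect,
    i.e. if the codes share a prime factor."""
    return gcd(a, b) > 1


def naive_find_all(q, y):
    """Return all 0-based positions where q matches y.
    Plain O(mn) scan, used here only to illustrate the encoding."""
    m, n = len(q), len(y)
    return [
        i
        for i in range(n - m + 1)
        if all(letters_match(q[j], y[i + j]) for j in range(m))
    ]


if __name__ == "__main__":
    table = make_code_table("ACGT")
    # Each position lists the letters it may stand for.
    y = encode(["A", "CG", "T", "A", "AC", "T"], table)
    q = encode(["AC", "T"], table)
    print(y, q, naive_find_all(q, y))
```

Positions here are 0-based, whereas the abstract indexes strings from 1. Note that matching on indeterminate letters is not transitive (`{A,C}` matches both `A` and `C`, which do not match each other), which is plausibly why the abstract falls back on the prefix array rather than the border array to control shifts.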