锤子图和计算生物学应用的公制维度

论文标题

锤子图和计算生物学应用的公制维度

Metric Dimension of Hamming Graphs and Applications to Computational Biology

论文作者

Laird, Lucas

论文摘要

遗传测序已成为计算生物学中越来越负担得起且易于获得的基因组数据来源。这些数据通常表示为$ k $ -mers，即某些固定长度$ k $的字符串，并从参考字母中选择的符号。相反，一些最有效，最有研究的机器学习算法需要数据的数值表示。所谓的锤子图的度量尺寸的概念提出了一种解决此问题的有希望的方法。当这些顶点的距离独特地表征图中的每个顶点时，图中的一个顶点被认为正在解决。图的度量尺寸是顶点的最小分辨子集的大小。找到一般图的指标维度是一个具有挑战性的问题，实际上是NP完整的。最近，已经提出了一种在锤图中查找解决方案集的有效算法，这足以将$ k $ - mers唯一地嵌入实际矢量空间中。由于嵌入的尺寸是关联解析集的基础性，因此确定是否可以从解决方案集中删除节点，同时保持其解决方案是引起极大的兴趣。对于大图而言，这可能是非常具有挑战性的，因为只有一种检查集合是否是解决集合的蛮力方法。在本文中，我们表征了锤态图在有限域上的线性系统的分解性：当且仅当线性系统仅在上述域上具有微不足道的解决方案时，一组节点才能解决。我们可以将域表示为多项式系统的根部，因此Gröbner碱基的设备派上用场，以确定是否解决了一组节点。作为概念证明，我们研究了与八肽相关的锤态图的可分辨性，即长度为8的蛋白质序列。

Genetic sequencing has become an increasingly affordable and accessible source of genomic data in computational biology. This data is often represented as $k$-mers, i.e., strings of some fixed length $k$ with symbols chosen from a reference alphabet. In contrast, some of the most effective and well-studied machine learning algorithms require numerical representations of the data. The concept of metric dimension of the so-called Hamming graphs presents a promising way to address this issue. A subset of vertices in a graph is said to be resolving when the distances to those vertices uniquely characterize every vertex in the graph. The metric dimension of a graph is the size of a smallest resolving subset of vertices. Finding the metric dimension of a general graph is a challenging problem, NP-complete in fact. Recently, an efficient algorithm for finding resolving sets in Hamming graphs has been proposed, which suffices to uniquely embed $k$-mers into a real vector space. Since the dimension of the embedding is the cardinality of the associated resolving set, determining whether or not a node can be removed from a resolving set while keeping it resolving is of great interest. This can be quite challenging for large graphs since only a brute-force approach is known for checking whether a set is a resolving set or not. In this thesis, we characterize resolvability of Hamming graphs in terms of a linear system over a finite domain: a set of nodes is resolving if and only if the linear system has only a trivial solution over said domain. We can represent the domain as the roots of a polynomial system so the apparatus of Gröbner bases comes in handy to determine, whether or not a set of nodes is resolving. As proof of concept, we study the resolvability of Hamming graphs associated with octapeptides i.e. proteins sequences of length eight.

下载PDF全文

下载文献需遵守相关版权规定

论文标题