论文标题
DNA合成的批次优化
Batch Optimization for DNA Synthesis
论文作者
论文摘要
大量合成DNA分子最近已被用于可靠地存储大量数字数据。尽管DNA作为存储介质具有较高的存储密度,但由于其可用DNA合成技术的高成本和低吞吐量,其实际用途目前受到严重限制。我们研究了批处理优化在降低大规模DNA合成成本中的作用,这转化为以下算法任务。给定一个固定长度的随机第四纪字符串的大池$ \ MATHCAL {s} $,分区$ \ Mathcal {s} $以批量划分,以最大程度地减少批次跨批量的最短常见超级股权的长度的总和。我们介绍了两个用于批次优化的想法,它们在天真的基线上都以不同的方式进行了改进:(1)使用$(ACGT)^{*} $及其反向$(TGCA)^{*} $作为参考链,以及适当的批量批量,以及(2)通过适当的订单的量化列来进行批量批量。我们还证明,DNA合成成本渐近地匹配下限,这表明人们无法改善这两个想法。我们的结果揭示了在DNA数据存储中自然出现的两种情况之间的令人惊讶的分离:在$ \ Mathcal {s} $中的字符串不包含相同性格(同源物)的重复(与$ \ nath Calcal incance n incection相比,渐近)的渐近成本可节省批处理优化的成本明显更大。
Large pools of synthetic DNA molecules have been recently used to reliably store significant volumes of digital data. While DNA as a storage medium has enormous potential because of its high storage density, its practical use is currently severely limited because of the high cost and low throughput of available DNA synthesis technologies. We study the role of batch optimization in reducing the cost of large scale DNA synthesis, which translates to the following algorithmic task. Given a large pool $\mathcal{S}$ of random quaternary strings of fixed length, partition $\mathcal{S}$ into batches in a way that minimizes the sum of the lengths of the shortest common supersequences across batches. We introduce two ideas for batch optimization that both improve (in different ways) upon a naive baseline: (1) using both $(ACGT)^{*}$ and its reverse $(TGCA)^{*}$ as reference strands, and batching appropriately, and (2) batching via the quantiles of an appropriate ordering of the strands. We also prove asymptotically matching lower bounds on the cost of DNA synthesis, showing that one cannot improve upon these two ideas. Our results uncover a surprising separation between two cases that naturally arise in the context of DNA data storage: the asymptotic cost savings of batch optimization are significantly greater in the case where strings in $\mathcal{S}$ do not contain repeats of the same character (homopolymers), as compared to the case where strings in $\mathcal{S}$ are unconstrained.