Information and order of genomic sequences within chromosomes as identified by complexity theory. An integrated methodology

Paper Title

Information and order of genomic sequences within chromosomes as identified by complexity theory. An integrated methodology

Authors

Karakatsanis, L. P., Pavlos, E. G., Tsoulouhas, G., Stamokostas, G. L., Mosbruger, T. L., Duke, J. L., Pavlos, G. P., Monos, D. S.

Abstract

Complexity metrics and machine learning (ML) models were used to analyze the lengths of segmental genomic entities (exons, introns, intergenic regions, and repeat/unique DNA sequences) in each of the 22 human chromosomes. The purpose of the study was to assess the information and order that may be concealed within the size distributions of these sequences. For this purpose, we developed an innovative integrated methodology. Our analysis is based on the reconstructed phase space theorem, the non-extensive statistical theory of Tsallis, ML techniques, and a new technical index that integrates the generated information, which we introduce and name the Complexity Factor (COFA). The low-dimensional deterministic nonlinear chaotic and non-extensive statistical character of the DNA sequences was verified, with strong multifractal characteristics and long-range correlations that vary significantly per genomic entity and per chromosome. The results reveal changes in complexity behavior per genomic entity and per chromosome with respect to the size distribution of individual genomic segments. The lengths of intronic regions show greater complexity in all metrics than those of exonic regions, with longer-range correlations and stronger memory effects, for all chromosomes. We conclude that the size distribution of genomic regions within chromosomes is not random but follows a specific pattern with characteristic features, seen here through its complexity character, and that it is part of the dynamics of the whole genome according to complexity theory. This picture of the dynamics of information redundancy in DNA is recognized by ML tools for clustering, classification, and prediction.
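
The abstract names two standard building blocks: phase space reconstruction and Tsallis non-extensive statistics. As a rough illustration only, the sketch below shows a time-delay embedding of a series of segment lengths and the Tsallis entropy S_q = (1 - Σ p_i^q) / (q - 1) of a binned length distribution. The function names, the bin width, the embedding parameters, and the toy lengths are all assumptions made for this example; they are not taken from the paper, and this is not the authors' pipeline or their COFA index.

```python
import math
from collections import Counter


def tsallis_entropy(probs, q):
    """Tsallis entropy S_q = (1 - sum(p_i**q)) / (q - 1).

    For q close to 1 the Shannon entropy (natural log) is returned,
    which is the known q -> 1 limit of S_q.
    """
    probs = [p for p in probs if p > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def delay_embed(series, dim, tau):
    """Time-delay embedding: vectors (x[i], x[i+tau], ..., x[i+(dim-1)*tau])."""
    n = len(series) - (dim - 1) * tau
    return [tuple(series[i + k * tau] for k in range(dim)) for i in range(n)]


def length_distribution(lengths, bin_width):
    """Empirical probabilities of segment lengths grouped into fixed-width bins."""
    counts = Counter(length // bin_width for length in lengths)
    total = len(lengths)
    return [count / total for count in counts.values()]


# Hypothetical toy segment lengths in base pairs (not data from the paper).
lengths = [120, 95, 3400, 150, 88, 12000, 210, 130, 5600, 99]

probs = length_distribution(lengths, bin_width=1000)  # bin width is an assumption
print(tsallis_entropy(probs, q=2.0))
print(delay_embed(lengths, dim=3, tau=2)[:2])
```

In practice the entropic index q, the embedding dimension, and the delay would be estimated from the data rather than fixed by hand as they are here.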
