Paper Title

Identification and Validation of the SNV Biomarkers Based on Multi-Dimensional Patterns

Authors

Li, Bo; Zhang, Junying; Yu, Liang

Abstract

Background: Single nucleotide variants (SNVs) show different distributions in DNA samples from patients with different types of cancer. Even so, choosing a method that makes the fullest use of SNVs to identify cancer remains an exacting task. Results: In this paper, we propose a biomarker concept based on SNV patterns in different feature dimensions. A raw dataset of 2761 samples covering twelve different cancers was obtained from TCGA (The Cancer Genome Atlas). After preliminary screening of 562,321 DNA mutation sites in the samples, the mutation sites were extracted and characterized by cancer type in six different SNV feature dimensions. We found that the extracted features showed similar distributions around the cluster center of each sample's disease type. After initial processing of the raw data, the samples concentrated more tightly around the SNV-level distribution of their cancer or cancer subtype. We classified the extracted features with k-nearest neighbors (KNN) and verified the results with leave-one-out cross-validation. Classification accuracy was stable at around 97% and peaked at 97.43%. During the validation phase, we found validated oncogenes at the loci of the highest-importance features among nine cancers. Conclusions: In summary, the samples showed consistent patterns according to the cancer to which they belong. It is feasible to classify the cancer of a sample with high accuracy from the distribution of its SNVs across different dimensions, and the approach has potential implications for the discovery of cancer-causing genes.
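
The classification step named in the abstract, KNN evaluated with leave-one-out cross-validation, can be illustrated with a minimal sketch. This is not the authors' implementation. The six-dimensional toy vectors, the class names `cancer_A` to `cancer_C`, the Euclidean distance, the choice `k=3`, and the helper names are all assumptions made for illustration. The abstract does not specify any of them, and no TCGA data is used.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Rank training samples by Euclidean distance to the query (assumed metric).
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(features, labels, k=3):
    """Hold out each sample once, classify it from all the others, return the hit rate."""
    hits = 0
    for i in range(len(features)):
        rest_x = features[:i] + features[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        if knn_predict(rest_x, rest_y, features[i], k) == labels[i]:
            hits += 1
    return hits / len(features)


def make_toy_data(n_per_class=20, n_dims=6, seed=0):
    """Synthetic stand-in for per-sample SNV feature vectors (hypothetical, not TCGA data)."""
    rng = random.Random(seed)
    features, labels = [], []
    for class_index, name in enumerate(["cancer_A", "cancer_B", "cancer_C"]):
        center = [5.0 * class_index] * n_dims  # well-separated cluster centers
        for _ in range(n_per_class):
            features.append([c + rng.gauss(0, 1) for c in center])
            labels.append(name)
    return features, labels


features, labels = make_toy_data()
accuracy = leave_one_out_accuracy(features, labels, k=3)
print(f"LOOCV accuracy on toy data: {accuracy:.2%}")
```

Leave-one-out uses every sample as a test case exactly once, which suits a dataset of a few thousand samples, although the cost grows with the square of the sample count in this brute-force form.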
