Paper Title
Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction
Authors
Abstract
Machine learning has made great progress in computer vision and many other fields, largely thanks to large amounts of high-quality training samples, but it works less well for genomic data analysis, where datasets are notoriously small. In this work, we focus on the few-shot disease subtype prediction problem: identifying subgroups of similar patients, by training on small data, that can guide treatment decisions for a specific individual. In practice, doctors and clinicians typically address this problem by studying several interrelated clinical variables simultaneously. We attempt to simulate this clinical perspective and introduce meta-learning techniques to develop a new model that can extract common experience or knowledge from interrelated clinical tasks and transfer it to help address new tasks. Our model is built upon a carefully designed meta-learner, the Prototypical Network, a simple yet effective meta-learning method for few-shot image classification. Observing that gene expression data have particularly high dimensionality and high noise compared with image data, we propose an extension of it with two additional modules to address these issues. Concretely, we append a feature selection layer to automatically filter out disease-irrelevant genes and incorporate a sample reweighting strategy to adaptively remove noisy data, while the extended model remains capable of learning from a limited number of training examples and generalizing well. Experiments on simulated and real gene expression data substantiate the superiority of the proposed method for predicting disease subtypes and identifying potential disease-related genes.
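To make the three ingredients named in the abstract more concrete, the following is a minimal NumPy sketch of prototype-based classification with a per-gene gate and per-sample weights. It is an illustration under stated assumptions, not the authors' implementation: the function names (`weighted_prototypes`, `predict`), the multiplicative form of the gate, and the hand-set values of `gate` and `sample_w` are all hypothetical. In the paper these quantities would be learned, and the abstract does not specify how.

```python
import numpy as np

def weighted_prototypes(support_x, support_y, gate, sample_w, n_classes):
    """Class prototypes as weighted means of gated support samples.

    gate:     per-gene multiplier (assumed form; 0 drops a gene entirely).
    sample_w: per-sample weight (assumed form; 0 ignores a noisy sample).
    """
    gated = support_x * gate
    protos = np.zeros((n_classes, support_x.shape[1]))
    for c in range(n_classes):
        mask = support_y == c
        w = sample_w[mask]
        protos[c] = (w[:, None] * gated[mask]).sum(axis=0) / w.sum()
    return protos

def predict(query_x, protos, gate):
    """Assign each query to the nearest prototype (squared Euclidean distance)."""
    gated = query_x * gate
    dists = ((gated[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy data: gene 0 carries the subtype signal, gene 1 is pure noise.
support_x = np.array([[0.0, 5.0], [0.2, -5.0],               # subtype 0
                      [1.0, 4.0], [1.2, -4.0], [9.0, 0.0]])  # subtype 1 (last row is an outlier)
support_y = np.array([0, 0, 1, 1, 1])
gate = np.array([1.0, 0.0])                       # hand-set: keep gene 0, drop gene 1
sample_w = np.array([1.0, 1.0, 1.0, 1.0, 0.0])    # hand-set: ignore the outlier

protos = weighted_prototypes(support_x, support_y, gate, sample_w, n_classes=2)
query_x = np.array([[0.0, 9.0], [1.3, -9.0]])
pred = predict(query_x, protos, gate)
```

With the gate zeroing out the noisy gene and the outlier's weight set to zero, the prototypes depend only on the informative gene and the clean samples, which is the intuition behind appending feature selection and sample reweighting to a prototypical network.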