Thesis information


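Chinese title:	基于多组学数据的肺癌致癌基因及模块挖掘方法研究
Author:	Li Peng (李蓬)
Confidentiality level:	Public
Thesis language:	Chinese
Discipline code:	081203
Discipline:	Computer Application Technology
Student type:	Doctoral
Degree:	Doctor of Engineering
Degree type:	Academic degree
Degree year:	2020
Campus:	Beijing campus
School:	School of Artificial Intelligence
First supervisor:	Sun Bo (孙波)
First supervisor's affiliation:	School of Artificial Intelligence, Beijing Normal University
Submission date:	2021-04-21
Defense date:	2020-06-30
English title:	Research on mining method for lung cancer-related genes and modules based on multi-omics data
Chinese keywords:	多组学数据 ; 癌症相关基因 ; 生存预测 ; 网络模型 ; 肺癌
English keywords:	Multi-omics data ; Cancer-related genes ; Survival prediction ; Network model ; Lung cancer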
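Chinese abstract (translated):	Cancer is among the most dangerous human diseases. Understanding the mechanisms by which cancer arises and develops, discovering biomarkers that can identify its early characteristics, and predicting the survival status of cancer patients are important tasks in cancer research. Because genetic variation is the main cause of the transformation of normal cells into cancer cells, identifying the key genes and modules in cancer development and understanding their mechanisms of action is a major challenge in current cancer research. As a complex disease, cancer is associated with many factors. Multi-omics data, combined with pattern recognition and machine learning methods, therefore make it possible to mine and identify the factors affecting cancer development more comprehensively. To this end, taking lung cancer as an example, this dissertation studies methods for mining cancer-related genes and modules based on multi-omics data, and completes the following four pieces of work.

(1) A method for identifying prognostic cancer-related genes based on gene expression data and the PPI network. To gain a deeper understanding of the patterns of cancer development and to effectively separate cancer patients into high- and low-risk groups, a method is proposed that uses a random walk algorithm and the Cox proportional hazards model to identify cancer-related genes with prognostic value. First, known cancer genes are selected as seed genes in the PPI network, and the random walk with restart algorithm is used to identify candidate cancer-related genes. Then, a univariate Cox proportional hazards model is used to screen cancer survival-related genes from gene expression data. Finally, prognostic cancer-related genes are obtained from the candidate cancer-related genes and the survival-related genes. Experimental results show that the key genes obtained by the proposed method are closely related to cancer and can serve as cancer prognostic markers.

(2) A method for ranking cancer-related genes based on mutual information and network embedding. To better understand the molecular mechanisms of cancer and discover effective therapeutic targets, a method is proposed for ranking cancer-related genes based on mutual information and network embedding. First, mutual information is used to screen differentially expressed genes from gene expression data as candidate genes. Next, a network embedding method is applied to the PPI network to learn a low-dimensional vector representation of each gene according to a designed gene functional similarity function. Then, with Euclidean distance as the distance metric, the candidate genes most functionally similar to known cancer genes are selected as key genes. Finally, the key genes are ranked by betweenness centrality to obtain prioritized cancer-related genes. Experimental results show that the genes identified by the proposed method are closely related to the occurrence and development of cancer.

(3) A method for mining cancer-related gene modules based on multi-omics data. Compared with individual cancer-related genes, multi-gene modules are more accurate and reliable as biomarkers for cancer diagnosis and prognosis. First, candidate cancer-related genes are screened from genomics data. Next, to address the nonlinearity of the various regulatory mechanisms in biological networks, a method based on random forest regression is proposed to integrate multi-omics data. This method identifies the key regulatory factors affecting gene expression and constructs a genome-wide regulatory network. Then, by constructing the differential expression regulatory network of the candidate cancer-related genes, the sets of dysregulated genes they affect are mined. Finally, taking into account the topology and biological information of genes in the PPI network, a method is proposed to mine key gene modules based on a gene functional network. This method uses gene functional similarity as the distance metric between candidate cancer-related genes and applies a density-based clustering algorithm to mine cancer-related gene modules. Experimental results show that the proposed method can effectively identify cancer-related gene modules, and that the resulting key genes have an important influence on cancer development and are closely related to patient survival.

(4) A cancer survival prediction method based on multi-omics data. To effectively identify cancer prognostic markers and build a reliable survival prediction model, a cancer survival prediction method named LCox&DCox_MDAE is proposed that is based on multi-omics data and uses deep learning and feature selection techniques. First, a univariate Cox proportional hazards model is used to screen cancer survival-related genes from each omics data set separately. Next, because cancer is associated with many factors, a method based on an autoencoder is proposed to integrate multi-omics data. Then, the Lasso-Cox method is used to identify cancer prognostic genes from the candidate genes. Finally, to build a nonlinear survival prediction model, a deep neural network is combined with the Cox model to construct a deep Cox model that predicts patient survival status. Experimental results show that the key genes obtained by the proposed method have good prognostic value and can serve as cancer prognostic markers, and that the survival predictions are more accurate than those obtained using only single-omics data or traditional survival prediction methods.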
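The random walk with restart step described in work (1) can be illustrated with a minimal sketch. This is a generic formulation of the algorithm, not the thesis's implementation: the function name `random_walk_with_restart`, the restart probability of 0.3, the convergence tolerance, and the toy four-gene path network are all assumptions made for illustration.

```python
import numpy as np


def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-8, max_iter=1000):
    """Score every node of a network by its proximity to a set of seed nodes.

    adj     : symmetric adjacency matrix (n x n) of the network
    seeds   : list of integer indices of the seed nodes (e.g. known cancer genes)
    restart : probability of jumping back to the seeds at each step (assumed value)
    """
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # leave isolated nodes as zero columns
    transition = adj / col_sums            # column-normalized transition matrix

    p0 = np.zeros(adj.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)     # walker starts uniformly on the seeds

    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * transition @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:  # L1 change below tolerance: converged
            return p_next
        p = p_next
    return p


# Toy example (hypothetical data): a path network 0 - 1 - 2 - 3 with gene 0 as the seed.
toy_adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
scores = random_walk_with_restart(toy_adj, seeds=[0])
ranking = np.argsort(-scores)              # candidate genes, highest score first
```

In this toy network the scores fall off with distance from the seed, which is the property that lets the steady-state probabilities be used to shortlist candidate genes near known cancer genes in a PPI network.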
English abstract:	Cancer is among the most severe human diseases. Understanding the mechanisms of cancer occurrence and development, discovering biomarkers that can identify early characteristics of cancer, and predicting the survival status of patients with cancer are important tasks in cancer research. Since genetic variation is the main reason that normal cells are transformed into cancer cells, identifying key genes and modules in the development of cancer and understanding their mechanisms of action is a major challenge in cancer research. Cancer, as a complex disease, is associated with a variety of factors. Using pattern recognition and machine learning on multi-omics data, the various factors influencing the development of cancer can therefore be mined and identified more comprehensively. This dissertation, taking lung cancer as the research object, studies methods for mining cancer-related genes and modules based on multi-omics data. The four main pieces of work are as follows.

(1) A method for identifying cancer-related genes with a prognostic role based on gene expression data and the PPI network. To understand the causes of cancer occurrence and development and to effectively distinguish between high- and low-risk groups, a method is proposed to identify cancer-related genes with a prognostic role via the random walk algorithm and the Cox proportional hazards model. First, known cancer-related genes in the protein-protein interaction (PPI) network are taken as seed genes, and the random walk with restart algorithm is used to identify candidate cancer-related genes. Then, using the univariate Cox proportional hazards model, gene expression data are screened to identify cancer survival-related genes. Finally, prognostic cancer-related genes are obtained from the candidate cancer-related genes and the survival-related genes. The experimental results suggest that the key genes are closely related to cancer and can be used as prognostic biomarkers for cancer.

(2) A method for ranking cancer-related genes based on mutual information and network embedding. To understand the molecular mechanism of cancer and find effective therapeutic targets, a method is proposed to rank cancer-related genes based on mutual information and network embedding. First, using mutual information, gene expression data are screened to identify differentially expressed genes as candidate genes. Next, a network embedding method is applied to the PPI network, and a low-dimensional representation of each gene is learned according to a designed functional similarity function. Then, using the Euclidean distance as the distance metric, the candidate genes whose functions are most similar to those of the known cancer genes are selected as the key genes. Finally, the key genes are ranked by betweenness centrality to obtain prioritized cancer-related genes. The experimental results suggest that the identified genes are closely related to the occurrence and development of cancer.

(3) A method for mining cancer-related gene modules based on multi-omics data. Compared with individual cancer-related genes, multi-gene modules are more accurate and reliable as biomarkers for cancer diagnosis and prognosis. First, genomics data are screened to obtain candidate cancer-related genes. Next, to address the nonlinearity of the various regulatory mechanisms in biological networks, a method is proposed to integrate multi-omics data based on random forest regression. Key regulatory factors that affect gene expression are identified, and a genome-wide regulatory network is constructed. Then, the differential expression regulatory network of the candidate cancer-related genes is constructed, and the dysregulated gene sets affected by the candidate cancer-related genes are mined. Finally, considering the topological structure and biological information of genes in the PPI network, a method is proposed to mine key gene modules based on a gene functional network. With gene functional similarity as the distance metric between candidate cancer-related genes, a density-based clustering algorithm is used to mine cancer-related gene modules. The experimental results suggest that the proposed method can effectively identify cancer-related gene modules, and that the key genes have an important impact on the development of cancer and are closely related to patient survival.

(4) A method for cancer survival prediction based on multi-omics data. To effectively identify prognostic biomarkers for cancer and establish a reliable survival prediction model, a multi-omics method named LCox&DCox_MDAE, which uses deep learning and feature selection techniques, is proposed for cancer survival prediction. First, the univariate Cox proportional hazards model is applied to each omics data set to obtain cancer survival-related genes. Next, because cancer is associated with a variety of factors, a method is proposed to integrate multi-omics data based on an autoencoder. Then, using the Lasso-Cox method, candidate genes are screened to obtain prognostic genes for cancer. Finally, to build a nonlinear survival prediction model, a deep neural network and the Cox model are combined into a deep Cox model that predicts the survival status of patients. The experimental results suggest that the identified genes have a good prognostic role and can be used as prognostic biomarkers for cancer. Taken together, the proposed method achieves higher prediction accuracy than other methods, including those that use only single-omics data.
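For reference, the Cox-based steps named in the abstract (univariate Cox screening, Lasso-Cox selection, and the deep Cox model) build on the standard Cox proportional hazards formulation, which the abstract does not spell out. The block below gives the textbook form, assuming no tied event times; it is not necessarily the exact objective used in the thesis. The symbols $h_0$, $\beta$, $x_i$, $\delta_i$, $R(t_i)$, $\lambda$, and $f_\theta$ are notation assumed here.

```latex
% Hazard for patient i with covariate vector x_i and baseline hazard h_0
h(t \mid x_i) = h_0(t)\,\exp\!\left(\beta^{\top} x_i\right)

% Partial likelihood over observed events (\delta_i = 1);
% R(t_i) is the set of patients still at risk at time t_i
L(\beta) = \prod_{i:\,\delta_i = 1}
  \frac{\exp\!\left(\beta^{\top} x_i\right)}
       {\sum_{j \in R(t_i)} \exp\!\left(\beta^{\top} x_j\right)}

% Lasso-Cox: L1-penalized negative log partial likelihood
\hat{\beta} = \arg\min_{\beta}\; \bigl[ -\log L(\beta) + \lambda \lVert \beta \rVert_1 \bigr]

% Deep Cox variant: the linear predictor is replaced by a network output f_\theta(x)
\ell(\theta) = -\sum_{i:\,\delta_i = 1}
  \Bigl[ f_\theta(x_i) - \log \sum_{j \in R(t_i)} \exp\!\bigl(f_\theta(x_j)\bigr) \Bigr]
```

In the univariate case $x_i$ is a single gene's measurement, so one $\beta$ is fitted and tested per gene. The L1 penalty drives some coefficients to exactly zero, which is why Lasso-Cox can act as a feature selector.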
Holding location:	Library thesis reading area (Main Library, South Wing, 3rd floor, Area BC)
Call number:	博081203/20008
Open access date:	2022-04-21
