View thesis information

Chinese title:

 RaSE-Based Signature Gene Screening and Disease Prediction for Hepatocellular Carcinoma

Author name:

 黄鹂鸣    

Confidentiality level:

 Public

Thesis language:

 chi    

Discipline code:

 025200    

Discipline:

 Applied Statistics

Student type:

 Master's

Degree:

 Master of Applied Statistics

Degree type:

 Professional degree

Degree year:

 2024    

Campus:

 Zhuhai campus

School:

 School of Statistics

Research area:

 Data Science and Management

First supervisor name:

 陈菲菲    

First supervisor affiliation:

 School of Statistics

Submission date:

 2024-06-09    

Defense date:

 2024-05-25    

Foreign-language title:

 Hepatocellular Carcinoma Signature Gene Screening and Disease Prediction Using Random Subspace Ensemble    

Chinese keywords:

 Signature gene ; Hepatocellular carcinoma ; Random subspace ensemble ; Machine learning

Foreign-language keywords:

 Signature Gene ; HCC ; Random Subspace Ensemble ; Machine Learning    

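Chinese abstract:

Hepatocellular carcinoma (HCC) is the most common primary liver cancer, with high incidence and mortality. Gene expression profiling lets researchers analyze genes that are differentially expressed between tumor tissue and adjacent non-tumor tissue and screen for signature genes associated with HCC, which matters for early detection, prognosis, and treatment. Compared with traditional statistical methods, machine learning models play an increasingly important role in signature gene selection because of their higher accuracy. This study uses Random Subspace Ensemble (RaSE), a recent variable selection method, to select HCC signature genes and builds a classification model to predict the disease.
After experimentation, this study pairs a Support Vector Machine (SVM) with the RaSE framework, forming the RaSE-SVM method. Three common signature gene selection methods are used for comparison: random forest, LASSO, and SVM-RFE. So that the comparison methods could run, all 34,262 genes in the dataset were first screened by differential expression analysis, yielding 5,159 differentially expressed genes for signature gene selection. Because the sample size is small, and to make the four methods easier to compare, the dataset was randomly split 7 times for training: the averages of the 7 feature selection and classification results are taken as the final result, and the genes common to all 7 RaSE-SVM selections are taken as the final signature genes, giving 15 important signature genes.
The classification results show that RaSE-SVM is the best of all methods on all four evaluation metrics: accuracy (0.9541), precision (0.9529), recall (0.9610), and F1 score (0.9558), and that it performs stably across different dataset splits. The AUC of its ROC curve reaches 0.932, indicating good classification performance. The results show that RaSE feature selection mitigates, to some extent, the shortcomings of the classical methods, and the model's good performance demonstrates the validity of the selected signature genes. The 15 important signature genes can serve as a reference for subsequent prognosis prediction and pathological analysis, and for exploring the specific pathways through which they affect HCC progression.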
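The four evaluation metrics reported in the abstract can be computed from confusion-matrix counts and then averaged over repeated splits. The following is a minimal sketch, not the thesis code: the function names and the label coding (1 for tumor, 0 for non-tumor) are assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels.

    Assumed coding (not stated in the abstract): 1 = tumor, 0 = non-tumor.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def average_metrics(runs):
    """Average each metric over a list of per-split metric dictionaries."""
    return {name: sum(run[name] for run in runs) / len(runs)
            for name in runs[0]}
```

Calling `classification_metrics` once per random split and passing the resulting list to `average_metrics` mirrors the averaging over repeated splits that the abstract describes.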
 

Foreign-language abstract:

Hepatocellular carcinoma (HCC) is the most common primary liver cancer, with high morbidity and mortality. Gene expression profiling has enabled researchers to analyze differentially expressed genes in cancer tissues and paracancerous tissues and to screen for signature genes associated with HCC, which is important for early detection, prognosis, and treatment of cancer. Compared with traditional statistical methods, machine learning models are playing an increasingly important role in signature gene selection because of their higher accuracy. This study uses the cutting-edge variable selection method Random Subspace Ensemble (RaSE) to select HCC signature genes and constructs a classification model to predict the disease.
HCC gene expression data from the Gene Expression Omnibus (GEO) database were used for analysis, comprising 70 tumor tissue samples and 70 non-tumor tissue samples. A Support Vector Machine (SVM) model was combined with the RaSE framework (forming the RaSE-SVM method) for signature gene selection, and three common signature gene selection methods were chosen for comparison: Random Forest, LASSO, and SVM-RFE. So that the comparison methods could run, all 34,262 genes in the dataset were first screened by differential expression analysis, yielding 5,159 differentially expressed genes for signature gene selection. Because the test set is small, different methods tend to reach the same accuracy, so the random division of the dataset was repeated several times for training. The averages of the feature selection and classification results across repetitions were taken as the final result, and the overlap of the repeated selections was taken as the final set of signature genes.
The classification results show that the RaSE-SVM method is the best of all methods on all four evaluation metrics: accuracy (0.9541), precision (0.9529), recall (0.9610), and F1 score (0.9558), and that it performs stably across different dataset divisions. The AUC of its ROC curve reaches 0.932, indicating that the model has good classification performance. The results show that the RaSE feature selection method mitigates, to a certain extent, the shortcomings of the classical methods, and the good performance of the model demonstrates the validity of the resulting signature genes. The 15 important signature genes can serve as a reference for subsequent prognostic prediction and pathological analysis, and for exploring the specific pathways through which they influence the progression of HCC.
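The repeated-split procedure described above, where only features chosen in every repetition are kept, can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the selector here is a simple stand-in that ranks features by the absolute difference of class means (it is not RaSE-SVM), and the split ratio, `k`, and all function names are hypothetical.

```python
import random


def split_indices(n, test_fraction, rng):
    """Randomly split row indices into (train, test) lists."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = max(1, int(round(n * test_fraction)))
    return idx[n_test:], idx[:n_test]


def top_k_by_mean_difference(X, y, rows, k):
    """Stand-in selector (NOT RaSE): rank features by |mean_1 - mean_0|."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    if not pos or not neg:
        raise ValueError("training rows must contain both classes")
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    scores.sort(reverse=True)
    return {j for _, j in scores[:k]}


def stable_features(X, y, repeats=7, test_fraction=0.3, k=3, seed=0):
    """Select features on each random split; keep those chosen every time.

    test_fraction and k are illustrative; the abstract does not state them.
    """
    rng = random.Random(seed)
    selected_sets = []
    for _ in range(repeats):
        train_rows, _test_rows = split_indices(len(X), test_fraction, rng)
        selected_sets.append(top_k_by_mean_difference(X, y, train_rows, k))
    return set.intersection(*selected_sets), selected_sets
```

Taking the intersection rather than the union is the conservative choice: a feature survives only if it is selected on every split, which favors genes whose signal does not depend on one particular partition of a small sample.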
 

Total references:

 36    

Holding location:

 Main Library B301

Call number:

 硕025200/24019Z    

Release date:

 2025-06-11    
