Thesis information

Chinese title:

 基于过采样方法和神经网络模型对肝细胞癌患者的生存分析的改善    

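Name:

 Yang Xinran (杨昕冉)

Confidentiality level:

 Public

Thesis language:

 Chinese

Discipline code:

 025200

Discipline:

 Applied Statistics

Student type:

 Master's

Degree:

 Master of Applied Statistics

Degree type:

 Professional degree

Degree year:

 2020

Campus:

 Beijing campus

School:

 School of Statistics / Institute of National Accounts

Research area:

 Survival prediction

First supervisor's name:

 Li Hui (李慧)

First supervisor's affiliation:

 School of Statistics, Beijing Normal University

Submission date:

 2020-06-22

Defense date:

 2020-05-25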

Foreign-language title:

 Improvement of Survival Prediction for HCC Patients Based on Oversampling and Neural Network

Chinese keywords:

 神经网络 ; 过采样 ; 生存预测 ; 肝细胞癌    

Foreign-language keywords:

 Neural network ; Oversampling ; Survival prediction ; Hepatic cellular cancer    

Chinese abstract:

在过去的几年中,全球范围内癌症发病率和相关死亡人数不断增长,而肝癌是目前全球第六大最常被诊断出的癌症,并且是因癌症及其相关因素致死的第二大原因。肝细胞癌(HCC)占原发性肝癌的90%以上,是全球最常见的肿瘤之一。肝细胞癌(Hepatic Cellular Cancer,HCC)是指起源于肝脏上皮细胞的癌症,大约90%的肝细胞癌与已知的潜在危险因素有关。最常见的因素包括慢性病毒性肝炎(乙型或丙型)和肝硬化。关于这两种肝炎病毒,其相应的主要标志物都涉及特定抗原和抗体的检测,而肝硬化通常用Child-Pugh(CP)评分进行评估,该评分采用了五种肝脏疾病的临床测量指标(总胆红素,白蛋白,脑病变,腹水和凝血酶原时间)。80%以上的肝细胞癌病例中均存在肝硬化情况,因此肝硬化也被确定为该病理的主要前体病变。

多年来,对于肝细胞癌的研究已经从个性化治疗方面着手,包括使用机器学习技术从临床数据中提取特征等,以协助临床医生进行更有针对性的决策。其中,神经网络算法由于其特有的学习能力,已越来越成为适合癌症研究的计算技术。通过神经网络对临床数据的分析,为很多正处于研究的疾病获得了宝贵的新结论。

但是,目前的研究也存在一些局限性:有的研究技术并不是针对肝细胞癌患者专门设计的,预测精度上无法保证;有的研究方法则有严格的适用范围,没有考虑到患者之间的异质性或数据缺失的问题。这是医疗相关数据分析中的常见缺陷,也是本文对肝细胞癌患者的数据分析方法进行改进的重要原因。在实际的肝细胞癌患者生存数据收集中,患者数据常出现样本缺失、数据不平衡及存在异质性等问题。

为了对肝细胞癌患者生存数据进行生存判别分析,本文先是对来自葡萄牙的肝细胞癌患者样本数据进行了指标分析及缺失值统计,并针对上述三种生存患者样本数据存在的常见问题,尝试在数据处理阶段、数据采样方法及分析模型三方面对传统生存判别分析进行改进。本文重点处理了样本数据存在的数据集不平衡、存在异质性和缺失值比例较高问题,并最终评价了预测的精确程度。根据评价结果,本文得出了针对样本数据较为良好的处理方法,即在HEOM插值法下,SMOTE过采样结合FAGABPNN的模型;同时也验证了这些改进是有效的。

Foreign-language abstract:

Over the past few years, cancer incidence and cancer-related deaths have been increasing worldwide. Liver cancer is the sixth most commonly diagnosed cancer in the world and the second leading cause of death from cancer and related factors. Hepatic cellular cancer (HCC) accounts for more than 90% of primary liver cancers and is one of the most common tumors in the world. HCC refers to cancer originating in the epithelial cells of the liver. About 90% of HCC cases are associated with known underlying risk factors, the most common being chronic viral hepatitis (B or C) and cirrhosis. For both hepatitis viruses, the main markers rely on the detection of specific antigens and antibodies, while cirrhosis is usually assessed with the Child-Pugh (CP) score, which combines five clinical measures of liver disease (total bilirubin, albumin, encephalopathy, ascites and prothrombin time). Cirrhosis is present in more than 80% of HCC cases and is therefore regarded as the main precursor lesion of this pathology.

For many years, research on HCC has approached the disease from the angle of personalized treatment, including the use of machine learning techniques to extract features from clinical data and so help clinicians make more targeted decisions. Among these techniques, neural network algorithms, because of their distinctive learning ability, have become increasingly well suited to cancer research. Neural network analysis of clinical data has yielded valuable new conclusions for many diseases under study.

However, current research has some limitations. Some techniques are not designed specifically for patients with HCC, so their prediction accuracy cannot be guaranteed. Other methods have a strictly limited scope of application and do not account for heterogeneity among patients or for missing data. These are common shortcomings in the analysis of medical data, and they are an important reason why this paper seeks to improve the data analysis methods used for HCC patients. In practice, survival data collected from HCC patients often suffer from missing values, class imbalance and heterogeneity.

To carry out survival discriminant analysis on the survival data of patients with HCC, this paper first performs an indicator analysis and missing-value statistics on a sample of HCC patients from Portugal. To address the three common problems above, it attempts to improve on traditional survival discriminant analysis in three respects: the data processing stage, the data sampling method and the analysis model. The paper focuses on the imbalance, the heterogeneity and the high proportion of missing values in the sample data, and finally evaluates the accuracy of the resulting predictions. Based on the evaluation results, the paper identifies a comparatively good processing approach for the sample data, namely HEOM-based imputation followed by SMOTE oversampling combined with the FAGABPNN model, and verifies that these improvements are effective.
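
The abstract names HEOM as the basis for handling mixed-type, partly missing patient records but does not spell it out. As an illustration only, the following is a minimal sketch of the Heterogeneous Euclidean-Overlap Metric in its usual textbook form: numeric attributes contribute a range-normalized absolute difference, nominal attributes contribute 0 or 1, and any comparison involving a missing value contributes the maximum distance of 1. The function name `heom_distance`, the use of `None` for missing values, and the `is_nominal` and `ranges` parameters are assumptions made for this sketch; the thesis itself may implement the metric and the imputation built on it differently.

```python
import math


def heom_distance(x, y, is_nominal, ranges):
    """Heterogeneous Euclidean-Overlap Metric between two records.

    x, y       -- sequences of attribute values; None marks a missing value
    is_nominal -- sequence of bools, True where the attribute is categorical
    ranges     -- (max - min) of each numeric attribute over the data set;
                  entries for nominal attributes are ignored
    """
    total = 0.0
    for xa, ya, nominal, rng in zip(x, y, is_nominal, ranges):
        if xa is None or ya is None:
            # A missing value on either side counts as maximally distant.
            d = 1.0
        elif nominal:
            # Overlap metric for categorical attributes.
            d = 0.0 if xa == ya else 1.0
        else:
            # Range-normalized difference for numeric attributes.
            d = abs(xa - ya) / rng if rng > 0 else 0.0
        total += d * d
    return math.sqrt(total)


# Hypothetical records: one numeric, one nominal, one numeric with a gap.
a = [1.0, "a", None]
b = [3.0, "b", 5.0]
dist = heom_distance(a, b, [False, True, False], [4.0, None, 10.0])
```

In a nearest-neighbour imputation scheme, a distance of this kind would be used to find the most similar complete records, whose values then fill the gaps; that step is not shown here.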

Total number of references:

 19    

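Author biography:

 Yang Xinran (杨昕冉, b. 1996), female, from Chengdu, Sichuan Province. From 2014 to 2018 she studied Mathematics and Applied Mathematics at Southwestern University of Finance and Economics, earning degrees in science and in economics. From 2018 to 2020 she studied Applied Statistics at Beijing Normal University, earning a Master of Applied Statistics. The research area of her master's thesis was survival prediction.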

Call number:

 硕025200/20050    

Open access date:

 2021-06-22    
