Thesis information

Title (Chinese):

 Research on Identifying Alternative Splicing Without a Reference Genome

Name:

 Zhang Quanbao (张泉宝)

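Confidentiality level:

 Public

Thesis language:

 chi

Discipline code:

 071300

Discipline:

 Ecology

Student type:

 Doctoral

Degree:

 Doctor of Science

Degree type:

 Academic degree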

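Degree year:

 2023

Campus:

 Beijing campus

College:

 College of Life Sciences

Research area:

 Bioinformatics

First supervisor:

 Pang Erli (庞尔丽)

First supervisor's affiliation:

 College of Life Sciences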

Submission date:

 2023-06-08

Defense date:

 2023-06-04

Title (English):

 REFERENCE-FREE PREDICTION OF ALTERNATIVE SPLICING EVENTS IN A TRANSCRIPTOME    

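Keywords (Chinese):

 Reference-free ; Transcriptome ; Alternative splicing ; Mixed k-mer ; Colored de Bruijn graph ; Convolutional neural network ; XGBoost ; Web server

Keywords (English):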

 Reference-free ; Transcriptome ; Alternative splicing ; Mixed k-mer colored de Bruijn graph ; Attention-based CNN ; XGBoost ; Laravel    

Abstract (Chinese):

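Alternative splicing is the process by which a single gene locus produces different transcripts. As an important post-transcriptional modification mechanism in eukaryotes, it greatly increases the diversity of the transcriptome and proteome, is an important source of phenotypic diversity, and plays an important role in environmental adaptation and evolution. Advances in sequencing technology have provided powerful tools for studying alternative splicing, but for species without a reference genome, the lack of gene structure information severely limits such studies. Accurately and comprehensively identifying genome-wide alternative splicing from transcriptome sequences alone, without relying on a reference genome, is a prerequisite for studying alternative splicing in non-model species. To this end, this study proposes two strategies that use only transcriptome sequences to identify alternative splicing events, one based on linear sequence alignment and one based on graphs. It classifies the events with two models, a sequence-based deep learning model and a feature-matrix-based traditional machine learning model. It applies the algorithms to second-generation sequencing data of Lindera obtusiloba and third-generation sequencing data of Amborella. Finally, it makes the four algorithms available to researchers as a web server. The specific results are as follows: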
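(1) Based on the characteristics of transcript sequences, this study proposes an identification method based on BLASTN sequence alignment, which determines whether an alternative splicing event has occurred by analyzing the positional relationship between adjacent high-scoring segment pairs (HSPs). The method identifies four types of alternative splicing events: exon skipping, alternative acceptor sites, alternative donor sites, and intron retention. In human and Arabidopsis thaliana, precision is 91.30% and 94.70% and recall is 80.32% and 82.50%, respectively. Compared with existing methods, our method greatly improves recall at similar precision.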
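(2) The BLASTN-based method cannot identify three types of alternative splicing events: alternative first exon, alternative last exon, and mutually exclusive exons. This study therefore introduces, for the first time, a mixed k-mer colored de Bruijn graph to identify alternative splicing events. Sequences are converted into a graph of nodes and edges. By recognizing three different topological structures, the method captures multiple kinds of differences between sequences more precisely and distinguishes alternatively spliced transcripts from transcripts of homologous genes more accurately, which greatly improves identification accuracy. On the human and Arabidopsis thaliana data, precision reaches 98.17% and 99.31% and recall reaches 93.45% and 95.34%, respectively. Compared with existing methods, our method not only identifies all seven types of alternative splicing events but also improves both precision and recall.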
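(3) In the two identification algorithms above, four event types (exon skipping, alternative acceptor sites, alternative donor sites, and intron retention) cannot be told apart, because their identification patterns and topological structures are similar. This study therefore uses the sequences upstream and downstream of alternative splice sites, having found that the 50 bp of sequence around a splice site meets the information-entropy requirement, to train an attention-based convolutional neural network (CNN) multi-class model that assigns event types. Two models were trained, one on the human and one on the Arabidopsis thaliana alternative splicing dataset. Evaluated on independent datasets, the models reach accuracies of 88.49% and 90.73%, respectively, and perform better than existing models. Finally, the interpretability of the CNN model was examined, showing that event type is related to the GT-AG rule.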
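(4) Because deep learning models are hard to interpret, this study also surveyed the literature on the occurrence of alternative splicing events and, from the alternatively spliced regions and their upstream and downstream sequences, built a 1982-dimensional feature matrix comprising structural, sequence, and functional features, introducing motifs as functional features for the first time. An XGBoost multi-class model was trained to predict event types. Evaluated on independent datasets, the human and Arabidopsis thaliana models reach accuracies of 93.44% and 94.09%, respectively, outperforming both earlier traditional machine learning models and the deep learning models above. This further suggests that when sufficient prior knowledge about the problem is available and interpretability matters, a traditional machine learning model may be the better choice.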
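(5) The four algorithms were also applied to real transcriptome sequencing data. First, the identification algorithms and classification models were applied to full-length Amborella transcripts from third-generation sequencing and validated against a reference-genome-based alternative splicing dataset. Precision and recall were 98.71% and 93.02%, respectively, both higher than those of other algorithms. In the transcriptome analysis of Lindera obtusiloba, the identification algorithms were applied to the full-length transcriptome assembled de novo from second-generation sequencing data to identify and remove alternatively spliced transcripts, leaving an accurate reference transcriptome. Differential expression analysis and functional annotation were then carried out on samples from the northern and southern populations and from a common garden, showing that genes related to lignin synthesis and ERF-related transcription factor families are associated with the morphological differences between north and south.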
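(6) This study also developed a user-friendly, dynamic, interactive online web server that integrates the identification and classification algorithms into the site's analysis pipeline. Users only need to upload full-length transcript sequences and choose a suitable algorithm and parameters to obtain the alternative splicing events. The results are displayed online and are available for download.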
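In summary, this study proposes algorithms for genome-wide identification and classification of alternative splicing using only full-length transcript sequences, without a reference genome, comprising two identification algorithms and two classification algorithms. They were validated on alternative splicing datasets from two model species, human and Arabidopsis thaliana, and on real transcriptome sequencing data from Amborella and Lindera obtusiloba, and achieved very good results in all cases, indicating that the algorithms are both accurate and broadly applicable. Together with the user-friendly online analysis platform, this work provides a powerful tool for using alternative splicing to study phenotypic diversity, environmental adaptation, and evolution in non-model species.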
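The idea in item (1), inferring alternative splicing from the positional relationship of adjacent HSPs, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the `HSP` record, the `candidate_events` function, and the `min_gap` and `max_slack` thresholds are hypothetical, and only plus-strand alignments with 1-based inclusive coordinates are considered. Two adjacent HSPs that are contiguous on one transcript but separated on the other suggest that the second transcript carries an extra segment.

```python
from typing import List, NamedTuple, Tuple


class HSP(NamedTuple):
    """One high-scoring segment pair between a query and a subject transcript
    (hypothetical record; coordinates are 1-based and inclusive)."""
    q_start: int
    q_end: int
    s_start: int
    s_end: int


def candidate_events(hsps: List[HSP], min_gap: int = 10,
                     max_slack: int = 3) -> List[Tuple[str, int, int]]:
    """Flag adjacent HSP pairs that look like an inserted or skipped segment.

    Returns (side, start, end) tuples, where `side` names the transcript
    that carries the extra segment and start/end delimit that segment.
    The thresholds are illustrative assumptions, not values from the thesis.
    """
    ordered = sorted(hsps, key=lambda h: h.q_start)
    events = []
    for left, right in zip(ordered, ordered[1:]):
        q_gap = right.q_start - left.q_end - 1  # unaligned bases on the query
        s_gap = right.s_start - left.s_end - 1  # unaligned bases on the subject
        if abs(q_gap) <= max_slack and s_gap >= min_gap:
            events.append(("subject", left.s_end + 1, right.s_start - 1))
        elif abs(s_gap) <= max_slack and q_gap >= min_gap:
            events.append(("query", left.q_end + 1, right.q_start - 1))
    return events


# The subject has 50 extra bases between the two aligned blocks.
print(candidate_events([HSP(1, 100, 1, 100), HSP(101, 200, 151, 250)]))
```

A sketch like this only reports that a segment is present in one transcript and absent in the other. It does not say which of the four event types produced it, which is the gap the classification models in items (3) and (4) address.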
 

Abstract (English):

As an important post-transcriptional modification mechanism in eukaryotes, alternative splicing not only increases the diversity of the transcriptome and proteome but is also an important source of phenotypic diversity, playing an important role in the environmental adaptation and evolution of organisms. The development of sequencing technology has provided powerful tools for the study of alternative splicing, but for species without a reference genome, the lack of gene structure information greatly limits such study. Accurately and comprehensively identifying genome-wide alternative splicing using only transcriptome sequences, without relying on a reference genome, is a prerequisite for alternative splicing research in non-model species. To this end, this study proposes two strategies, one based on sequence alignment and one based on a de Bruijn graph, to identify alternative splicing events using only transcriptome sequences. The events are then classified with two models, a sequence-based deep learning model and a feature-matrix-based machine learning model. The algorithms are applied to RNA-seq transcriptome data of Lindera obtusiloba and Iso-Seq data of Amborella. Finally, the four algorithms are provided as a web server for researchers to use. The specific results are as follows:
(1) Based on the characteristics of transcript sequences, this study proposes an identification method based on BLASTN alignment, which determines whether an alternative splicing event has occurred by analyzing the positional relationship between adjacent high-scoring segment pairs (HSPs). The method identifies four types of alternative splicing events: exon skipping, alternative acceptor sites, alternative donor sites, and intron retention. In human and Arabidopsis thaliana, precision is 91.30% and 94.70% and recall is 80.32% and 82.50%, respectively. Compared with existing methods, our method greatly improves recall at similar precision.
(2) The identification method based on BLASTN alignment cannot identify three types of alternative splicing events: alternative first exon, alternative last exon, and mutually exclusive exons. This study therefore introduces, for the first time, a mixed k-mer colored de Bruijn graph to identify alternative splicing events. Sequences are transformed into a graph of vertices and edges, and three different topological structures are recognized in it. This makes it possible to identify the three event types above. It also captures multiple kinds of differences between sequences more accurately and so distinguishes alternatively spliced transcripts from homologous gene transcripts, which greatly improves the accuracy of identification. On human and Arabidopsis thaliana data, precision reaches 98.17% and 99.31% and recall reaches 93.45% and 95.34%, respectively. Compared with existing methods, our method not only identifies all seven types of alternative splicing events but also improves precision and recall.
(3) In the two algorithms above, whether based on sequence alignment or on graph comparison, four event types (exon skipping, alternative acceptor sites, alternative donor sites, and intron retention) cannot be distinguished from one another. This study therefore uses the sequences upstream and downstream of alternative splice sites to train an attention-based convolutional neural network (CNN) multi-class model for event type classification. Two models were trained, one on the human and one on the Arabidopsis thaliana alternative splicing dataset. On evaluation, the models reach accuracies of 88.49% and 90.73%, respectively, and perform better than existing models. We also analyzed the interpretability of the model and found that event type is related to the GT-AG rule.
(4) Because deep learning models are hard to interpret, we constructed a 1982-dimensional feature matrix containing structural, sequence, and functional features based on alternatively spliced regions and their upstream and downstream sequences, and trained an XGBoost multi-class model to predict event types. On evaluation, the human and Arabidopsis thaliana models reach accuracies of 93.44% and 94.09%, respectively, outperforming both previous traditional machine learning models and our deep learning models above. This further suggests that when sufficient prior knowledge about the problem is available and interpretability matters, a traditional machine learning model may be the better choice.
(5) The four algorithms were also applied to real transcriptome sequencing data. First, the identification algorithms and classification models were applied to Iso-Seq full-length transcripts of Amborella and validated against a reference-genome-based alternative splicing dataset. Precision and recall were 98.71% and 93.02%, respectively, both higher than those of other algorithms. In the transcriptome analysis of Lindera obtusiloba, the identification algorithms were applied to the full-length transcriptome assembled from second-generation sequencing data to identify and remove alternatively spliced transcripts, retaining an accurate reference transcriptome. Differential expression analysis and functional annotation were then conducted on samples from the northern and southern populations of Lindera obtusiloba and from a common garden, showing that genes related to lignin synthesis and ERF-related transcription factor families are associated with the morphological differences between north and south in China.
(6) This study also developed a user-friendly, dynamic, interactive online web server that integrates the identification and classification algorithms into the website's analysis pipeline. Users only need to upload full-length transcript sequences and select an appropriate algorithm and parameters to obtain the alternative splicing events. The results are displayed online and are available for download.
In summary, this study proposes algorithms for genome-wide alternative splicing identification and classification using only full-length transcript sequences, without a reference genome, comprising two identification algorithms and two classification algorithms. Alternative splicing datasets from two model species, human and Arabidopsis thaliana, and real transcriptome sequencing data from Amborella and Lindera obtusiloba were used for validation. The algorithms achieved very good results, indicating that they are both accurate and versatile. Combined with a user-friendly online analysis platform, this research provides a powerful tool for using alternative splicing to study phenotypic diversity, environmental adaptation, and evolution in non-model species.
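The graph-based strategy in item (2) can be illustrated with a toy colored de Bruijn graph. The sketch below makes several simplifying assumptions and is not the thesis's algorithm: it uses a single fixed `k` rather than mixed k-mers, represents the graph only by its colored k-mers, and the function names (`kmers`, `color_kmers`, `private_runs`) and the example sequences are hypothetical. Each k-mer is colored by the transcripts that contain it, and a run of k-mers private to one transcript, flanked on both sides by shared k-mers, corresponds to one branch of a bubble.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def color_kmers(transcripts: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of transcript names (colors) containing it."""
    colors: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            colors[km].add(name)
    return colors


def private_runs(name: str, seq: str, colors: Dict[str, Set[str]],
                 k: int) -> List[Tuple[int, int]]:
    """Index ranges (inclusive) of k-mers found only in `name` that are
    flanked on both sides by k-mers shared with another transcript."""
    runs = []
    start = None
    for i, km in enumerate(kmers(seq, k)):
        private = colors[km] == {name}
        if private and start is None:
            start = i
        elif not private and start is not None:
            if start > 0:  # require a shared k-mer before the run as well
                runs.append((start, i - 1))
            start = None
    return runs  # a run reaching the sequence end is a tip, not a bubble


# Toy transcripts: "long" carries a 6-base segment that "short" lacks.
left, extra, right = "ACGTACCA", "GGGTTT", "CATGACTG"
transcripts = {"long": left + extra + right, "short": left + right}
colors = color_kmers(transcripts, k=4)
for name, seq in transcripts.items():
    print(name, private_runs(name, seq, colors, k=4))
```

In this toy case both transcripts share the k-mers on either side and diverge in between, which gives one private run per transcript. Separating such bubbles from the differences seen between homologous genes, and handling the other topologies the abstract mentions, would need additional rules that this record does not specify.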
 

Total number of references:

 175    

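Author profile:

 The main work of this research is the omics-level identification of alternative splicing using only transcriptome sequences, without relying on a reference genome, together with the classification of event types. Writing of the doctoral thesis began once the main research work was complete. In terms of content, the thesis follows the structure of background, experimental work, and summary and outlook, with eight chapters in total. Apart from the background and outlook chapters, the other six chapters are each organized into four parts (introduction, data and methods, results and discussion, and summary), aiming for a clear structure and coherent logic. In terms of formatting, the table of contents, headings, footnotes, figures and tables, and equations follow the Beijing Normal University rules for writing degree theses (《北京师范大学学位论文撰写规则(2015版)》). The references in particular were checked entry by entry and formatted according to the national standard GB/T 7714-2005 (《文后参考文献著录规则》). In presenting results, the thesis uses many figures and tables alongside the text to show results intuitively, keeps the color scheme consistent across figures as far as possible, and keeps the Chinese and English figure captions and table headers concise yet complete. The writing of the thesis would not have been possible without the professional guidance of my supervisor, Pang Erli, and the careful proofreading of typos and formatting by junior lab members, which together led to the completed doctoral thesis of 90,000 characters.

Holding location:

 Library thesis reading area (Main Library, South Wing, 3rd floor, Zone BC)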

Call number:

 博071300/23004    

Release date:

 2024-06-21    
