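Chinese title: | Improving genome annotation using collinearity information between wild and cultivated cucumber |
Name: | |
Discipline code: | 071020 |
Discipline/major: | |
Student type: | Master's |
Degree: | Master of Science |
Degree year: | 2013 |
Campus: | |
School: | |
Research area: | Evolutionary genomics |
Primary advisor name: | |
Primary advisor affiliation: | |
Submission date: | 2013-05-26 |
Defense date: | 2013-05-23 |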
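Chinese abstract: |
Genome annotation is a multi-level, multi-layered process, and its quality directly determines the value of a genome sequence. Although many annotation tools exist and have achieved good results in prokaryotes, the complexity of eukaryotic gene structure means that no single annotation method suits all cases. Annotation methods based on evidence integration are therefore widely used. With the rapid development of sequencing technology, genome sequences of closely related species are increasingly available; closely related species retain a great deal of ancestral information and are highly conserved. Comparing the genomes of closely related species yields large collinear regions that are rich in homology information, which suggests that collinearity information can be used to further improve annotation accuracy.

Cucumber is an important economic crop in the Cucurbitaceae and a model organism for studying sex-determination mechanisms. Whole-genome sequencing of cultivated cucumber was completed in 2009, and wild cucumber has also been sequenced (unpublished) to allow comparative study. In this study, we used cultivated and wild cucumber as research materials and improved the genome annotations of these two close relatives, based on the plant genome annotation platform built in our laboratory and on their collinearity information. In total, 909 genes were newly annotated in wild cucumber and 853 in cultivated cucumber. In wild cucumber, using RNA-Seq and other evidence, we found 87 cases in which a gene with a long ORF had been wrongly annotated as several genes with shorter ORFs, and 40 cases in which several genes with shorter ORFs had been over-predicted as a single gene with a long ORF; in cultivated cucumber we identified 166 and 36 such cases, respectively. We then used the genes missing from the annotations of wild and cultivated cucumber as an evaluation set to analyze possible problems in our laboratory's annotation platform. We found that, during data preprocessing, RepeatMasker masks some special genes when filtering repetitive or low-complexity sequences. In addition, some de novo programs cannot predict genes on both the forward and reverse strands of the same DNA region; they tend to keep the gene with the longer ORF and discard the gene with the shorter ORF. We also found that the protein alignment software Scipio assembles overly long protein alignments, with adjacent HSPs too far apart, which suggests that we need to further adjust Scipio's assembly parameters or switch to other protein alignment software for more accurate genome annotation.

Finally, on a Linux/Apache/MySQL platform, we integrated the genome browser GBrowse to visualize the genome annotations of wild and cultivated cucumber, and used the plugin GBrowse_syn to display the whole-genome alignment between wild and cultivated cucumber.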
|
English abstract: |
Genome annotation is a critical step in assigning biological meaning to DNA sequences; a genome is only as valuable as its annotation. Many gene-prediction programs have been developed for genome annotation and have achieved remarkable success. However, their output is still unsatisfactory, particularly for complex eukaryotic organisms, so additional evidence is applied to improve annotation quality. Because collinearity at the DNA level between closely related genomes carries valuable signatures for identifying orthologous genes, collinearity information can be used to discover unpredicted genes and to refine gene models in one genome by comparison with the other. In this study, we set out to improve the annotations of wild cucumber (Cucumis sativus var. hardwickii) and domestic cucumber (Cucumis sativus var. sativus) by comparing their whole-genome sequences. We found 909 new putative genes in wild cucumber and 853 in domestic cucumber. We also obtained 87 new gene models in wild cucumber and 166 in domestic cucumber by merging multiple genes, and we split 40 genes in wild cucumber and 36 in domestic cucumber into multiple new genes. We then analyzed the causes of annotation anomalies using the sets of missing annotations. Our results show that some genes located in regions masked by RepeatMasker may be missed during annotation. In addition, some de novo programs have difficulty predicting overlapping potential ORFs, and the Scipio parameters we used may not be optimal. To make the annotation information for wild and domestic cucumber accessible to other researchers, we integrated the genome browser GBrowse to display the annotations, and used GBrowse_syn to display synteny blocks between wild and domestic cucumber.
|
Total references: | 3 |
Call number: | 硕070120/1301 |
Open-access date: | 2013-05-26 |