Thesis information

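Chinese title (translated):

 Identification of collinearity based on conserved DNA segments and its application to functional annotation of the cucumber genome (基于保守DNA片段的共线性识别及其在黄瓜基因组功能注释中的应用)

Name:

 Song Hongtao (宋宏涛)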

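Confidentiality level:

 Public

Thesis language:

 Chinese

Discipline code:

 071300

Discipline:

 Ecology

Student type:

 Doctoral

Degree:

 Doctor of Science

Degree type:

 Academic degree

Degree year:

 2018

Campus:

 Beijing campus

School:

 College of Life Sciences

Research area:

 Bioinformatics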

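Primary advisor:

 Lin Kui (林魁)

Primary advisor's affiliation:

 College of Life Sciences, Beijing Normal University

Submission date:

 2021-04-22

Defense date:

 2017-05-16

Foreign-language title: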

 THE IDENTIFICATION OF THE COLLINEARITY BASED ON CONSERVED DNA SEGMENTS AND ITS APPLICATION IN FUNCTIONAL ANNOTATION OF CUCUMBER GENOME    

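Chinese keywords (translated):

 Cucumber genome ; Gene functional annotation ; Orthology ; Collinear segments ; Protein-coding genes ; Conserved noncoding sequences ; Plant genomes ; Cucurbitaceae

Foreign-language keywords:

 Cucumber genome ; Gene functional annotation ; Orthology ; Collinear segments ; Protein-coding gene ; Conserved noncoding sequence ; Plant genome ; Cucurbitaceae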

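Chinese abstract (translated):

With the rapid development of sequencing technology, a growing number of species have had their whole genomes sequenced, yielding large numbers of predicted protein sequences that have not been experimentally validated; the problem is especially acute in non-model species. Functionally annotating these predicted proteins has therefore become one of the major tasks of current bioinformatics and genomics research. In addition, comparative analyses of related genomes have uncovered a growing number of conserved noncoding sequences (CNSs) shared by multiple species, and there is evidence that they may take part in biological processes such as gene transcription and the regulation of gene expression.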

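Traditional strategies for gene functional annotation rely mainly on comparing the sequence similarity of genes or proteins, but transferring gene function generally requires accurate orthology relationships as a foundation. Collinearity is one of the main lines of evidence for inferring orthology. Current methods for identifying collinearity rely mainly on the conservation of gene order and orientation, yet previous studies have shown that this approach has limitations when applied to plant genomes. Events such as whole-genome duplication, frequent gene transposition and horizontal gene transfer, together with a high proportion of transposable elements and tandem repeats, often disrupt gene order in plant genomes. Moreover, a growing body of comparative genomics research indicates that the basic unit considered in evolutionary studies is shifting from the gene to shorter and more stable structural units in the genome. Based on this understanding, we carried out the following four pieces of work centered on the functional annotation of the cucumber genome: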

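1) We selected 15 angiosperm genomes from Cucurbitaceae and related species and used the LASTZ/MULTIZ pipeline to obtain a 15-way whole-genome alignment. All aligned conserved blocks were extracted as anchor sequences (Multiple Alignment Anchors, MAAs) for the subsequent detection of collinearity and identification of functional elements. A comparison of the features of the two classes of genomic markers, MAAs and protein-coding gene families, showed that MAAs are conserved DNA-level markers that do not depend on gene structure annotation, are distributed more evenly across the genome and cover a wider range, but are shorter in length.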

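2) Using MAAs and protein-coding genes as two classes of genomic markers, we identified collinear segments at different levels between cucumber and the 14 other related angiosperm genomes. Collinear segments identified with MAA markers were shorter on average but far more numerous than those identified with protein-coding gene markers; they were also more widely distributed and covered many intergenic and noncoding regions. Jaccard similarity analysis of the collinearity results from the two marker classes showed that they differ considerably at the multi-species level, but are highly similar at the level of pairwise collinearity between two species, especially for closely related species. Although the difference between the two grew as divergence distance increased, the proportion of segments detected specifically by the MAA-based method also grew. We conclude that the MAA-based method is better than the method based on protein-coding genes at identifying collinear segments between distantly diverged species.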

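3) Based on the collinearity information between cucumber and the 14 other related angiosperm genomes, we inferred orthologous genes between cucumber and the other plants, and then annotated the biological functions of cucumber protein-coding genes according to these orthology relationships. We compared the collinearity-based annotation results with those of traditional methods. The results show that the collinearity-based method is roughly comparable to traditional methods in gene coverage but clearly better in annotation redundancy. Similarity analysis of the different annotation results indicates that collinearity-based functional annotation is a relatively comprehensive function-transfer strategy rather than a comparison based on sequence similarity alone, and that it may also capture information about gene families across species. In addition, validation of the annotation results using gene co-expression information shows that the collinearity-based method yields higher annotation quality than traditional methods.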

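4) Using collinearity information and the multi-species whole-genome alignment, we identified 4,636 CNSs in three Cucurbitaceae species. Most of these CNSs lie upstream or downstream of genes or within gene introns, are short, and contain significantly enriched sequence motifs. As with known CNSs in animals and other species, the genes neighboring Cucurbitaceae CNSs are often enriched for biological functions related to plant development and DNA/RNA binding. Interestingly, the central region of these CNSs shows a significant drop in AT content relative to the flanking sequences, and the central region shows a high propensity for nucleosome occupancy. These results suggest that Cucurbitaceae CNSs may play a role in gene transcription and expression regulation.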

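This thesis integrates collinearity information inferred from conserved DNA segments to complete a genome-wide functional annotation of the protein-coding genes in the cucumber genome, and identifies and annotates noncoding sequence elements (CNSs) that are highly conserved in three Cucurbitaceae plants. Through this case study, we built a workflow that uses conserved DNA segments as markers to identify collinear segments among multiple species. Compared with the traditional gene-marker approach it has certain advantages: it identifies more numerous, shorter collinear segments and is better suited to identifying collinearity between species with long divergence times. The results of collinearity-based functional annotation of protein-coding genes also show that integrating collinearity information can improve the quality of gene functional annotation. Finally, we discuss the shortcomings and limitations of the method and point out directions for further work.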

Foreign-language abstract:

With the rapid development of high-throughput sequencing technology, more and more genomes have been assembled, each accompanied by a large number of predicted protein-coding genes. However, most of these predicted genes have not been functionally validated by any experimental strategy, especially in non-model species. Thus, one of the urgent tasks in the post-genome era is to assign accurate functional annotations to these predicted protein-coding genes. Meanwhile, comparative analyses of sets of related genomes have identified a large number of conserved noncoding sequences (CNSs) that are conserved across many species. Interestingly, many if not most of these CNSs are candidate regulatory elements involved in gene transcription and regulation.

Functional annotation of newly predicted genes generally relies on traditional approaches such as sequence similarity comparison or conserved domain identification. It is now recognized that accurate orthology relationships among genes are better and more robust proxies for transferring function to predicted genes. Moreover, gene collinearity is one of the most reliable sources of evidence for orthology inference among genes from multiple related genomes. Currently, the identification of collinear segments is mainly based on the conservation of gene order and orientation. Unfortunately, during plant genome evolution, events such as whole-genome duplication, the reshuffling of short DNA segments by mobile elements, and horizontal gene transfer (HGT) often destroy or distort gene order along chromosomes, which makes genes difficult to use as genome-wide markers in plant genome comparison. In fact, a growing number of comparative genomic studies have demonstrated that smaller units, such as evolutionarily stable domains or segments, serve better as genome-wide markers in whole-genome comparison. In this study, we used the cucumber (Cucumis sativus L.) genome as a case study and attempted to accurately annotate its potential functional elements genome-wide. To this end, four major tasks were performed as follows.

First, using a local pipeline for LASTZ/MULTIZ analysis, we aligned 15 genomes, comprising cucumber and 14 other sequenced related angiosperms, and identified a set of DNA segments highly conserved across all 15 angiosperms. The aligned blocks included in the alignment chain were taken as anchor sequences (Multiple Alignment Anchors, MAAs) and used as markers to identify collinear segments and functional regulatory elements. Compared with protein-coding genes, the identified MAAs were shorter, more numerous, and more uniformly distributed within the cucumber genome. Importantly, the set of identified MAAs appeared to be independent of the annotated gene structures.
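The anchor-extraction step described above could be sketched as follows. This is only an illustration under stated assumptions, not the thesis pipeline: it assumes the multiple alignment is available in MAF text format with `species.chromosome` source names, and it keeps a block when it contains the reference species and at least `min_species` distinct species. The function names, the `min_species` threshold, and the species labels in the usage note are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, int, int, str, int]  # (src, start, size, strand, src_size)


def iter_maf_blocks(lines: Iterable[str]) -> Iterator[List[Row]]:
    """Yield each MAF alignment block as a list of its "s" rows."""
    block: List[Row] = []
    for raw in lines:
        fields = raw.split()
        if not fields or fields[0] == "a":
            # A blank line or a new "a" line closes the current block.
            if block:
                yield block
            block = []
        elif fields[0] == "s" and len(fields) == 7:
            _, src, start, size, strand, src_size, _text = fields
            block.append((src, int(start), int(size), strand, int(src_size)))
    if block:
        yield block


def extract_anchors(
    lines: Iterable[str], ref_species: str, min_species: int
) -> List[Tuple[str, int, int]]:
    """Return forward-strand reference intervals (chrom, start, end) of kept blocks.

    Assumption: a block counts as an anchor when it contains the reference
    species and at least `min_species` distinct species in total.
    """
    anchors = []
    for block in iter_maf_blocks(lines):
        species = {row[0].split(".", 1)[0] for row in block}
        ref_rows = [row for row in block if row[0].split(".", 1)[0] == ref_species]
        if not ref_rows or len(species) < min_species:
            continue
        src, start, size, strand, src_size = ref_rows[0]
        if strand == "-":
            # MAF minus-strand starts are counted from the reverse-complement end.
            start = src_size - start - size
        chrom = src.split(".", 1)[1] if "." in src else src
        anchors.append((chrom, start, start + size))
    return anchors
```

For example, calling `extract_anchors(open("alignment.maf"), "cucumber", 15)` on a hypothetical file would keep only blocks in which all 15 genomes are aligned; lowering `min_species` trades conservation depth for marker density.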

Second, for these 15 genomes, using the identified MAAs and the annotated protein-coding genes as two different sets of genome-wide markers, we inferred two sets of collinear genomic segments between cucumber and each of the 14 other genomes. Again, in contrast to the result obtained with protein-coding genes as markers, the collinear segments identified from MAAs were shorter, more numerous, and also covered many intergenic or noncoding regions. Jaccard similarity measures showed that the two sets agreed only weakly at the multiple-species level, although pairwise comparisons showed very high similarity, especially for closely related genomes. Although the Jaccard similarity between the two sets of collinear genomic segments decreased rapidly with increasing divergence time from cucumber, the MAA-based method detected more method-specific collinear segments than the method based on protein-coding genes. Our results support the conclusion that, as genome-wide markers, MAAs may be more suitable than protein-coding genes for identifying collinear segments among distantly related genomes.
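To make the Jaccard comparison above concrete, here is a minimal sketch of one way such a similarity could be computed. The abstract does not say how the thesis defines the measure, so this version assumes a base-pair-level Jaccard index on the reference genome: the number of bases covered by both segment sets divided by the number covered by either. The interval representation and function names are illustrative assumptions.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), 0-based, half-open


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Sort intervals and merge those that overlap or touch on one chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def covered_length(intervals: Iterable[Interval]) -> int:
    """Total number of bases covered, counting overlapping bases once."""
    return sum(end - start for _, start, end in merge_intervals(intervals))


def jaccard_bp(set_a: Iterable[Interval], set_b: Iterable[Interval]) -> float:
    """Base-pair Jaccard index between two sets of genomic segments."""
    a, b = list(set_a), list(set_b)
    union = covered_length(a + b)
    # Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|.
    intersection = covered_length(a) + covered_length(b) - union
    return intersection / union if union else 0.0
```

With this definition, two segment sets that cover the same bases score 1.0 even if one set splits the region into many short segments, which matters when comparing short MAA-based segments with longer gene-based ones.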

Third, using the collinear segments identified above, we inferred putative orthologous gene pairs between cucumber and each of the 14 other species. Each orthologous pair was then used as a proxy to transfer annotation to the corresponding cucumber gene whenever the partner gene had a known biological function. Our results showed that this annotation strategy greatly reduced annotation redundancy while achieving annotation coverage comparable to that of traditional methods. A similarity analysis of the different annotation results further indicated that the collinearity-based method is a more comprehensive transfer strategy than sequence similarity comparison alone. We also assessed annotation quality using gene co-expression profiles, which indicated that annotation accuracy could be improved by taking genomic collinearity into account.
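The orthology-based transfer described above can be illustrated with a small sketch. It is not the thesis implementation: the gene identifiers and term identifiers in the usage note are placeholders, and "coverage" and "redundancy" are assumed here to mean the fraction of genes that receive at least one term and the mean number of distinct terms per annotated gene. The abstract does not give the exact definitions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Set, Tuple


def transfer_annotations(
    ortholog_pairs: Iterable[Tuple[str, str]],
    known_terms: Mapping[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Copy functional terms from annotated orthologs to target genes.

    ortholog_pairs: (target_gene, ortholog_gene) pairs inferred from collinearity.
    known_terms: ortholog_gene -> set of functional terms with known function.
    """
    transferred: Dict[str, Set[str]] = defaultdict(set)
    for target_gene, ortholog_gene in ortholog_pairs:
        terms = known_terms.get(ortholog_gene)
        if terms:  # skip orthologs with no known function
            transferred[target_gene].update(terms)
    return dict(transferred)


def coverage_and_redundancy(
    transferred: Mapping[str, Set[str]], all_genes: Iterable[str]
) -> Tuple[float, float]:
    """Return (fraction of genes annotated, mean distinct terms per annotated gene)."""
    genes = set(all_genes)
    annotated = [gene for gene in genes if transferred.get(gene)]
    coverage = len(annotated) / len(genes) if genes else 0.0
    redundancy = (
        sum(len(transferred[gene]) for gene in annotated) / len(annotated)
        if annotated
        else 0.0
    )
    return coverage, redundancy
```

For instance, with hypothetical pairs `[("CsaG1", "AT1"), ("CsaG1", "VvA")]` and known terms for `AT1` and `VvA`, `CsaG1` receives the union of both term sets, so several orthologs that agree on a function do not inflate the count.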

In addition, we identified 4,636 CNSs that originated from the alignment of all 15 genomes. For the set of CNSs located in the collinear segments of three Cucurbitaceous genomes, we found that most were short and scattered upstream of, downstream of, or within the introns of annotated genes. Moreover, these CNSs were often flanked by annotated genes involved in plant development or DNA/RNA-binding processes, and some of the CNSs were significantly enriched for conserved motifs. Interestingly, a drop in A+T content near the border of the CNSs and a higher nucleosome occupancy probability in the center of the CNSs were also observed. These results suggest that many if not most of the Cucurbitaceous CNSs could be candidate regulatory elements involved in gene transcription and expression regulation.
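The A+T content observation above is the kind of result a positional profile can show. The following sketch assumes each input string is a CNS together with equal-length flanking sequence on both sides, so that the CNS midpoint sits at the string's midpoint; it then averages the A/T fraction at each offset from that midpoint. The function name and the windowing scheme are assumptions for illustration, not the analysis used in the thesis.

```python
from typing import List, Sequence


def at_profile(sequences: Sequence[str], half_window: int) -> List[float]:
    """Per-position A+T fraction in a window centered on each sequence midpoint.

    Index 0 of the result corresponds to offset -half_window and the last index
    to offset +half_window. Positions outside a sequence, and characters other
    than A, C, G, T (for example N), are ignored.
    """
    width = 2 * half_window + 1
    at_counts = [0] * width
    totals = [0] * width
    for raw in sequences:
        seq = raw.upper()
        center = len(seq) // 2
        for offset in range(-half_window, half_window + 1):
            pos = center + offset
            if 0 <= pos < len(seq) and seq[pos] in "ACGT":
                totals[offset + half_window] += 1
                if seq[pos] in "AT":
                    at_counts[offset + half_window] += 1
    # Offsets with no usable base are reported as NaN rather than 0.
    return [a / t if t else float("nan") for a, t in zip(at_counts, totals)]
```

A profile that dips toward the middle of the window and rises again toward both ends would correspond to lower A+T content inside the CNSs than in their flanks.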

In summary, we identified collinear genomic segments across 15 plant genomes using conserved DNA segments as markers. Compared with segments identified using genes as markers, they were shorter but far more numerous, particularly for distantly related genomes. Based on these collinear genomic segments, we performed functional annotation of the predicted cucumber protein-coding genes, and identified and annotated putative Cucurbitaceous CNSs in relation to their adjacent genes. This study demonstrates that the quality of cucumber gene functional annotation can be partly improved by integrating such inferred collinear genomic segments with traditional protein sequence comparison.

Total number of references:

 151    

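About the author:

 Song Hongtao, a native of Jinzhong, Shanxi, was admitted in 2002 to the Biological Sciences program of the College of Life Sciences, Beijing Normal University. In 2006 Song was admitted by recommendation to the Institute of Biochemistry and Molecular Biology of the same college to pursue a master's degree, and in 2009 stayed on to work at the university. In 2013 Song was admitted to the Computational Molecular Biology Laboratory of the Institute of Ecology, Beijing Normal University, to pursue a doctorate under Professor Lin Kui. Song's main research interests are the annotation of plant functional genomes and the prediction and identification of conserved noncoding elements.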

Holding location:

 Library thesis reading area (Zone BC, third floor, south section of the main library)

Call number:

 博071300/18019

Release date:

 2022-04-22
