Thesis information

Chinese title:

 基于向量支持主题识别的文本关键词提取    

Name:

 王韬略 (Wang Taolüe)

Discipline code:

 120502    

Discipline:

 Information Science

Student type:

 Master's

Degree:

 Master of Management

Degree year:

 2012    

Campus:

 Beijing campus

School:

 School of Management

Research direction:

 Text analysis

First supervisor:

 耿骞 (Geng Qian)

First supervisor's affiliation:

 Beijing Normal University

Submission date:

 2012-05-28    

Defense date:

 2012-05-25    

Foreign-language title:

 Text keywords extraction based on support vector theme identification    

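Chinese abstract (translated):
This thesis addresses short online texts. Drawing on existing topic-identification and keyword-extraction algorithms, and on the characteristic features and propagation behavior of such texts, it designs a practically workable improved algorithm. The design is adapted to the network environment, taking into account how quickly online texts spread and how their structure differs from that of traditional texts. The result is a formalized topic-keyword extraction algorithm for this class of text that can be referenced and generalized and has broad application prospects. It is offered as an exploratory experiment for research on opinion-orientation analysis, large-scale online public-opinion analysis, and related work including ontology construction. The main contents are:
1) Techniques and methods of Chinese text processing, including automatic Chinese word segmentation and part-of-speech tagging, sentence-constituent analysis and word-frequency weight calculation, and ambiguous structures and strategies for handling them.
2) Design of the keyword extraction algorithm. By comparing the strengths and weaknesses of traditional algorithms and extraction models, the thesis improves on them and proposes a new parameter-setting model that introduces dependency relations, global text characteristics, and contextual information. It carries out extensive experiments and comparisons on feature-vector values and their experimental effect and, weighing the advantages and shortcomings of existing methods, proposes an improved keyword extraction algorithm of practical significance with a demonstrable, substantial efficiency gain.
3) Other research. The thesis also examines the principles of keyword extraction algorithms that combine statistics and rules, sample feature selection, parameter estimation, and text classification algorithms, and surveys the SVM-based derivative models of recent years, providing a theoretical and practical basis for improving keyword extraction.
4) The thesis uses real text corpora obtained from CAOE. The improved keyword algorithm proposed for real online short texts can be widely applied in related fields and serves as a useful supplement to, and groundwork for, text processing and semantic understanding in the new environment.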
Foreign-language abstract:
This thesis focuses on short online texts. Using existing text-topic and keyword-extraction algorithms, together with the characteristic features and propagation behavior of such texts, it designs an improved algorithm of operational significance. Because online text spreads quickly and differs structurally from traditional text, the design is adapted to the network environment. The outcome is a formalized topic-keyword extraction algorithm for this kind of text that others can learn from and extend and that has broad application prospects. It serves as an experimental exploration for opinion-orientation analysis of text, large-scale online public-opinion analysis, and a series of related studies including ontology construction. The main contents include:
1) Chinese text processing techniques and methods, including automatic Chinese word segmentation and part-of-speech tagging, sentence-constituent analysis and word-frequency weight calculation, and ambiguous structures and their processing strategies.
2) Keyword extraction algorithm design. After comparing the advantages and disadvantages of traditional algorithms and extraction models, the thesis improves on them and proposes a new parameter-setting model that introduces dependency relations, the global characteristics of the text, and contextual information. It runs a wide range of experiments and comparisons on the values of the feature vector and their experimental effects. Combining the advantages and shortcomings of existing methods, it proposes a keyword extraction algorithm improvement of practical significance with a provable, substantial gain in efficiency.
3) Other studies. The thesis also focuses on the principles of keyword extraction algorithms based on a combination of statistics and rules, sample feature selection, parameter estimation, and text classification algorithms. It surveys the models derived from SVM improvements in recent years, providing a theoretical and practical basis for improving keyword extraction.
4) The thesis uses a real text corpus obtained from CAOE. The improved keyword algorithm proposed for real online short texts can be widely applied to related fields. It is also an effective complement to text processing and semantic understanding in the new environment and paves the way for future research.
Total number of references:

 53    

Call number:

 硕120502/1203    

Release date:

 2012-05-28    
