Thesis Information


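Title (Chinese):	综合风险主题爬虫的研究与实现
Author:	He Yichao (何毅超)
Confidentiality level:	Public
Discipline code:	081202
Discipline:	Computer Software and Theory
Student type:	Master's
Degree:	Master of Engineering
Degree year:	2009
Campus:	Beijing campus
College:	College of Information Science and Technology
Research direction:	Networks and Information Systems
Primary supervisor:	Zhou Mingquan (周明全)
Primary supervisor's affiliation:	College of Information Science and Technology, Beijing Normal University
Submission date:	2009-06-09
Defense date:	2009-06-06
Title (English):	Research and Implementation of Integrated Risk Focused Crawler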
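Abstract (translated from Chinese):	The volume of information on the Internet is growing geometrically and users' demands on online information keep rising, yet traditional search engines cannot guarantee comprehensive or precise coverage of integrated risk information. An information retrieval platform for integrated risk is therefore needed, one that uses modern information technology to analyze integrated risk and disaster information on the web so that it can be controlled and guided. Given the enormous and fast-growing number of web pages, vertical search engines aimed at specific domains will be more practical and will complement general-purpose search engines. The focused crawler is one of the most critical and most complex components of a vertical search engine system, and its performance directly determines the final quality of the search engine. Research on focused (topic-oriented) web crawlers has become a hot topic for next-generation search engines and crawler technology. The research in this thesis covers four aspects:
(1) The thesis first analyzes current progress in focused crawler research, proposes an integrated risk focused crawler model and a topical information discovery strategy, and builds a topic detection engine and topic domain to judge the topical relevance of hyperlinks.
(2) Based on the characteristics of crawling topical forums, it proposes a forum crawling strategy built on a Web Forum Software Detection (WFSD) algorithm, which improves the efficiency of fetching forum pages; the strategy's harvest rate and recall are both clearly better than those of the breadth-first algorithm.
(3) It analyzes several multi-pattern matching algorithms and, starting from the Modified Wu-Manber algorithm, makes improvements for Chinese text and the practical needs of the topic detection engine, proposing and implementing a fast multi-byte string matching algorithm (FMBWM) that speeds up the processing of Chinese strings.
(4) Building on the theory and techniques above, it designs and implements an Integrated Risk Focused Crawler (IRFC) with a good harvest rate and good extensibility.
This work was supported by a key project of the National Science and Technology Support Program of the 11th Five-Year Plan, “Key Technology Research and Demonstration of Integrated Risk Guardians (IRG)” (2006BAD20B02).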
Abstract (English):	With the rapid growth of web page information on the WWW, it is getting harder to retrieve information and knowledge relevant to Integrated Risk. It is therefore necessary to construct a retrieval platform for Integrated Risk information, through which disaster information can be analyzed, controlled, and guided. As the web booms, topic-specific search engines will become more useful and will serve as an efficient complement to general search engines. The focused crawler is a key component of a vertical search engine, because the final quality of the search engine depends on the crawler's performance. Research on focused crawlers has become a hot topic for the next generation of search engines and crawlers. The major research work and contributions of this dissertation are as follows:
(1) Research progress on focused crawlers is investigated. Based on an analysis of traversal strategies, a focused crawler model is proposed, which uses several strategies for information discovery. A topic detection engine and topic domain are built to predict a web page's topical relevance before downloading it, which improves the crawler's efficiency.
(2) For Web forum crawling, a new forum crawling strategy is proposed, based on a “Web Forum Software Detection” (WFSD) algorithm. Its harvest rate and recall are markedly better than those of the breadth-first strategy.
(3) Multiple pattern matching algorithms are analyzed. A new multiple pattern matching algorithm for Chinese strings, “Fast Multi-Byte Wu-Manber” (FMBWM), is proposed and implemented on the basis of the Modified Wu-Manber algorithm. It speeds up Chinese string matching.
(4) An Integrated Risk Focused Crawler (IRFC) is designed and implemented, with a high harvest rate and excellent extensibility.
The research is supported by the Key National Science and Technology Project of the “11th Five-Year Plan”, “Key Technology Research and Demonstration of Integrated Risk Guardians” (No. 2006BAD20B02).
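Total references:	66
About the author:	He Yichao (何毅超), male. Research interests: networks and information systems, network security, search engines. Paper published during master's study: He Yichao. 文件夹加密原理和破解方法的综述 (A survey of folder encryption principles and cracking methods) [J]. 网络安全技术与应用 (Network Security Technology & Application), 2008, (1): 89-91.
Call number:	硕081202/0904
Release date:	2009-06-09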
