Thesis information

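Chinese title:	Design and Implementation of a Topic-Based Web Resource Collection System (基于主题的Web资源采集系统的设计与实现)
Name:	Shu Wei (舒维)
Confidentiality level:	Internal
Discipline code:	120502
Discipline:	Information Science
Student type:	Master's
Degree:	Master of Management
Degree year:	2008
Campus:	Beijing campus
School:	School of Management
Research direction:	Information management technology
First supervisor:	Li Guangjian (李广建)
First supervisor's institution:	Beijing Normal University
Submission date:	2008-06-10
Defense date:	2008-06-01
Chinese keywords:	Topic-specific search engine ; Nutch ; Lucene ; MVC ; AJAX ; Chinese word segmentation ; Vector space model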
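Chinese abstract (English translation):	In today's networked information age, the volume of information on the Web keeps growing, and how to obtain high-quality information has become a widely discussed question. Traditional search engines can quickly retrieve relevant Web resources for users, but they require an accumulation of Web pages that consumes enormous space and time, and they lack diversified ways of organizing results. For institutions that need resources on particular specialized topics, a lightweight, topic-oriented Web resource collection system is more useful. The topic-specific search engine, currently a hot issue in search engine research, aims to build a repository of Web information resources for a given topic or discipline. It focuses on acquiring topic-relevant pages, applies filtering mechanisms to discard irrelevant pages, and covers only the Web regions related to the specific topic, so it can crawl to greater depth and with shorter crawl cycles. When ranking query results, it gives higher priority to pages with higher topic relevance, and can therefore meet users' requirements for fast, accurate, and comprehensive access to information resources. This thesis studies how to build on the open-source search engine project Nutch, using the Struts MVC (Model-View-Controller) framework and AJAX (Asynchronous JavaScript and XML) dynamic Web page technology, to design a topic-oriented Web resource collection system that has a good user interface and is usable in practice. The system collects and processes Web resources according to user-defined topics, builds a specialized resource repository from them, and provides users with the resources they need in several ways. The thesis first briefly introduces the current state of topic-specific search engines and then introduces Nutch. It next discusses the overall system design in detail and explores how to use the MVC structure and AJAX to build a stable yet flexible system architecture. Later chapters discuss in detail the crawling strategy of the topic crawler, Chinese word segmentation, topic relevance judgment, and resource filtering, and present and evaluate the system. Finally, the thesis summarizes the research and outlines directions for future work.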
English abstract:	With the rapid growth of information on the Internet, it is becoming increasingly important for users to obtain high-quality information. Although traditional search engines can quickly retrieve relevant information for a user's query, they require a great deal of space and time to accumulate Web resources, and they lack diverse ways of organizing search results. A lightweight, topic-specific Web resource collection system is therefore more suitable for institutions and individuals with special information needs. The topic-specific search engine, which aims to build a Web resource repository for specific topics or disciplines, is currently a hot topic in search engine research. It concentrates on crawling Web pages related to the desired topics, filters out irrelevant information by various methods, and covers only the areas of the Internet that are topic-related. As a result, this kind of search engine can crawl deeper than traditional search engines while taking less time. When presenting search results, a topic-specific search engine can give topic-related resources higher priority, which makes it easier for users to find the information they most want and so satisfies their information demands quickly, precisely, and comprehensively. This thesis studies how to implement a user-friendly, practical, topic-specific Web resource collection system based on Nutch, an open-source search engine project from Apache, using the Struts MVC framework and AJAX dynamic Web techniques. The system is designed to crawl and process information on the Internet according to topics set by the user, to establish a resource repository from the collected information, and finally to provide these resources to users in several ways.
The thesis first introduces the current state of topic-specific search engines and then gives a brief introduction to Nutch. It then discusses in detail the strategy of the topic-specific crawler, Chinese word segmentation, the topic similarity judgment algorithm, and resource filtering, and evaluates the system's performance and running status. Finally, it summarizes the thesis work and points out directions for future development and improvement.
Total references:	24
Call number:	硕120502/0817
Release date:	2008-06-10
