Crawlers are an important component of search engines: they follow the hyperlinks in web pages automatically and collect resources of all kinds. To improve the efficiency of collecting resources on a specific topic, text classification techniques are used to guide the crawl. This paper applies automatic text classification based on support vector machines (SVM) to a chemistry-focused crawler: an SVM classifier scores the pages the crawler fetches, and the scores steer it toward chemistry-related pages. Comparison with a non-focused crawler based on the breadth-first algorithm and with a focused crawler based on keyword matching shows that the focused crawler based on the SVM classifier effectively improves collection efficiency for chemistry Web resources.
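The idea of letting a classifier score steer the crawl can be illustrated with a minimal sketch. It is not the paper's implementation. The toy link graph, the term weights, and the names `TOY_WEB`, `svm_like_score`, `focused_crawl`, `breadth_first_crawl`, and `on_topic_fraction` are all hypothetical. The hand-set linear scorer only stands in for the decision value of a trained SVM, and the rule that a link inherits the score of the page it was found on is an assumption about how the scores guide the crawler.

```python
import heapq
from collections import deque
from typing import Callable, Dict, List, Tuple

Page = Tuple[str, List[str]]  # (page text, outgoing links)

# Hypothetical in-memory "web" so the sketch runs without network access.
TOY_WEB: Dict[str, Page] = {
    "seed": ("portal with assorted links", ["sports", "organic"]),
    "sports": ("football match results", ["league"]),
    "organic": ("organic chemistry reaction molecule", ["synthesis"]),
    "league": ("tennis league scores", []),
    "synthesis": ("molecule synthesis catalyst chemistry", []),
}

# Made-up weights standing in for a trained linear SVM (not from the paper).
WEIGHTS = {
    "chemistry": 1.0, "molecule": 0.8, "reaction": 0.6,
    "catalyst": 0.6, "synthesis": 0.5, "football": -0.7, "tennis": -0.7,
}
BIAS = -0.5


def svm_like_score(text: str) -> float:
    """Linear decision value w . x + b over term counts; > 0 means on-topic."""
    return sum(WEIGHTS.get(tok, 0.0) for tok in text.lower().split()) + BIAS


def focused_crawl(seed: str, fetch: Callable[[str], Page],
                  score: Callable[[str], float],
                  max_pages: int) -> List[Tuple[str, float]]:
    """Best-first crawl: each link inherits the score of its parent page."""
    frontier = [(0.0, 0, seed)]  # min-heap of (-score, discovery order, url)
    seen = {seed}
    visited: List[Tuple[str, float]] = []
    counter = 1
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        page_score = score(text)
        visited.append((url, page_score))
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-page_score, counter, link))
                counter += 1
    return visited


def breadth_first_crawl(seed: str, fetch: Callable[[str], Page],
                        score: Callable[[str], float],
                        max_pages: int) -> List[Tuple[str, float]]:
    """Non-focused baseline: visit pages in discovery (FIFO) order."""
    queue = deque([seed])
    seen = {seed}
    visited: List[Tuple[str, float]] = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        visited.append((url, score(text)))  # scored only for evaluation
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


def on_topic_fraction(visited: List[Tuple[str, float]]) -> float:
    """Share of visited pages with a positive score (an assumed metric)."""
    return sum(1 for _, s in visited if s > 0) / len(visited)


if __name__ == "__main__":
    fetch = TOY_WEB.__getitem__
    for crawl in (focused_crawl, breadth_first_crawl):
        pages = crawl("seed", fetch, svm_like_score, 4)
        print(crawl.__name__, [u for u, _ in pages], on_topic_fraction(pages))
```

With a budget of four pages on this toy graph, the best-first crawler reaches the second chemistry page before the second sports page, while the breadth-first baseline does the opposite. In a real setting the scorer would be replaced by a classifier trained on labeled chemistry and non-chemistry pages, and `fetch` by an HTTP client with link extraction.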