This paper proposes a new method for extracting the main content of web pages, based on tree pruning and information entropy. The method analyzes the templates and main content found on web pages, selects the content page located by information entropy, converts that page into a DOM tree, removes the noise nodes, generates a common extraction path, and uses it to extract the related pages. Experiments show that the method reduces search complexity and improves both search accuracy and search efficiency.
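The abstract does not say how entropy is computed or which nodes count as noise, so the following is only a minimal sketch of the general pipeline (parse to a tree, prune noise nodes, score nodes by entropy, report a path), not the authors' implementation. The noise tag list, the word-level Shannon entropy score, the sample page, and every function name here are assumptions made for illustration. It uses Python's standard `xml.etree.ElementTree`, so the input must be well-formed markup.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed noise tags; the source does not list which nodes count as noise.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}

# Hypothetical, well-formed sample page used only for this illustration.
PAGE = """<html><body>
<nav><a>Home</a><a>About</a></nav>
<div><p>Entropy based extraction keeps the block whose words are most varied</p></div>
<footer><p>Copyright notice</p></footer>
</body></html>"""


def prune(node):
    """Remove noise subtrees in place (a removed element takes its tail text with it)."""
    for child in list(node):
        if child.tag in NOISE_TAGS:
            node.remove(child)
        else:
            prune(child)


def entropy(text):
    """Shannon entropy, in bits, of the word distribution in text."""
    words = text.split()
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def own_text(node):
    """Text directly inside a node, excluding text nested in child elements."""
    parts = [node.text or ""]
    parts += [child.tail or "" for child in node]
    return " ".join(parts)


def best_path(root):
    """Return (path, score) for the node whose own text has the highest entropy."""
    best, best_score = "", -1.0
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        score = entropy(own_text(node))
        if score > best_score:
            best, best_score = path, score
        stack.extend((child, f"{path}/{child.tag}") for child in node)
    return best, best_score


root = ET.fromstring(PAGE)
prune(root)
path, score = best_path(root)
print(path, round(score, 2))
```

In this sketch the path found on one page plays the role of the extraction path: pages generated from the same template could be queried with the same path instead of being scored again, which is one plausible reading of how a shared path would reduce search cost.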