This paper introduces the concept of edit distance, discusses how to construct a tag tree, and uses a tag tree matching algorithm to quantify the structural similarity of web pages. The algorithm is applied to Web information extraction: a URL similarity algorithm first performs coarse clustering of the sample pages, and a tree similarity matching algorithm then performs fine clustering to obtain the template pages. Building on the template pages, the structural similarity algorithm is applied again and combined with template-based extraction rules to extract web pages automatically. Experiments show that introducing the algorithm effectively improves the wrapper's extraction accuracy and its degree of semi-automation.
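As an illustration of how a tag tree can be built and two pages compared structurally, the following is a minimal Python sketch. It is not the paper's algorithm, whose details are not given in the abstract. It assumes the well-known Simple Tree Matching recurrence as a stand-in for the tree matching step and normalizes the match count by the combined tree sizes. The names `Node`, `TagTreeBuilder`, `build_tag_tree`, `simple_tree_matching`, and `structure_similarity` are hypothetical, and the URL-based coarse clustering step is not covered.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One node of the tag tree: a tag name plus ordered children."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree from HTML, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")  # artificial root so every page has one tree
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray close tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tree_size(node):
    return 1 + sum(tree_size(child) for child in node.children)


def simple_tree_matching(a, b):
    """Size of the largest order-preserving match between two trees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best match between the first i children of a and first j of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 for the matched roots themselves


def structure_similarity(html_a, html_b):
    """Assumed normalization: 2 * matched nodes / total nodes, in [0, 1]."""
    tree_a, tree_b = build_tag_tree(html_a), build_tag_tree(html_b)
    matched = simple_tree_matching(tree_a, tree_b)
    return 2 * matched / (tree_size(tree_a) + tree_size(tree_b))
```

Under these assumptions, `structure_similarity` returns 1.0 for two pages with identical tag structure regardless of their text content, and smaller values as the structures diverge. A fine clustering step could then group pages whose pairwise score exceeds a chosen threshold; the threshold itself would have to be tuned and is not specified here.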