A Web page typically contains many information blocks. Apart from the main content blocks, it usually also contains blocks such as navigation panels, copyright and privacy notices, and advertisements. We call these noisy blocks. Noise in Web pages can seriously harm Web data mining. To address the problem of eliminating this noise, we introduce a new tree structure, called the Style Tree, and present an algorithm for constructing a site style tree. The Style Tree model is then used to detect and eliminate noise in any Web page of the site. We also define an information-based measure to determine which element nodes are noisy. In addition, applications of the method are discussed in detail. Experimental results show that our noise elimination technique improves the mining results significantly.
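As a rough illustration of the idea, the sketch below merges several pages of one site into a single tree and scores each element by how much its child layout varies across pages. The abstract gives neither the exact definition of the Style Tree nor the formula for the information-based measure, so everything here is an assumption made for illustration: the names `ElementNode`, `StyleNode`, `merge`, and `diversity`, the representation of a page as nested `(tag, children)` tuples, and the use of normalized entropy as the score. Under these assumptions, an element whose layout is identical on every page (for example a navigation panel) scores near 0 and is a candidate noisy block, while an element whose layout differs from page to page scores near 1.

```python
import math
from dataclasses import dataclass, field


@dataclass
class StyleNode:
    """One observed layout: a specific sequence of child tags (hypothetical name)."""
    count: int = 0                                # pages that use this layout
    children: list = field(default_factory=list)  # one ElementNode per child tag


@dataclass
class ElementNode:
    """One element position in the merged site tree (hypothetical name)."""
    tag: str
    pages: int = 0                                # pages that contain this element
    styles: dict = field(default_factory=dict)    # tuple of child tags -> StyleNode


def merge(node, page_elem):
    """Merge one page element, given as (tag, [children]), into the site tree."""
    _tag, kids = page_elem
    node.pages += 1
    key = tuple(kid[0] for kid in kids)
    style = node.styles.get(key)
    if style is None:
        # First time this child layout is seen under this element.
        style = StyleNode(children=[ElementNode(t) for t in key])
        node.styles[key] = style
    style.count += 1
    for child_node, kid in zip(style.children, kids):
        merge(child_node, kid)


def diversity(node):
    """Assumed measure: entropy of the layout distribution, log base = page count."""
    m = node.pages
    if m <= 1:
        return 1.0  # assumption: a single page gives no evidence of repetition
    entropy = -sum(
        (s.count / m) * math.log(s.count / m, m) for s in node.styles.values()
    )
    return max(0.0, entropy)


# Three toy pages: the nav block is identical everywhere, the div block varies.
pages = [
    ("body", [("nav", [("a", []), ("a", [])]), ("div", [("p", [])])]),
    ("body", [("nav", [("a", []), ("a", [])]), ("div", [("img", []), ("p", [])])]),
    ("body", [("nav", [("a", []), ("a", [])]), ("div", [("table", [])])]),
]

root = ElementNode("body")
for page in pages:
    merge(root, page)

nav_node, div_node = root.styles[("nav", "div")].children
```

Comparing `diversity(nav_node)` with `diversity(div_node)` shows the intended contrast between a repeated block and a varying one. A real implementation would also need to take the actual content and presentation attributes of each block into account, which this sketch ignores.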