With the arrival of the information age, computer networks are used ever more frequently and the volume of information on the Internet keeps growing. Internet users often find that search engine results contain a large amount of duplicate information. How to quickly and accurately detect pages with similar content and remove the duplicates is currently a central concern, and web page deduplication is an effective way to improve retrieval quality. This paper presents a web page deduplication system based on hashing and describes the concrete steps of its implementation. The algorithm achieves a high rate of correct judgments and has good application prospects in information retrieval.
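The abstract does not say which hashing scheme the system uses, so the following is only a minimal sketch of one common hash-based approach, not the paper's method. It assumes each page has already been reduced to plain text, hashes overlapping k-word shingles, and treats two pages as duplicates when the Jaccard similarity of their hash sets reaches a threshold. The function names, the shingle size `k=5`, and the threshold `0.8` are all hypothetical choices made for illustration.

```python
import hashlib
import re


def shingle_hashes(text, k=5):
    """Return the set of 64-bit hashes of all k-word shingles in text.

    Assumption: words can be split on non-word characters; text without
    spaces between words (e.g. Chinese) would need a segmenter or
    character n-grams instead.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        shingles = [" ".join(words)]
    else:
        shingles = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return {
        int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles
    }


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(pages, threshold=0.8, k=5):
    """Keep the first page of each near-duplicate group.

    pages maps a URL to its plain text. Returns the kept URLs in input order.
    The pairwise comparison is quadratic in the number of kept pages, which
    is acceptable for a sketch but not for a full search index.
    """
    kept = {}
    for url, text in pages.items():
        sig = shingle_hashes(text, k)
        if all(jaccard(sig, other) < threshold for other in kept.values()):
            kept[url] = sig
    return list(kept)
```

Called with a dictionary of URL to text, `deduplicate` returns the URLs that survive. Pages that differ only in case or punctuation produce identical shingle sets and collapse to the first one seen.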