Web page classification is an active research topic, and many classification algorithms have already been proposed. Among them, TFIDF is a weighting technique commonly used in information retrieval and data mining. In this paper, we use the TFIDF algorithm to extract highly discriminative feature words for each category. To classify a web page, we find the term (morpheme) that best represents the page and assign the page to a category according to that term's category information. Because the term-frequency calculation in TFIDF does not take web page structure into account, we improve it in this paper: we divide the page structure into classes and weight each occurrence of a term by the structural class in which it appears, so that the information in the page is used more reasonably.
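The approach summarized above can be illustrated with a minimal sketch. It is not the paper's implementation: the zone names (`title`, `heading`, `body`), the values in `ZONE_WEIGHTS`, the `top_k` cutoff, and the plain `log(N / df)` form of the inverse document frequency are all assumptions chosen for illustration, since the abstract does not state them. Each page is assumed to be already segmented into terms and grouped by structural zone.

```python
import math
from collections import defaultdict

# Hypothetical zone weights; the abstract does not give the actual values.
ZONE_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_tf(page):
    """Structure-weighted term frequency.

    `page` maps a zone name to the list of terms found in that zone.
    Each occurrence counts with the weight of its zone, and the result
    is normalized so that the frequencies sum to 1.
    """
    counts = defaultdict(float)
    for zone, terms in page.items():
        weight = ZONE_WEIGHTS.get(zone, 1.0)
        for term in terms:
            counts[term] += weight
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def idf(pages):
    """Inverse document frequency log(N / df) over the training pages."""
    doc_freq = defaultdict(int)
    for page in pages:
        for term in {t for terms in page.values() for t in terms}:
            doc_freq[term] += 1
    n = len(pages)
    return {t: math.log(n / d) for t, d in doc_freq.items()}


def train(labeled_pages, top_k=5):
    """Pick the top_k highest-scoring feature words of each category."""
    idf_table = idf([page for page, _ in labeled_pages])
    scores = defaultdict(lambda: defaultdict(float))
    for page, label in labeled_pages:
        for term, tf in weighted_tf(page).items():
            scores[label][term] += tf * idf_table[term]

    term_to_cat, best = {}, {}
    for label, term_scores in scores.items():
        ranked = sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True)
        for term, score in ranked[:top_k]:
            # A term shared by several categories goes to the one where
            # it scores highest; zero-score terms are never selected.
            if score > best.get(term, 0.0):
                best[term] = score
                term_to_cat[term] = label
    return idf_table, term_to_cat


def classify(page, idf_table, term_to_cat):
    """Return the category of the page's most representative feature word."""
    best_term, best_score = None, 0.0
    for term, tf in weighted_tf(page).items():
        if term in term_to_cat:
            score = tf * idf_table.get(term, 0.0)
            if score > best_score:
                best_term, best_score = term, score
    return term_to_cat.get(best_term)


training = [
    ({"title": ["football"], "body": ["match", "goal", "news"]}, "sports"),
    ({"title": ["league"], "body": ["football", "goal", "news"]}, "sports"),
    ({"title": ["stock"], "body": ["market", "price", "news"]}, "finance"),
    ({"title": ["market"], "body": ["stock", "bank", "news"]}, "finance"),
]
idf_table, term_to_cat = train(training)
new_page = {"title": ["football"], "body": ["stock", "news"]}
print(classify(new_page, idf_table, term_to_cat))
```

In this toy corpus, "football" and "stock" have the same inverse document frequency and each occurs once in the new page, so an unweighted term frequency could not separate them. Under the assumed zone weights, the occurrence in the title counts more, and the page is assigned to the category of "football".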