【Objective】 To address the low efficiency of traditional web page type identification methods by constructing simple data samples. 【Methods】 URL features serve as the basis for identification: URL information is extracted to build the training and test sets, and a support vector machine (SVM) is used to build a machine learning model that improves identification efficiency. 【Results】 On the same data set, the method achieves an accuracy of 91.2%, higher than that of the other identification methods, and improves efficiency by nearly 60%. 【Limitations】 For websites whose URL features are indistinct or even contrary to the expected patterns, identification accuracy drops sharply. 【Conclusions】 The method has a clear efficiency advantage and, when applied in a collection system, can improve collection efficiency.
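The abstract does not say which URL features are extracted, so the following is only a minimal sketch of the feature-extraction step under assumed features. The function name `url_features`, the six features (path depth, digit count, path length, HTML suffix, query string, list-like keyword), and the `example.com` URLs are all hypothetical illustrations, not the authors' feature set.

```python
import re
from typing import List
from urllib.parse import urlparse


def url_features(url: str) -> List[float]:
    """Map a URL to a small numeric vector (assumed, illustrative features)."""
    parsed = urlparse(url)
    path = parsed.path
    # Non-empty path segments, e.g. "/news/2019/a.html" -> ["news", "2019", "a.html"]
    segments = [s for s in path.split("/") if s]
    last = segments[-1] if segments else ""
    return [
        float(len(segments)),                                   # path depth
        float(sum(ch.isdigit() for ch in path)),                # digits in the path
        float(len(path)),                                       # path length
        1.0 if re.search(r"\.s?html?$", last) else 0.0,         # .htm/.html/.shtml suffix
        1.0 if parsed.query else 0.0,                           # has a query string
        1.0 if re.search(r"index|list", path, re.I) else 0.0,   # list-like keyword
    ]


if __name__ == "__main__":
    # Hypothetical URLs: one that looks like an article page, one like a list page.
    for u in ("http://example.com/news/2019/0412/12345.html",
              "http://example.com/news/index.html"):
        print(u, url_features(u))
```

Vectors of this kind, paired with page-type labels, could then be split into training and test sets and passed to any SVM implementation (for instance scikit-learn's `SVC`). Because only the URL string is parsed, no page content needs to be downloaded, which is consistent with the efficiency gain the abstract reports.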