Based on an analysis of how dynamic data is presented in Web pages, this paper proposes a new automated, structured data extraction method. First, a DOM-based algorithm quickly locates the data region, which avoids processing large amounts of noise data. Second, minimal DFS codes are introduced to represent DOM subtrees, and clustering is used to distinguish the record data regions. Finally, extraction rules are learned from a small number of sample pages and then applied to data extraction. A prototype system was used to extract data from pages of real websites, and the experimental results show that the method achieves high accuracy and efficiency.
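The idea of encoding DOM subtrees and grouping those with the same structure can be illustrated with a minimal sketch. It is not the paper's algorithm: the canonical string below (child encodings sorted before concatenation) is a simplified stand-in for a minimal DFS code, exact code equality stands in for clustering, and all names (`TreeBuilder`, `canonical_code`, `find_data_region`, `SAMPLE`) are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a bare tag tree; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root


def canonical_code(node):
    """Encode a subtree as a string that ignores sibling order.

    Assumption: sorting child encodings is a simplified substitute for
    the minimal DFS code mentioned in the abstract.
    """
    children = sorted(canonical_code(child) for child in node.children)
    return "(" + node.tag + "".join(children) + ")"


def find_data_region(root):
    """Return (count, parent, code) for the most repeated non-leaf child shape."""
    best = (0, None, None)
    stack = [root]
    while stack:
        node = stack.pop()
        counts = defaultdict(int)
        for child in node.children:
            if child.children:  # leaves carry no structure worth grouping
                counts[canonical_code(child)] += 1
        for code, n in counts.items():
            if n > best[0]:
                best = (n, node, code)
        stack.extend(node.children)
    return best


SAMPLE = """
<html><body>
  <div id="nav"><a>Home</a><a>About</a></div>
  <ul id="results">
    <li><h3>A</h3><span>1</span></li>
    <li><span>2</span><h3>B</h3></li>
    <li><h3>C</h3><span>3</span></li>
  </ul>
</body></html>
"""

count, parent, code = find_data_region(parse(SAMPLE))
print(count, parent.tag, code)  # 3 ul (li(h3)(span))
```

In this sample the three `li` records share one code even though the second lists its children in a different order, so the `ul` is picked as the data region and the navigation `div` is ignored. The sketch recomputes codes on every visit, which is quadratic in the worst case; a real implementation would cache them and would likely use a similarity threshold rather than exact equality.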