Data Extraction from the Web Based on Pre-Defined Schema

Source: Journal of Computer Science and Technology

Authors: 孟小峰 (Meng Xiaofeng), 陆宏钧 (Lu Hongjun), 王海燕 (Wang Haiyan), 谷明哲 (Gu Mingzhe)

Affiliation: School of Information, Department of Computer Science

Published in: Journal of Computer Science and Technology

Publication date: 2004

Keywords: data extraction, wrapper generation, data integration

Funding: National Natural Science Foundation of China

Excerpt from the paper

With the development of the Internet, the World Wide Web has become an invaluable information source for most organizations. However, most documents available on the Web are in HTML, which was originally designed for document formatting with little consideration of content. Effectively extracting data from such documents remains a nontrivial task. In this paper, we present a schema-guided approach to extracting data from HTML pages. Under this approach, the user defines a schema specifying what is to be extracted and provides sample mappings between the schema and the HTML page. The system then induces the mapping rules and generates a wrapper that takes the HTML page as input and produces the required data as XML conforming to the user-defined schema. A prototype system implementing the approach has been developed. Preliminary experiments indicate that the proposed semi-automatic approach is not only easy to use but also able to produce a wrapper that extracts the required data from input pages with high accuracy.
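The workflow described in the abstract (schema, sample mappings, induced rules, generated wrapper, XML output) can be illustrated with a small sketch. This is not the authors' system or their rule-induction algorithm, which the excerpt does not describe. It assumes a deliberately simple notion of a mapping rule: the tag path (with class names) of the text node the user pointed to on the sample page. The schema layout, the function names `induce_rules` and `make_wrapper`, and the sample pages are all hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record every non-empty text node together with its tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []  # list of (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def induce_rules(sample_html, sample_mapping):
    """Turn {field: example text on the sample page} into {field: tag path}."""
    nodes = text_nodes(sample_html)
    rules = {}
    for field, example in sample_mapping.items():
        paths = [path for path, text in nodes if text == example]
        if not paths:
            raise ValueError(f"example for {field!r} not found in sample page")
        rules[field] = paths[0]
    return rules


def make_wrapper(schema, rules):
    """Return a function that maps an HTML page to schema-shaped XML."""

    def wrapper(html):
        nodes = text_nodes(html)
        # One column of extracted values per schema field.
        columns = [[t for p, t in nodes if p == rules[f]] for f in schema["fields"]]
        root = ET.Element(schema["root"])
        for values in zip(*columns):  # the i-th values of all fields form one record
            record = ET.SubElement(root, schema["record"])
            for field, value in zip(schema["fields"], values):
                ET.SubElement(record, field).text = value
        return ET.tostring(root, encoding="unicode")

    return wrapper


# Hypothetical schema, sample page, and sample mapping supplied by the user.
SCHEMA = {"root": "books", "record": "book", "fields": ["title", "price"]}
SAMPLE = """<html><body>
<div class="book"><span class="title">Dune</span><span class="price">9.99</span></div>
</body></html>"""
PAGE = """<html><body>
<div class="book"><span class="title">Emma</span><span class="price">5.00</span></div>
<div class="book"><span class="title">Ulysses</span><span class="price">7.50</span></div>
</body></html>"""

rules = induce_rules(SAMPLE, {"title": "Dune", "price": "9.99"})
wrapper = make_wrapper(SCHEMA, rules)
print(wrapper(PAGE))
```

Exact tag-path matching only works when new pages share the sample page's layout, and pairing values by position assumes every record contains every field. A practical system would need more tolerant rules.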
