[Objective] To develop a parsing and indexing system for WARC web archive files so that the value of archived science and technology websites can be fully exploited. [Context] The WARC file format is widely used in web resource harvesting and archiving. As web content becomes more diverse, existing WARC indexing tools increasingly fail to meet users' varied query needs. [Methods] WARC files are parsed using a modular design. Commonly used indexing tools are compared, and the Solr platform is selected for building the full-text indexing system. [Results] The system provides content-based search and access to WARC files, and adds facets for subject classification, resource type, and archive time to the WARC index, exposing WARC file content along multiple dimensions. [Conclusions] The system gives users rich access to archived data from science and technology websites and improves the efficiency of search and access.
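
The abstract does not show the implementation, so the following is only a minimal sketch of the two steps it names: parsing WARC records and issuing a faceted Solr query. It assumes uncompressed WARC input (archived files are often gzip-compressed, which is not handled here). The function names and the Solr field names `subject`, `resource_type`, and `archive_date` are hypothetical and are not taken from the paper.

```python
import io
from urllib.parse import urlencode


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if not line.strip():
            continue  # skip the blank lines that separate records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"unexpected record start: {version!r}")
        headers = {"version": version}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def to_solr_doc(headers, payload):
    """Map one record to a flat document (hypothetical schema, not the paper's)."""
    return {
        "id": headers.get("WARC-Record-ID", ""),
        "url": headers.get("WARC-Target-URI", ""),
        "archive_date": headers.get("WARC-Date", ""),
        "content": payload.decode("utf-8", errors="replace"),
    }


def facet_query_params(text, facet_fields):
    """Build the query string for a faceted Solr search request."""
    params = [("q", text), ("facet", "true"), ("wt", "json")]
    params += [("facet.field", field) for field in facet_fields]
    return urlencode(params)


if __name__ == "__main__":
    sample = (
        b"WARC/1.0\r\n"
        b"WARC-Type: resource\r\n"
        b"WARC-Target-URI: http://example.org/\r\n"
        b"WARC-Date: 2016-01-01T00:00:00Z\r\n"
        b"Content-Length: 5\r\n"
        b"\r\n"
        b"hello\r\n\r\n"
    )
    for headers, payload in read_warc_records(io.BytesIO(sample)):
        print(to_solr_doc(headers, payload))
    print(facet_query_params("archive", ["subject", "resource_type", "archive_date"]))
```

Keeping the record reader, the document mapping, and the query builder as separate functions mirrors the modular parsing approach mentioned in the abstract: each piece can be replaced (for example, by a gzip-aware reader) without touching the others.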