To address the bottlenecks that traditional similarity calculation methods exhibit when processing massive volumes of information, namely limits on data scale and insufficient performance, this paper takes unstructured documents as its research object and proposes a document similarity calculation method built on a Hadoop distributed environment that combines the Hive data processing platform with the PostgreSQL relational database. It presents the key technical ideas, the concrete implementation steps, and an empirical study. The study shows that the Hive SQL language can effectively reduce the complexity of distributed data processing, although real-time performance still needs improvement.
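The abstract does not say which similarity measure the method uses. As a purely illustrative sketch, assuming term-frequency vectors compared by cosine similarity (a common choice, not confirmed by the source), the per-pair computation that a distributed join and aggregation would scale out might look like the following in plain Python. The function names and whitespace tokenization are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count lowercase whitespace-separated tokens (assumed tokenization)."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    # An empty document has no direction; treat its similarity as zero.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In a Hive setting, the dot product would correspond roughly to joining a per-document term-count table with itself on the term column and summing the products per document pair; that mapping is likewise an assumption rather than the paper's stated design.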