This paper presents an improved Apriori algorithm based on the Spark framework and a multi-way tree (multi-tree). The original transaction database is converted into a Boolean matrix and split into several partition databases, which are processed by the Worker nodes of Spark. Each worker stores its intermediate results as a multi-tree, and the master node then merges the partition multi-trees to obtain the global frequent itemsets. The Spark-based Apriori algorithm is compared with Apriori running in a Hadoop environment: for the same data volume, the Spark-based algorithm reduces execution time by more than 67% relative to the Hadoop-based one, and storing the intermediate results in a multi-tree reduces execution time by more than 44% relative to the version without it. The experiments show that Spark is better suited than Hadoop to algorithms such as Apriori that run as an iterative search, and that storing intermediate results in a multi-tree effectively improves execution efficiency.
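The pipeline described in the abstract (Boolean matrix, partitions, per-partition trees, merge on the master) can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Python instead of Spark, an ordinary loop stands in for the Worker nodes, and every name (`Node`, `count_partition`, `merge`, `apriori`, `n_parts`) as well as the use of an absolute support count `min_count` is an assumption made for illustration.

```python
from itertools import combinations

def to_boolean_matrix(transactions, items):
    """One row per transaction, one column per item (True if the item occurs)."""
    return [[item in t for item in items] for t in transactions]

class Node:
    """Multi-way tree node; the path from the root spells an itemset (column indices)."""
    def __init__(self):
        self.count = 0       # support counted for the itemset ending at this node
        self.children = {}   # column index -> child Node

def insert(root, itemset, count):
    node = root
    for col in itemset:
        node = node.children.setdefault(col, Node())
    node.count += count

def merge(dst, src):
    """Master step (assumed): add the counts of partition tree src into dst."""
    dst.count += src.count
    for col, child in src.children.items():
        merge(dst.children.setdefault(col, Node()), child)

def itemsets_at_depth(node, depth, prefix=()):
    """Yield (itemset, count) for every node exactly `depth` levels below `node`."""
    if depth == 0:
        yield prefix, node.count
        return
    for col, child in sorted(node.children.items()):
        yield from itemsets_at_depth(child, depth - 1, prefix + (col,))

def count_partition(rows, candidates):
    """Worker step (assumed): count each candidate in one partition of the matrix."""
    tree = Node()
    for cand in candidates:
        insert(tree, cand, sum(all(row[c] for c in cand) for row in rows))
    return tree

def apriori(transactions, min_count, n_parts=2):
    items = sorted({i for t in transactions for i in t})
    matrix = to_boolean_matrix(transactions, items)
    parts = [matrix[p::n_parts] for p in range(n_parts)]  # partition databases
    global_tree = Node()
    candidates = [(c,) for c in range(len(items))]
    frequent, k = {}, 1
    while candidates:
        for rows in parts:  # a plain loop stands in for the Spark workers
            merge(global_tree, count_partition(rows, candidates))
        level = [(s, n) for s, n in itemsets_at_depth(global_tree, k) if n >= min_count]
        for s, n in level:
            frequent[tuple(items[c] for c in s)] = n
        # Join frequent k-itemsets that share a (k-1)-prefix, then prune any
        # candidate that has an infrequent k-subset.
        keys = [s for s, _ in level]
        keyset = set(keys)
        candidates = [
            a + (b[-1],)
            for a, b in combinations(keys, 2)
            if a[:-1] == b[:-1]
            and all(sub in keyset for sub in combinations(a + (b[-1],), k))
        ]
        k += 1
    return frequent
```

Calling `apriori(transactions, min_count)` with a list of item sets returns a dictionary that maps each frequent itemset (a tuple of item names) to its global support count. Because the partition trees only carry counts, the merged result does not depend on how the rows are split; in a real Spark job the per-partition step would run on the workers and only the trees would travel to the driver.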