FrepJoin: an efficient partition-based algorithm for edit similarity join

Source: Frontiers of Information Technology & Electronic Engineering (English edition)


Authors: Ji-zhou LUO, Sheng-fei SHI, Hong-zhi WANG, Jian-zhong LI

Affiliations: School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China; Gu


Publication date: 2017, Issue 10


Abstract

String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. Existing algorithms usually adopt the filter-and-refine framework. They cannot capture the dissimilarity between string subsets and do not fully exploit statistics such as character frequencies. We develop a partition-based algorithm that uses such statistics. Frequency vectors are used to partition datasets into data chunks such that the dissimilarity between chunks is easily captured. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter leverages the statistics to avoid computing edit distances for a noticeable proportion of the candidate pairs that survive existing filters. Our algorithm notably outperforms alternative methods on real datasets.
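The abstract does not spell out the paper's filter, so the following is only a minimal sketch of the general idea of using character frequencies within a filter-and-refine check. It uses a standard frequency-based lower bound on edit distance (each insertion, deletion, or substitution can shrink the surplus of characters on either side by at most one), which is an assumption standing in for the paper's own filter. The names `frequency_lower_bound`, `edit_distance`, `similar`, and the threshold `tau` are hypothetical.

```python
from collections import Counter


def frequency_lower_bound(s: str, t: str) -> int:
    """Lower bound on the edit distance, computed from character counts only.

    Assumption: this is a generic frequency-distance bound, not the
    filter proposed in the paper.
    """
    fs, ft = Counter(s), Counter(t)
    surplus_s = sum((fs - ft).values())  # characters s has in excess of t
    surplus_t = sum((ft - fs).values())  # characters t has in excess of s
    return max(surplus_s, surplus_t)


def edit_distance(s: str, t: str) -> int:
    """Classic dynamic-programming Levenshtein distance (refine step)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # substitute or match
        prev = curr
    return prev[-1]


def similar(s: str, t: str, tau: int) -> bool:
    """Filter-and-refine: try the cheap bound first, then the exact distance."""
    if frequency_lower_bound(s, t) > tau:
        return False  # pruned without computing the edit distance
    return edit_distance(s, t) <= tau
```

The bound costs time linear in the string lengths, whereas the exact computation is quadratic, so pairs rejected by the bound never reach the expensive step. The bound can be loose: anagrams such as "abc" and "cba" have identical frequency vectors, so they always pass the filter and still need verification.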
