FrepJoin: an efficient partition-based algorithm for edit similarity join

Source: Frontiers of Information Technology & Electronic Engineering (English edition)
String similarity join (SSJ) is essential for many applications in which near-duplicate objects must be found. This paper targets SSJ under edit distance constraints. Existing algorithms usually adopt the filter-and-refine framework. They cannot capture the dissimilarity between subsets of strings, and they do not fully exploit statistics such as character frequencies. We develop a partition-based algorithm that uses such statistics. Frequency vectors are used to partition datasets into data chunks such that the dissimilarity between chunks can be captured easily. A novel algorithm is designed to accelerate SSJ over the partitioned data. A new filter is proposed that leverages the statistics to avoid computing edit distances for a noticeable proportion of the candidate pairs that survive the existing filters. Our algorithm notably outperforms alternative methods on real datasets.
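
To make the filter-and-refine idea concrete, here is a minimal sketch of how character frequencies can prune candidate pairs before any edit distance is computed. It is not the FrepJoin algorithm or its filter, which the abstract does not specify. It assumes the standard character-count lower bound: each edit operation changes the surplus and the deficit of characters by at most one each, so the larger of the two cannot exceed the edit distance. The names `frequency_lower_bound`, `edit_distance`, and `similarity_join` are hypothetical, and the join is a plain nested loop without any partitioning.

```python
from collections import Counter


def frequency_lower_bound(s: str, t: str) -> int:
    """Lower bound on the edit distance, from character counts alone (assumed filter)."""
    cs, ct = Counter(s), Counter(t)
    surplus = sum((cs - ct).values())  # characters s holds in excess of t
    deficit = sum((ct - cs).values())  # characters t holds in excess of s
    return max(surplus, deficit)


def edit_distance(s: str, t: str) -> int:
    """Plain dynamic-programming Levenshtein distance (the refine step)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity_join(strings, tau):
    """Self-join: return (pairs within distance tau, number of pairs verified)."""
    result, verified = [], 0
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            # Filter: skip the pair if the lower bound already exceeds tau.
            if frequency_lower_bound(strings[i], strings[j]) > tau:
                continue
            verified += 1
            if edit_distance(strings[i], strings[j]) <= tau:
                result.append((i, j))
    return result, verified


if __name__ == "__main__":
    words = ["kitten", "sitten", "sitting", "abcdef"]
    print(similarity_join(words, tau=1))
```

In this toy input, five of the six candidate pairs are discarded by the frequency bound, so the quadratic-time edit distance runs only once. A partition-based method would go further and discard whole groups of strings at a time instead of testing each pair.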