Text clustering is characterized by data sparsity, and common clustering methods rely on distance-based dissimilarity. To strengthen the features that distinguish documents, this paper proposes a method based on asymmetric similarity for measuring the association between document objects. An asymmetric similarity measure between text objects is defined. Exploiting the sparsity of the resulting asymmetric similarity matrix, text objects are clustered by partitioning them into strongly connected components, and the procedure is applied iteratively to build a concept hierarchy of the clustering results. Experimental results show that asymmetric similarity achieves higher accuracy and shorter execution time than distance-based dissimilarity; when the number of resulting clusters is small, accuracy improves by about 20%.
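The clustering step described above can be illustrated with a minimal Python sketch. The abstract does not give the paper's similarity definition, so the measure used here (the fraction of one document's terms that also occur in the other) is only an assumed stand-in, and the names `asymmetric_similarity`, `build_graph`, `cluster_documents`, and the `threshold` parameter are hypothetical. The strongly connected components are found with Kosaraju's algorithm, which may differ from the partition method the authors use. The iterative construction of the concept hierarchy is not shown.

```python
from collections import defaultdict


def asymmetric_similarity(a, b):
    """Assumed measure: fraction of a's terms that also occur in b.

    In general sim(a -> b) != sim(b -> a), which is what makes it asymmetric.
    """
    return len(a & b) / len(a) if a else 0.0


def build_graph(docs, threshold):
    """Add a directed edge i -> j when sim(i -> j) >= threshold."""
    graph = defaultdict(list)
    for i, a in enumerate(docs):
        for j, b in enumerate(docs):
            if i != j and asymmetric_similarity(a, b) >= threshold:
                graph[i].append(j)
    return graph


def strongly_connected_components(n, graph):
    """Kosaraju's algorithm (iterative) over nodes 0..n-1."""
    # Pass 1: record nodes in order of DFS completion.
    visited = [False] * n
    order = []
    for start in range(n):
        if visited[start]:
            continue
        visited[start] = True
        stack = [(start, iter(graph.get(start, [])))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if not visited[nxt]:
                    visited[nxt] = True
                    stack.append((nxt, iter(graph.get(nxt, []))))
                    break
            else:
                order.append(node)
                stack.pop()

    # Build the reversed graph.
    reverse = defaultdict(list)
    for u, targets in graph.items():
        for v in targets:
            reverse[v].append(u)

    # Pass 2: sweep the reversed graph in decreasing completion order.
    component = [-1] * n
    clusters = []
    for start in reversed(order):
        if component[start] != -1:
            continue
        label = len(clusters)
        component[start] = label
        members = [start]
        stack = [start]
        while stack:
            u = stack.pop()
            for v in reverse.get(u, []):
                if component[v] == -1:
                    component[v] = label
                    members.append(v)
                    stack.append(v)
        clusters.append(sorted(members))
    return clusters


def cluster_documents(docs, threshold):
    """Cluster documents (given as sets of terms) into strongly connected groups."""
    return strongly_connected_components(len(docs), build_graph(docs, threshold))
```

Calling `cluster_documents` on a list of term sets returns lists of document indices. Two documents land in the same cluster only when each can reach the other through edges that meet the threshold, so a one-directional resemblance alone does not merge them. Because each edge is decided by thresholding, most document pairs in a sparse collection contribute no edge at all, which is the property the abstract relies on.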