论文部分内容阅读
Clustering is one of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. In many domains where clustering is applied, some prior knowledge is available either in the form of labeled data(specifying the category to which an instance belongs) or pairwise constraints on some of the instances(specifying whether two instances should be in same or different clusters). The focus of our research is on semisupervised clustering, where we study how prior knowledge can be incorporated into clustering algorithms.Semi-supervised clustering aims to improve the clustering performance by considering user supervision in the form of pairwise constraints. However, most current algorithms are passive in the sense that pairwise constraints are provided beforehand and selected randomly. This may lead to the use of constraints that are redundant, unnecessary, or even harmful to the clustering results. For those reasons, we would like to optimize the selection of the constraints for semisupervised clustering. Moreover, semi-supervised clustering algorithms imposes several challenges to be addressed, such as dealing with multi-density data, how to handle the evolving patterns that are important characteristics of streaming data with dynamic distributions, capable of performing fast and incremental processing of data objects, and suitably addressing time and memory limitations.In this thesis, we consider three main contributions. The first contribution of this thesis, we consider batch-mode active learning for semi-supervised clustering algorithms in an iterative manner. First, we select a batch of informative query instances such that the distribution represented by the selected query set and the available labeled data is closest to the distribution represented by the unlabeled data. Then, we query them with the existing neighborhoods to determine which neighborhood they belong. The experimental results with state-of-the-art methods on different real world dataset demonstrate the effectiveness and efficiency of the proposed method.In the second contribution of this thesis, we address the problem of streaming data. Data stream mining is an active research area that has recently emerged to discover knowledge from large amounts of continuously generated data. We propose an algorithm that extending Affinity Propagation(AP) to handle evolving data steam with dynamic distributions. We present a semisupervised clustering technique(SSAPStream) that incorporates labeled exemplars into the APalgorithm to deal with changes in the data distribution, which requires the stream model to be updated as soon as possible. The experimental results on synthetic and real data sets validate the effectiveness of our algorithm in handling dynamically evolving data streams. Also, we study the execution time and memory usage of SSAPStream, which are important efficiency factors for streaming algorithms.The third contribution of this thesis addresses the problem of clustering multi-density data and arbitrary shapes. Density-based clustering methods are the most important due to their high ability to detect arbitrary shaped clusters. Existing methods are based on DBSCAN which is a typical density-based clustering algorithm and its clustering performance depends on two specified parameters(Eps and Minpts) that define a single density. Most of existing methods are unsupervised, which cannot utilize the small number of prior knowledge. We propose a semisupervised clustering(called Semi Den) algorithm that discovers clusters of different densities and arbitrary shapes. The idea of the proposed algorithm is to partition the dataset into different density levels and compute the density parameters for each density level set. Then, use the pairwise constraints for expanding the clustering process based on the computed density parameters. Evaluating Semi Den algorithm on both synthetic and real datasets confirms that the proposed algorithm gives better results than other semi-supervised and unsupervised density based approaches.