Semi-Supervised Clustering Algorithms for Streaming and Multi-Density Data

Source: Beijing Institute of Technology | Citations: 1
Clustering is one of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. In many domains where clustering is applied, some prior knowledge is available, either as labeled data (specifying the category to which an instance belongs) or as pairwise constraints on some of the instances (specifying whether two instances should be in the same cluster or in different clusters). Our research focuses on semi-supervised clustering, where we study how prior knowledge can be incorporated into clustering algorithms.

Semi-supervised clustering aims to improve clustering performance by considering user supervision in the form of pairwise constraints. However, most current algorithms are passive, in the sense that the pairwise constraints are provided beforehand and selected randomly. This may lead to the use of constraints that are redundant, unnecessary, or even harmful to the clustering results. For these reasons, we would like to optimize the selection of constraints for semi-supervised clustering. Moreover, semi-supervised clustering algorithms pose several challenges that must be addressed: dealing with multi-density data, handling the evolving patterns that characterize streaming data with dynamic distributions, processing data objects quickly and incrementally, and respecting time and memory limitations.

This thesis makes three main contributions. In the first contribution, we consider batch-mode active learning for semi-supervised clustering algorithms in an iterative manner. First, we select a batch of informative query instances such that the distribution represented by the selected query set together with the available labeled data is closest to the distribution represented by the unlabeled data. Then, we query them against the existing neighborhoods to determine which neighborhood each instance belongs to.
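The neighborhood query step can be illustrated with a minimal sketch. It assumes that a neighborhood is a group of instances already known to be must-linked, and that an oracle answers whether two instances belong together. The function name `assign_to_neighborhoods`, the `must_link` oracle, and the toy labels are hypothetical and are not taken from the thesis; the batch selection by distribution matching is not shown.

```python
from typing import Callable, Hashable, List, Sequence


def assign_to_neighborhoods(
    batch: Sequence[Hashable],
    neighborhoods: List[List[Hashable]],
    must_link: Callable[[Hashable, Hashable], bool],
) -> int:
    """Place each queried instance into a neighborhood.

    Returns the number of oracle queries that were spent.
    """
    queries = 0
    for x in batch:
        placed = False
        for hood in neighborhoods:
            # One representative per neighborhood is enough, because all
            # members of a neighborhood are already known to be must-linked.
            queries += 1
            if must_link(x, hood[0]):
                hood.append(x)
                placed = True
                break
        if not placed:
            # Cannot-link with every known neighborhood: x starts a new one.
            neighborhoods.append([x])
    return queries


# Toy oracle backed by hidden ground-truth labels (illustration only).
labels = {"a": 0, "b": 0, "c": 1, "d": 2, "e": 1}


def oracle(u: str, v: str) -> bool:
    return labels[u] == labels[v]


hoods = [["a"], ["c"]]
n_queries = assign_to_neighborhoods(["b", "e", "d"], hoods, oracle)
print(hoods)      # neighborhoods after the batch has been placed
print(n_queries)  # oracle queries spent
```

Each answer from the oracle yields a pairwise constraint: a positive answer is a must-link with every member of that neighborhood, and a negative answer is a cannot-link with all of them.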
Experimental comparisons with state-of-the-art methods on different real-world datasets demonstrate the effectiveness and efficiency of the proposed method.

In the second contribution of this thesis, we address the problem of streaming data. Data stream mining is an active research area that has recently emerged to discover knowledge from large amounts of continuously generated data. We propose an algorithm that extends Affinity Propagation (AP) to handle evolving data streams with dynamic distributions. We present a semi-supervised clustering technique (SSAPStream) that incorporates labeled exemplars into the AP algorithm to deal with changes in the data distribution, which require the stream model to be updated as soon as possible. The experimental results on synthetic and real datasets validate the effectiveness of our algorithm in handling dynamically evolving data streams. We also study the execution time and memory usage of SSAPStream, which are important efficiency factors for streaming algorithms.

The third contribution of this thesis addresses the problem of clustering multi-density data with arbitrary shapes. Density-based clustering methods are among the most important because of their strong ability to detect arbitrarily shaped clusters. Existing methods are based on DBSCAN, a typical density-based clustering algorithm whose clustering performance depends on two specified parameters (Eps and MinPts) that define a single density. Most existing methods are unsupervised and cannot make use of a small amount of prior knowledge. We propose a semi-supervised clustering algorithm (called SemiDen) that discovers clusters of different densities and arbitrary shapes. The idea of the proposed algorithm is to partition the dataset into different density levels and compute the density parameters for each density level set. Then, the pairwise constraints are used to expand the clustering process based on the computed density parameters.
Evaluating the SemiDen algorithm on both synthetic and real datasets confirms that it gives better results than other semi-supervised and unsupervised density-based approaches.
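The abstract does not say how the density levels or their parameters are computed. As a rough illustration only, the sketch below uses a common heuristic that is an assumption here, not the thesis's method: sort points by the distance to their k-th nearest neighbor, start a new density level wherever that distance jumps by more than a chosen factor, and take the largest k-distance in each level as its Eps. The names `k_dist` and `density_levels` and the parameters `k` and `jump` are hypothetical. The constraint-guided expansion step is not sketched.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def k_dist(points: Sequence[Point], k: int) -> List[float]:
    """Distance from each point to its k-th nearest neighbor (brute force)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


def density_levels(
    points: Sequence[Point], k: int = 3, jump: float = 2.0
) -> List[Tuple[float, List[int]]]:
    """Split points into density levels; return (eps, member indices) per level."""
    kd = k_dist(points, k)
    order = sorted(range(len(points)), key=lambda i: kd[i])
    levels: List[List[int]] = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        # A sharp rise in k-distance marks the start of a sparser level.
        if kd[cur] > jump * kd[prev]:
            levels.append([])
        levels[-1].append(cur)
    # Assumed rule: Eps of a level is the largest k-distance inside it.
    return [(max(kd[i] for i in level), level) for level in levels]


# Toy data: a dense 3x3 grid (spacing 0.1) and a sparse 3x3 grid (spacing 1.0).
dense = [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
sparse = [(10.0 + i, 10.0 + j) for i in range(3) for j in range(3)]
levels = density_levels(dense + sparse, k=3, jump=2.0)
for eps, members in levels:
    print(round(eps, 3), sorted(members))
```

On this toy data the dense and sparse grids end up in separate levels, each with its own Eps, which a single global DBSCAN setting cannot provide.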