Attribute selection criteria for decision tree construction have long been an active research topic in data mining. Building on an analysis of the attribute selection strategies of the ID3 and C4.5 algorithms, this study proposes two decision tree construction algorithms based on the average self-information and average mutual information of communication systems. We prove theoretically that the two proposed algorithms are equivalent to ID3 and C4.5, respectively: information gain is equivalent to the average mutual information of a communication system, and the information gain ratio is equivalent to the ratio of average mutual information to average self-information. Experiments on the AllElectronics dataset show that, compared with information gain and information gain ratio, the proposed attribute selection criteria are easier to compute and easier to understand.
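The equivalence stated in the abstract can be checked numerically. The following is a minimal sketch, not the authors' implementation: it computes information gain as H(C) - H(C|A) and average mutual information directly from the joint distribution, then compares the two. It assumes that "average self-information" refers to the entropy H(A) of the attribute, which is how C4.5 defines split information. The function names and the ten-row toy sample are hypothetical and are not taken from the AllElectronics dataset.

```python
import math
from collections import Counter


def entropy(values):
    """Average self-information H(X), in bits, of a sequence of discrete symbols."""
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())


def conditional_entropy(labels, attr):
    """H(C | A): entropy of the class labels within each attribute value,
    weighted by how often that attribute value occurs."""
    n = len(labels)
    groups = {}
    for a, c in zip(attr, labels):
        groups.setdefault(a, []).append(c)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain(labels, attr):
    """ID3 criterion: Gain(A) = H(C) - H(C | A)."""
    return entropy(labels) - conditional_entropy(labels, attr)


def mutual_information(labels, attr):
    """Average mutual information I(C; A), computed from the joint distribution:
    sum over (a, c) of p(a, c) * log2(p(a, c) / (p(a) * p(c)))."""
    n = len(labels)
    joint = Counter(zip(attr, labels))
    p_attr = Counter(attr)
    p_label = Counter(labels)
    return sum(
        (k / n) * math.log2((k / n) / ((p_attr[a] / n) * (p_label[c] / n)))
        for (a, c), k in joint.items()
    )


def gain_ratio(labels, attr):
    """C4.5 criterion: Gain(A) / SplitInfo(A), where SplitInfo(A) = H(A).
    Assumes the attribute takes at least two distinct values, so H(A) > 0."""
    return information_gain(labels, attr) / entropy(attr)


# Hypothetical toy sample (illustrative only, not the AllElectronics data).
attr = ["a", "a", "a", "b", "b", "b", "b", "c", "c", "c"]
labels = ["no", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no", "no"]

gain = information_gain(labels, attr)
mi = mutual_information(labels, attr)
ratio = gain_ratio(labels, attr)
mi_over_self_info = mi / entropy(attr)

print(f"information gain      = {gain:.6f}")
print(f"mutual information    = {mi:.6f}")
print(f"gain ratio            = {ratio:.6f}")
print(f"I(C;A) / H(A)         = {mi_over_self_info:.6f}")
```

On any sample the first two printed values should agree up to floating-point error, as should the last two, since I(C; A) = H(C) - H(C|A) is an identity of information theory rather than a property of a particular dataset.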