Objective: Understanding the spatiotemporal distribution of police incidents, building machine-learning models to predict them in space and time, and using those models to design evidence-based policing and prevention plans that curb crime are central concerns of crime geography. Previous studies show that police incidents concentrate in central urban districts and densely populated residential areas, so the data are imbalanced in both space and time. This imbalance typically causes a model trained on such data to be a weak learner with low prediction accuracy. To address the regression problem on imbalanced data, a Boosting algorithm based on K-means clustering is proposed.
Methods: The algorithm builds on the Boosting ensemble learning framework. A GA-BP neural network generates the base classifiers, and K-means clustering is used to combine them, with the goal of boosting weak learners into a strong learner.
Results: Compared with the Synthetic Minority Oversampling Technique Boosting algorithm (SMOTEBoosting), which is commonly used for regression on imbalanced data, the proposed algorithm has two advantages. 1) It lowers the mean square error (MSE) on the minority class while also lowering the overall MSE of the data: the overall MSE is 2.14E-04 for SMOTEBoosting and 9.85E-05 for KMeans-Boosting. 2) It strikes a better balance between precision and recall when identifying minority-class samples: the recall of KMeans-Boosting is about 52%, against about 91% for SMOTEBoosting, but its precision is 85%, far higher than the 19% of SMOTEBoosting.
Conclusion: The KMeans-Boosting algorithm significantly reduces the overall MSE on imbalanced data and improves the precision and recall of minority-class identification. It is an effective algorithm for regression and classification problems on imbalanced data and can be extended to other fields that need to handle imbalanced data.
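
One way to make the "better balance" claim in the Results concrete is to combine each algorithm's two minority-class figures into a single F1 score (their harmonic mean) and to express the overall MSE values as a ratio. The sketch below does only that arithmetic on the numbers quoted in the abstract. F1 is not reported in the abstract and is used here purely as an illustrative summary; treating the 85% and 19% figures as precision is an assumption based on their pairing with recall; and the helper name `f1_score` and the `reported` dictionary are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract. The recall values are approximate there,
# and reading 85% / 19% as minority-class precision is an assumption.
reported = {
    "KMeans-Boosting": {"precision": 0.85, "recall": 0.52, "mse": 9.85e-05},
    "SMOTEBoosting": {"precision": 0.19, "recall": 0.91, "mse": 2.14e-04},
}

# F1 per algorithm (not reported in the abstract; derived for illustration).
f1 = {name: f1_score(m["precision"], m["recall"]) for name, m in reported.items()}

# Ratio of overall mean square errors; values below 1 favour KMeans-Boosting.
mse_ratio = reported["KMeans-Boosting"]["mse"] / reported["SMOTEBoosting"]["mse"]

for name, value in f1.items():
    print(f"{name}: F1 = {value:.2f}")
print(f"Overall MSE ratio (KMeans-Boosting / SMOTEBoosting): {mse_ratio:.2f}")
```

Under these assumptions the derived F1 is about 0.65 for KMeans-Boosting and about 0.31 for SMOTEBoosting, and the overall MSE of KMeans-Boosting is about 46% of that of SMOTEBoosting. The lower recall of KMeans-Boosting is therefore outweighed, in F1 terms, by its much higher precision.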