To mine textual information more effectively, this paper studies three key problems in applying a two-step strategy to Chinese short text classification and proposes a two-step Chinese short text classification method that combines Naive Bayes (NB) and K-Nearest Neighbor (KNN) classifiers. (1) The outputs of NB and KNN are used directly to construct a corresponding two-dimensional space, and, according to the distribution of misclassified texts in that space, the test set is divided into three parts: text set A, which KNN can classify reliably; text set B, which KNN cannot classify reliably but NB can; and text set C, containing the remaining texts. (2) KNN and NB are used to classify text sets A and B respectively, and texts in set C are assigned labels directly according to the class distribution of the training corpus. Comparative experiments against NB, KNN, and Support Vector Machine (SVM) show that the method achieves relatively high classification performance.
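The routing described in steps (1) and (2) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the regions of the two-dimensional space are drawn, so the sketch assumes simple hypothetical thresholds (`knn_threshold`, `nb_threshold`) on per-text confidence scores, and it assumes that "according to the class distribution of the training corpus" means assigning the most frequent training class. The function name `two_step_predict` and all parameter names are illustrative.

```python
from collections import Counter


def two_step_predict(knn_pred, knn_conf, nb_pred, nb_conf, train_labels,
                     knn_threshold=0.8, nb_threshold=0.8):
    """Route each test text to KNN (set A), NB (set B), or a fallback (set C).

    knn_pred / nb_pred: labels predicted by each classifier, one per text.
    knn_conf / nb_conf: hypothetical confidence scores in [0, 1], one per text
        (for example, the share of neighbours voting for the KNN label and the
        NB posterior of the NB label).
    train_labels: labels of the training corpus, used only for the fallback.
    The thresholds are assumptions; the paper derives its regions from the
    distribution of misclassified texts instead.
    """
    # Assumed fallback for set C: the most frequent class in the training corpus.
    fallback_label = Counter(train_labels).most_common(1)[0][0]

    results = []
    for k_label, k_score, n_label, n_score in zip(knn_pred, knn_conf,
                                                  nb_pred, nb_conf):
        if k_score >= knn_threshold:
            # Set A: KNN is considered reliable for this text.
            results.append((k_label, "A"))
        elif n_score >= nb_threshold:
            # Set B: KNN is not reliable here, but NB is.
            results.append((n_label, "B"))
        else:
            # Set C: neither classifier is reliable; use the training prior.
            results.append((fallback_label, "C"))
    return results


if __name__ == "__main__":
    # Toy inputs, invented purely for illustration.
    predictions = two_step_predict(
        knn_pred=["sports", "tech", "tech"],
        knn_conf=[0.9, 0.4, 0.3],
        nb_pred=["finance", "tech", "sports"],
        nb_conf=[0.95, 0.85, 0.5],
        train_labels=["finance", "finance", "sports"],
    )
    for label, text_set in predictions:
        print(text_set, label)
```

Checking KNN first mirrors the order in the abstract, where set B is defined as the texts KNN cannot classify reliably but NB can. In practice the two thresholds would be replaced by whatever region boundaries the error distribution suggests.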