Text classification is the process of assigning a large number of texts to one or more categories of a given classification system, based on their content or attributes. With the rapid development of the coal industry, coal databases now hold large volumes of collected coal-related text data. The traditional SVM algorithm cannot process text data at this scale efficiently. Building on the widely used Hadoop distributed computing platform, this paper proposes a distributed SVM text classification algorithm. Experiments show that the proposed algorithm markedly reduces text classification time and scales well.
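The abstract does not spell out how the SVM is distributed, so the following is only a minimal sketch of one common pattern, not the algorithm proposed in the paper. It assumes a parameter-averaging scheme: each "map" task trains a linear SVM (Pegasos-style stochastic gradient descent on the hinge loss, without a bias term) on its own partition, and a "reduce" step averages the resulting weight vectors. The toy documents, the labels (+1 for safety, -1 for production), and every function name are hypothetical, and the map and reduce steps run in a single process rather than as a Hadoop job.

```python
import random
from collections import Counter


def featurize(text, vocab):
    """Bag-of-words counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def train_linear_svm(samples, dim, lam=0.01, epochs=200, seed=0):
    """Map step (assumed): Pegasos-style SGD on one partition.

    samples is a list of (x, y) pairs with y in {-1, +1}.
    Returns a weight vector of length dim (no bias term).
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        order = samples[:]
        rng.shuffle(order)
        for x, y in order:
            t += 1
            eta = 1.0 / (lam * t)       # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            shrink = 1.0 - 1.0 / t      # equals 1 - eta * lam (L2 regularization)
            w = [shrink * wi for wi in w]
            if margin < 1.0:            # hinge loss is active for this sample
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def average_models(models):
    """Reduce step (assumed): average the per-partition weight vectors."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


def predict(w, x):
    """Sign of the linear score; +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else -1


# Hypothetical toy corpus, already split into two partitions.
partitions = [
    [("gas explosion accident mine safety", 1),
     ("roof collapse accident safety report", 1),
     ("coal output production increase", -1),
     ("coal washing plant output", -1)],
    [("mine safety inspection gas alarm", 1),
     ("gas leak alarm safety drill", 1),
     ("production schedule coal transport", -1),
     ("transport capacity production plan", -1)],
]

vocab = sorted({w for part in partitions for text, _ in part for w in text.split()})

# "Map": train one local model per partition.
local_models = [
    train_linear_svm([(featurize(text, vocab), y) for text, y in part],
                     dim=len(vocab), seed=i)
    for i, part in enumerate(partitions)
]

# "Reduce": combine the local models into one classifier.
model = average_models(local_models)

print(predict(model, featurize("gas accident safety", vocab)))     # 1
print(predict(model, featurize("coal production output", vocab)))  # -1
```

In a real Hadoop deployment, the per-partition training would run inside mapper tasks over HDFS input splits and the averaging inside a reducer. Other combination schemes, such as retraining on the pooled support vectors, are equally plausible readings of "distributed SVM".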