The method uses a generalized suffix tree model (GSTM) for e-mail filtering. Drawing on the contextual information in the message content, it gathers variable-length n-gram statistics at every text position, measures how similar the message under test is to each training set, and assigns the message to a category accordingly. Theoretical analysis and experiments show that, on the same corpus, the method's precision and recall match or exceed those of filtering methods based on the vector space model. Filtering a message of length N takes O(N) time, and adding a new message of length N to the training set also takes O(N) time, so the training set can grow dynamically. The method requires no word segmentation and is entirely language-independent, which makes it suitable for mail collections in which several languages occur together.
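The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: it replaces a true suffix tree with a depth-bounded character suffix trie, and the similarity score (the mean length of the longest match starting at each text position) is an assumption chosen for simplicity. The names `SuffixTrie`, `max_depth`, `similarity`, and `classify` are hypothetical. With a constant `max_depth`, both adding a message and scoring a message take time linear in the message length, and because the trie works on raw characters no word segmentation is needed.

```python
class SuffixTrie:
    """Depth-bounded character suffix trie (a simplified stand-in for a suffix tree)."""

    def __init__(self, max_depth=8):
        self.max_depth = max_depth  # assumed cap on n-gram length
        self.root = {}              # each node is a dict: char -> child node

    def add(self, text):
        """Insert every suffix of `text`, truncated to max_depth characters."""
        for i in range(len(text)):
            node = self.root
            for ch in text[i:i + self.max_depth]:
                node = node.setdefault(ch, {})

    def match_length(self, text, start):
        """Length of the longest stored n-gram that begins at text[start]."""
        node = self.root
        depth = 0
        for ch in text[start:start + self.max_depth]:
            node = node.get(ch)
            if node is None:
                break
            depth += 1
        return depth

    def similarity(self, text):
        """Mean longest-match length over all positions (an assumed score)."""
        if not text:
            return 0.0
        total = sum(self.match_length(text, i) for i in range(len(text)))
        return total / len(text)


def classify(text, spam_trie, ham_trie):
    """Assign the label of the training set the text resembles more; ties go to ham."""
    if spam_trie.similarity(text) > ham_trie.similarity(text):
        return "spam"
    return "ham"


# Toy training sets mixing two languages; the data is invented for illustration.
spam = SuffixTrie()
spam.add("buy cheap pills now")
spam.add("免费赢取大奖")

ham = SuffixTrie()
ham.add("meeting moved to monday")
ham.add("请查收会议纪要")

print(classify("cheap pills", spam, ham))
print(classify("会议纪要", spam, ham))
```

New training messages are added by calling `add` again on the relevant trie, which mirrors the incremental training described in the abstract. A production version would likely use a linear-time suffix tree construction and a frequency-weighted score instead of the bounded trie shown here.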