Memory-Based Learning (MBL) treats the stored training data itself as the acquired knowledge and performs classification by similarity comparison. This sidesteps the obstacle that inadequate knowledge representation poses to machine-learned knowledge acquisition in word-level natural language processing. However, because natural language is so flexible, the attribute-logic representation on which MBL relies suffers from a fairly severe data sparseness problem, which has become the main bottleneck in applying MBL to natural language processing. To address this problem, this paper proposes an improved method that, through a trusted-distance discrimination mechanism, introduces into MBL the tf.idf term weighting used for document representation in the field of information extraction. Experiments show that the proposed method yields a considerable improvement in accuracy while keeping the training set at its original size.
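The abstract does not define the trusted distance or give the exact tf.idf formulation, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes that an instance is a fixed-length tuple of word attributes, that the baseline similarity is the positional overlap distance, that a nearest neighbour is "trusted" when its overlap distance does not exceed a threshold, and that untrusted queries fall back to a cosine comparison of tf.idf-weighted bags of words. The names `classify`, `max_trusted_distance`, and `MEMORY`, as well as the toy data, are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def overlap_distance(a, b):
    """Count the attribute positions on which two instances disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def idf_weights(instances):
    """Compute idf per attribute value, treating each stored instance as a document."""
    n = len(instances)
    df = Counter(value for inst in instances for value in set(inst))
    return {value: math.log(n / count) for value, count in df.items()}


def tfidf_vector(inst, idf):
    """Position-free tf.idf bag of words; values unseen in memory get weight 0."""
    tf = Counter(inst)
    return {value: count * idf.get(value, 0.0) for value, count in tf.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(query, memory, max_trusted_distance=1):
    """Classify by nearest neighbour; fall back to tf.idf when the match is not trusted.

    `memory` is a list of (features, label) pairs. The threshold
    `max_trusted_distance` is an assumed stand-in for the paper's
    trusted-distance criterion.
    """
    best_dist, best_label = min(
        ((overlap_distance(query, feats), label) for feats, label in memory),
        key=lambda pair: pair[0],
    )
    if best_dist <= max_trusted_distance:
        return best_label  # a close exact-attribute match exists: trust it

    # Sparse case: no stored instance matches position by position,
    # so compare tf.idf-weighted bags of words instead.
    idf = idf_weights([feats for feats, _ in memory])
    q_vec = tfidf_vector(query, idf)
    _, label = max(memory, key=lambda m: cosine(q_vec, tfidf_vector(m[0], idf)))
    return label


# Hypothetical toy memory: four-word contexts labelled with a sense tag.
MEMORY = [
    (("the", "bank", "of", "river"), "GEO"),
    (("the", "bank", "of", "england"), "FIN"),
    (("river", "water", "flows", "fast"), "GEO"),
    (("money", "loan", "interest", "rate"), "FIN"),
]

if __name__ == "__main__":
    print(classify(("the", "bank", "of", "river"), MEMORY))
    print(classify(("interest", "rate", "money", "rises"), MEMORY))
```

The second query shares no attribute position with any stored instance, which is the sparse-data situation the abstract describes: the overlap distance is maximal for every neighbour, yet the word-level tf.idf comparison can still relate it to the stored instance that contains the same words.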