Extracting cross-language semantic correspondences from bilingual parallel corpora is important for improving the performance of cross-language information retrieval. The two documents in a parallel pair share the same topics, and in a concrete model these shared topics appear as semantic correlation. This paper treats each pair of parallel documents as two language-specific representations of the same semantic content and constructs a latent semantic space for each language from a bilingual parallel corpus. On this basis it proposes a new bilingual topic model, the bilingual partial least squares topic correlation model. The new model addresses a weakness of the cross-language latent semantic indexing (CL-LSI) model, which does not fully account for the semantic links between the two languages. Experiments on a Chinese-English bilingual news corpus show that the new model clearly outperforms CL-LSI in both document-pair search and pseudo-query cross-language retrieval. On a TREC-9 bilingual parallel corpus obtained with Google Translate, the new model also achieves good retrieval performance.
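The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common partial least squares variant (PLS-SVD) applied to the idea described above: take the cross-product matrix of the two languages' term vectors over parallel documents, extract its leading singular vector pairs, and use them to project each language into its own latent space. The toy vocabularies, term counts, the helper names (`pls_svd`, `project`, `cosine`), the choice of k = 2, the absence of mean-centering, and the use of power iteration instead of a library SVD are all assumptions made for illustration, not details taken from the paper.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def pls_svd(X, Y, k, iters=200):
    """Return the top-k left/right singular vectors of C = X^T Y.

    X and Y are documents-by-terms matrices whose rows are aligned
    (row i of X and row i of Y are a parallel document pair).
    Vectors are found by power iteration with deflation.
    """
    C = [[sum(a * b for a, b in zip(xcol, ycol)) for ycol in transpose(Y)]
         for xcol in transpose(X)]
    U, V = [], []
    for _ in range(k):
        Ct = transpose(C)
        u = normalize([1.0] * len(C))
        for _ in range(iters):
            v = normalize(matvec(Ct, u))
            u = normalize(matvec(C, v))
        sigma = sum(ui * ci for ui, ci in zip(u, matvec(C, v)))
        # Deflate: remove the component just found before the next round.
        C = [[C[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
             for i in range(len(u))]
        U.append(u)
        V.append(v)
    return U, V

def project(doc, W):
    """Map a term vector into the latent space spanned by the rows of W."""
    return [sum(w * x for w, x in zip(wvec, doc)) for wvec in W]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Hypothetical toy parallel corpus: two finance pairs, two sports pairs.
# Chinese columns: 股票, 市场, 足球, 比赛
X_ZH = [[3, 1, 0, 0], [2, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
# English columns: stock, market, price, football, match
Y_EN = [[2, 1, 1, 0, 0], [1, 2, 1, 0, 0], [0, 0, 0, 2, 1], [0, 0, 0, 1, 3]]

U, V = pls_svd(X_ZH, Y_EN, k=2)

# Pseudo-query in Chinese containing only the term 足球 (football).
query_zh = [0, 0, 1, 0]
q_latent = project(query_zh, U)

# Score every English document against the Chinese query in latent space.
scores = [cosine(q_latent, project(doc, V)) for doc in Y_EN]
ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
print(ranking, [round(s, 3) for s in scores])
```

In this toy setup the Chinese query never shares a surface term with any English document, yet the two English sports documents receive the highest scores because the paired singular vectors tie each Chinese topic direction to its English counterpart. With only two latent dimensions the sketch separates topics but cannot tell apart documents within the same topic; a real system would use a much larger corpus, more latent dimensions, and typically weighted (for example TF-IDF) term vectors.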