Websites with a strong domain focus often contain large numbers of parallel or comparable bilingual samples. Building on this observation, this paper addresses the automatic identification of domain-specific bilingual websites and proposes a method based on global search and local classification. Taking the electronic devices domain as the target, the global search step yields 18,944 bilingual websites in this domain, of which 3,000 were randomly sampled and manually annotated. On the annotated corpus, the local classification step identifies bilingual websites in this domain with an F value of 85.19%. The bilingual sentence pairs from the identified target-domain bilingual websites are then used to expand the training set of a domain-specific machine translation system. Experimental results show that, on the same test set, the performance of the domain-specific machine translation system improves significantly, which verifies the effectiveness of the proposed method for automatically identifying domain-specific bilingual websites.
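For readers unfamiliar with the F value used to report classification performance, the following is a minimal sketch of how such a score is typically computed from annotated and predicted labels. It assumes the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state explicitly. The function name `f_measure` and the toy labels are hypothetical and are not data from the paper.

```python
def f_measure(gold, predicted):
    """Return (precision, recall, F1) for binary labels.

    True marks a website judged to be a bilingual website of the target
    domain. Assumes the balanced F1 measure; the paper's exact F
    definition is not given in the abstract.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # correctly identified
    fp = sum(1 for g, p in pairs if not g and p)    # wrongly identified
    fn = sum(1 for g, p in pairs if g and not p)    # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels invented for illustration only (not results from the paper).
gold = [True, True, True, False, False, True]
predicted = [True, True, False, True, False, True]

precision, recall, f1 = f_measure(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
```

In an evaluation like the one described, `gold` would hold the manual annotations of the 3,000 sampled websites and `predicted` the output of the local classifier.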