Information processing for low-resource, morphologically rich languages such as Uyghur is important for meeting the strategic need for language connectivity under the "Belt and Road" initiative. These languages express syntactic and semantic relations by combining morphemes, which creates a severe data sparsity problem for language processing. This paper proposes a Uyghur morphological segmentation method based on a bidirectional gated recurrent unit (GRU) neural network. The method automatically segments each Uyghur word into a sequence of morphemes, which alleviates data sparsity. The bidirectional GRU network makes full use of context in both directions to resolve segmentation ambiguity, and its gated recurrent units handle long-distance dependencies effectively. Experimental results show that the method achieves significant performance gains over mainstream statistical methods and over a unidirectional GRU network. The method is largely language-independent and can be applied to other morphologically rich languages.
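To make the idea concrete, the following is a minimal sketch of morphological segmentation treated as character-level tagging with a bidirectional gated recurrent unit (GRU) network. It is not the paper's implementation. The B/I labelling scheme (B starts a morpheme, I continues one), the layer sizes, the omitted biases, the random untrained weights, and every function name here are assumptions made for illustration. The GRU update follows one common formulation, and the Latin-script word "kitablar" is used only as an illustrative input. Because the weights are untrained, the predicted labels carry no linguistic meaning; the sketch only shows how forward and backward states are combined per character and how labels are turned into morphemes.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Multiply matrix W (list of rows) by vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class GRUCell:
    """One GRU cell with update gate z and reset gate r (biases omitted)."""
    def __init__(self, n_in, n_hid, rng):
        self.n_hid = n_hid
        self.Wz, self.Uz = rand_matrix(n_hid, n_in, rng), rand_matrix(n_hid, n_hid, rng)
        self.Wr, self.Ur = rand_matrix(n_hid, n_in, rng), rand_matrix(n_hid, n_hid, rng)
        self.Wh, self.Uh = rand_matrix(n_hid, n_in, rng), rand_matrix(n_hid, n_hid, rng)

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.Wh, x), matvec(self.Uh, gated_h))]
        # Interpolate between the previous state and the candidate state.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def run_gru(cell, xs):
    h, states = [0.0] * cell.n_hid, []
    for x in xs:
        h = cell.step(x, h)
        states.append(h)
    return states

def bigru(fwd, bwd, xs):
    # Left-to-right pass, plus a right-to-left pass re-aligned to input order.
    left = run_gru(fwd, xs)
    right = run_gru(bwd, xs[::-1])[::-1]
    return [l + r for l, r in zip(left, right)]  # concatenate per character

def tag_word(word, n_emb=4, n_hid=3, seed=0):
    rng = random.Random(seed)
    emb = {ch: [rng.uniform(-0.5, 0.5) for _ in range(n_emb)]
           for ch in sorted(set(word))}
    fwd, bwd = GRUCell(n_emb, n_hid, rng), GRUCell(n_emb, n_hid, rng)
    W_out = rand_matrix(2, 2 * n_hid, rng)  # row 0 scores "B", row 1 scores "I"
    states = bigru(fwd, bwd, [emb[ch] for ch in word])
    labels = []
    for i, s in enumerate(states):
        score_b, score_i = matvec(W_out, s)
        # The first character always starts a morpheme.
        labels.append("B" if i == 0 or score_b > score_i else "I")
    return "".join(labels), states

def labels_to_morphemes(word, labels):
    morphemes = []
    for ch, lab in zip(word, labels):
        if lab == "B" or not morphemes:
            morphemes.append(ch)
        else:
            morphemes[-1] += ch
    return morphemes

if __name__ == "__main__":
    labels, _ = tag_word("kitablar")
    print(labels, labels_to_morphemes("kitablar", labels))
    # Hand-written labels, standing in for what a trained model might output:
    print(labels_to_morphemes("kitablar", "BIIIIBII"))
```

Each character's state concatenates a summary of everything to its left with a summary of everything to its right, which is what lets a tagger of this kind use both sides of a candidate boundary when deciding where a morpheme ends. A unidirectional model would see only the left side.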