Speech data resources are the foundation of speech recognition research. At present, only a handful of open speech databases in China are freely available to researchers, and resources are even scarcer for minority languages such as Uyghur. This paper releases a completely free Uyghur continuous speech database comprising about 20 h of training data and 1 h of test data. It also describes the associated resources needed to build a Uyghur speech recognition system, including the phoneme set, word list, and text data, as well as the scripts used to build a baseline system. The recognition performance of this baseline system on clean and noisy test data is reported. The database offers a standard reference database for Uyghur speech recognition research.