Department of Electronics and Communication Engineering, Beijing Institute of Technology, Haidian, Beijing, China
Abstract: In speech recognition, language identification (LID) refers to the process of using a computer to automatically identify the language of a spoken utterance. It plays an increasingly important role in applications such as multilingual conversation systems, spoken language translation systems, and multilingual information retrieval systems [1]. The main task of a language identifier is to design an efficient algorithm that helps a machine correctly identify a particular language from a given audio sample. Researchers have placed considerable emphasis on language identification, and over the last two decades there has been significant progress in this area. A human can identify a familiar language better than a machine, but with machine learning a computer can be trained to identify as many languages as it is given as input, whereas human beings can identify at most 10-15 languages [2]. In this paper, we discuss language identification in MATLAB for three languages based on our standard database. After extracting a set of features using Mel-frequency cepstral coefficients (MFCC), we train the system using vector quantization and, for better classification, use a Gaussian mixture model (GMM).
Keywords: MFCC, GMM, VQ, K-SVD
1. Introduction
Today, when we call most large companies, a person doesn't usually answer the phone. Instead, an automated voice recording answers and instructs you to press buttons to move through option menus. Many companies have moved beyond requiring you to press buttons, though: often you can just speak certain words (again, as instructed by a recording) to get what you need. The system that makes this possible is a type of speech recognition program, an automated phone system.
People with disabilities that prevent them from typing have also adopted speech recognition systems. If a user has lost the use of his hands, or if a visually impaired user cannot conveniently use a Braille keyboard, such systems allow personal expression through dictation as well as control of many computer tasks. Some programs save users' speech data after every session, allowing people with progressive speech deterioration to continue to dictate to their computers [3].
Current programs fall into two categories.
Small-vocabulary/many-users: these systems are ideal for automated telephone answering. Users can speak with a great deal of variation in accent and speech pattern, and the system will still understand them most of the time. However, usage is limited to a small number of predetermined commands and inputs, such as basic menu options or numbers.
Large-vocabulary/limited-users: these systems work best in a business environment where a small number of users will work with the program. While they achieve a good degree of accuracy (85 percent or higher with an expert user) and have vocabularies in the tens of thousands, you must train them to work best with a small number of primary users, and the accuracy rate falls drastically with any other user.
Speech recognition systems made more than 10 years ago also faced a choice between discrete and continuous speech. It is much easier for a program to understand words when we speak them separately, with a distinct pause between each one. However, most users prefer to speak at a normal, conversational speed. Almost all modern systems are capable of understanding continuous speech.
How it Works
To convert speech to on-screen text or a computer command, a computer has to go through several complex steps. When we speak, we create vibrations in the air. An analog-to-digital converter (ADC) translates this analog wave into digital data that the computer can understand. To do this, it samples, or digitizes, the sound by taking precise measurements of the wave at frequent intervals. The system filters the digitized sound to remove unwanted noise, and sometimes separates it into different frequency bands (frequency is the number of wave cycles per second, heard by humans as differences in pitch). It also normalizes the sound, adjusting it to a constant volume level. The sound may also have to be temporally aligned: people don't always speak at the same speed, so the sound must be adjusted to match the speed of the template sound samples already stored in the system's memory.
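As a minimal MATLAB sketch of the digitization and normalization steps just described (the file name speech.wav is a placeholder, and peak normalization is only one simple way to fix the volume level):

    [x, fs] = audioread('speech.wav');  % samples in [-1, 1] and sampling rate in Hz
    x = mean(x, 2);                     % mix down to mono if the recording is stereo
    x = x / max(abs(x));                % normalize to a constant peak volume level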
Speech Recognition and Statistical Modeling
Early speech recognition systems tried to apply a set of grammatical and syntactical rules to speech. If the words spoken fit into a certain set of rules, the program could determine what the words were. However, human language has numerous exceptions to its own rules, even when it's spoken consistently. Accents, dialects, and mannerisms can vastly change the way certain words or phrases are spoken. Imagine someone from Boston saying the word "barn": he wouldn't pronounce the "r" at all, and the word comes out rhyming with "John." Or consider the sentence "I'm going to see the ocean." Most people don't enunciate their words very carefully, so the result might come out as "I'm goin' da see tha ocean," with several words run together with no noticeable break, such as "I'm goin'" and "the ocean." Rules-based systems were unsuccessful because they couldn't handle these variations. This also explains why earlier systems could not handle continuous speech: you had to speak each word separately, with a brief pause in between.
How to Extract a Feature Vector (MFCC)
In general, a feature vector is a list of values (numbers) that captures the features of our signal relevant to some specific task (here, use as input to a speech recognition algorithm) in an efficient and expressive way.
A concrete example: suppose that in the first step of our procedure we divide our audio signal (say, a 24 kHz mono signal) into frames (fragments of fixed length, say 50 ms). We are now going to build an appropriate 'feature vector' for each of these frames.
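A sketch of this framing step, assuming the normalized signal x and sampling rate fs from the earlier sketch (fs = 24000 gives 1200 samples per 50 ms frame):

    frameLen  = round(0.050 * fs);            % 50 ms of samples per frame
    numFrames = floor(length(x) / frameLen);  % drop any trailing partial frame
    frames = reshape(x(1:numFrames*frameLen), frameLen, numFrames)';
    % each row of 'frames' is now one frame of 1200 samples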
A frame here is composed of 1200 samples (50 ms at 24 kHz), which we might store in a row vector in MATLAB. We could consider this vector already a 'feature vector': it certainly represents the audio signal, since each number is the audio intensity at an instant in time. But this 'trivial' vector is not very appropriate, both because there are too many numbers and because the numbers are not in themselves very 'expressive'. We want to distinguish a vowel from a consonant, for example, and these 1200 numbers say little about that; the same speaker saying the same vowel will probably produce vectors that are very different. We don't want that.
A first transformation that gives us a more useful feature vector is the Fourier transform (or rather, the spectrogram) of the audio; we probably have a basic idea of it from music graphic equalizers, etc. Instead of a 'vector' of 1200 samples (one per instant of time), we now have a 'vector' of, say, 128 numbers that tell us how much energy the audio has in each 'frequency band' (always within the frame). This is more efficient (fewer numbers) and more expressive: perhaps we can start roughly distinguishing vowels from consonants, male from female voices, etc., just by looking at these numbers.
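A sketch of this step for one frame follows; grouping the FFT bins into 128 equal-width bands is a deliberate simplification of a real filterbank:

    f = frames(1, :);                                        % one 50 ms frame
    n = numel(f);
    w = f .* (0.54 - 0.46 * cos(2*pi*(0:n-1)/(n-1)));        % Hamming window
    S = abs(fft(w));                                         % magnitude spectrum
    S = S(1:floor(n/2));                                     % non-redundant half: 600 bins
    numBands = 128;
    edges = round(linspace(1, numel(S) + 1, numBands + 1));  % band boundaries
    bands = arrayfun(@(k) mean(S(edges(k):edges(k+1)-1)), 1:numBands);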
From this, other transformations follow (mel: change the scale of the frequencies; cepstrum: log followed by an inverse Fourier transform, or rather a DCT, which is conceptually similar here). Finally, we trim the least important elements from our feature vector. Each step gives us a different feature vector, hopefully more efficient and expressive than the previous one. The whole procedure can sound a little complex and esoteric if you are not familiar with it, but conceptually, from the point of view of understanding what it means to compute a suitable 'feature vector', these last steps are analogous to the first one.
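Conceptually, these remaining steps can be sketched in two lines; here the uniform band energies computed above stand in for a proper mel-spaced filterbank, and dct assumes the Signal Processing Toolbox:

    c = dct(log(bands + eps));  % log compression followed by a DCT: the cepstrum step
    coeffs = c(1:13);           % trim: keep only the most significant coefficients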
The next phase of this paper is deciding which algorithm is suitable for language identification; we have chosen the GMM (Gaussian Mixture Model). NB: we use absolute energy (1) and MFCCs (12) (together often referred to as the absolute coefficients), plus the first- and second-order derivatives of these absolute coefficients, to get a basic 39-dimensional MFCC front end, broken down in the table below (a sketch of the delta computation follows the table).
Dimensions  Component
13          Absolute energy (1) and MFCCs (12)
13          Delta: first-order derivatives of the thirteen absolute coefficients
13          Delta-delta: second-order derivatives of the thirteen absolute coefficients
39          Total: basic MFCC front end
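A sketch of the delta computation under a simple first-difference approximation (production front ends usually use a regression over several neighbouring frames); C is assumed to be a 13-by-T matrix holding energy plus 12 MFCCs for T frames:

    d1 = [zeros(13, 1), diff(C, 1, 2)];   % delta: first-order differences over time
    d2 = [zeros(13, 1), diff(d1, 1, 2)];  % delta-delta: second-order differences
    F39 = [C; d1; d2];                    % 39-by-T basic MFCC front end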
Gaussian Mixture Model
A GMM is trained with the EM optimization technique, a clustering-based learning process. When classifying with a GMM, the EM (Expectation-Maximization) algorithm runs in the background to find maximum-likelihood parameter estimates; it performs a many-to-one mapping from an underlying distribution. The EM algorithm consists of two major steps: an E (Expectation) step followed by an M (Maximization) step [2]. Starting from an initial parameter set, the Expectation step is computed with respect to the unknown underlying variables, using the current parameter estimates and conditioned on the observations; the Maximization step then provides new parameter estimates, and the two steps are iterated until convergence. For d dimensions, the Gaussian distribution of a vector x is defined by:

N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right)

where \mu is the mean vector and \Sigma is the covariance matrix of the Gaussian. The probability given by a mixture of N Gaussians is:

p(x) = \sum_{i=1}^{N} w_i \, N(x; \mu_i, \Sigma_i)

where N is the number of Gaussians and w_i is the weight of Gaussian i, with \sum_{i=1}^{N} w_i = 1 and w_i \geq 0.
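As a sketch of fitting such a mixture in MATLAB, fitgmdist (Statistics and Machine Learning Toolbox) runs EM internally, starting from an initial parameter set and iterating the E and M steps until convergence; the mixture size of 8 here is illustrative only:

    X = F39';                                       % one row per 39-dimensional vector
    gm = fitgmdist(X, 8, 'CovarianceType', 'diagonal', ...
         'RegularizationValue', 1e-6, 'Options', statset('MaxIter', 200));
    % gm.mu, gm.Sigma and gm.ComponentProportion hold the mu_i, Sigma_i and w_i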
Selecting the number of Gaussian mixture components is essential for designing a good GMM system. For example, in both the hybrid and the GMM experiments considered here, we have used Gaussian mixture models of up to 1024 components. The GMM classification method is also used in image recognition, computer vision, and speech recognition [2].
During recognition, an unknown utterance is compared to each of the GMMs. The likelihood that the unknown utterance was spoken in the same language as the speech used to train each model is computed, and the most likely model determines the hypothesized language.
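A minimal sketch of this decision rule, where gmLang1, gmLang2, gmLang3 are placeholder names for the three trained models and Xtest holds the unknown utterance's feature vectors (one per row):

    models = {gmLang1, gmLang2, gmLang3};
    logL = zeros(1, numel(models));
    for k = 1:numel(models)
        logL(k) = sum(log(pdf(models{k}, Xtest)));  % total log-likelihood under model k
    end
    [~, best] = max(logL);                          % index of the hypothesized language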
Experiment Setup
Architecture of our System
Database: words taken for training and testing in the three languages
Conclusion
Our audio data is recorded and saved in a folder.
We then use the MFCC feature extraction approach to extract the necessary information from the audio data in this folder.
After that we use our algorithm of choice, GMM, to train our models from the extracted and saved MFCC data. In the recognition step, the MFCC feature vectors are compared against the trained GMM models to find similarities in their properties. This comparison uses vector quantization, a process that searches for the most similar stored feature vector to represent each input feature vector. The codebook is constructed in advance by collecting and processing a sufficient number of feature vectors. This step uses k-means clustering (the vector-quantization method that K-SVD generalizes) to find the similarities.
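A sketch of the codebook construction and lookup with MATLAB's kmeans (Statistics and Machine Learning Toolbox); the codebook size of 64 is illustrative:

    % build a 64-entry codebook from the training vectors in X (one per row)
    [~, codebook] = kmeans(X, 64, 'Replicates', 3, 'MaxIter', 300);
    % quantize one test vector: index of the nearest codebook entry
    [~, idx] = min(sum((codebook - Xtest(1, :)).^2, 2));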
In short, recognition is a simple step: we look at the results in the command window. If they do not correspond to our audio data, we conclude that the system could not recognize it; if they do correspond, we say it recognized it.
We can improve the accuracy rate by re-doing our recordings more slowly and clearly. Retraining the models and accounting for background noise are also things we need to consider if we want highly efficient and accurate results. Below is a list of factors to consider for good accuracy.
Interface of experiment results, showing extraction, training, and recognition processing and results in MATLAB.
Weaknesses and Flaws
No speech recognition system is 100 percent perfect; several factors can reduce accuracy. Some of these factors continue to improve as the technology improves; others can be lessened, if not completely corrected, by the user [3]. The flaws and weaknesses below are factors we need to take into consideration:
Low signal-to-noise ratio
Overlapping speech
Intensive use of computer power
Homonyms
References:
[1] M. A. Zissman, "Automatic Language Identification of Telephone Speech."
[2] P. Roy and P. K. Das, "A hybrid VQ-GMM approach for identifying Indian languages."
[3] "How Speech Recognition Works," HowStuffWorks. http://electronics.howstuffworks.com/gadgets/high-tech-gadgets/speech-recognition.htm