Speech emotion recognition (SER) has been attracting considerable attention due to the impact of artificial intelligence on the research arena. It has opened up various fields of research in computer science that aim to improve the way humans and machines interact. One significant focus of recent research is the recognition of human emotions by machines; this type of communication is therefore closely tied to the information sciences. Different techniques are being adopted for detecting emotional states in vocal expressions, recognizing native speakers, and related tasks. The goal of SER is thus to recognize the correct emotional state of a speaker. This is difficult in several ways: emotions have fuzzy temporal boundaries, they are expressed differently by each person, and a single utterance may contain more than one emotion. The large gap between accuracy rates achieved on different types of speech datasets raises questions about how emotions modulate speech.

To highlight the challenges of emotion in speech, this thesis presents the architecture and some of the key layers of SER systems. The main focus is on seeking informative features for emotion classes using deep neural networks (DNNs) and recurrent neural networks (RNNs), both of which have served as key solutions for SER models. Various combinations of convolutional neural networks (CNNs) and RNNs have been proposed in the SER field, and the resulting convolutional recurrent neural network (CRNN) has proven to be a robust architecture in recent years. The pros and cons of each layer of the CRNN architecture are also discussed. This thesis therefore focuses on robust SER using a deep CRNN with double LSTM layers, also called a stacked LSTM.

The main objective of this thesis is to design and implement a robust architecture for emotion recognition in speech. By adopting deep learning techniques together with the CRNN, a DCRNN architecture is designed and implemented for robust SER. The DCRNN architecture consists of the log-Mel spectrogram, a CNN layer, a double LSTM (i.e., stacked LSTM), a fully connected layer, and a softmax layer. The log-Mel spectrogram represents the speech signal on the perceptually motivated Mel frequency scale, whose filter banks are spaced approximately linearly at low frequencies and logarithmically at high frequencies, with a logarithmic compression of the band energies. The primary investigation is carried out in the CNN layer, where regression steps are trained to obtain the optimal kernel size and max pooling for the extracted speech features. An RNN with two LSTM layers is then adopted. The LSTM mitigates the long-term dependency problem and avoids vanishing gradients when trained with backpropagation through time, yielding an effective model. The two LSTM layers serve as hidden layers within the network that extract higher-level features; the primary reason for adopting two LSTM layers is to extract speech features more accurately than a single traditional LSTM. Afterwards, the fully connected layer learns from all combinations of the features produced by the previous layer, and the softmax layer converts the network's output into a probability distribution over the emotion classes. As demonstrated in the experimental results, the DCRNN architecture outperforms traditional learning architectures for SER.
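As a rough sketch of the log-Mel front end described above (a minimal illustration, not the thesis's exact configuration: the 16 kHz sampling rate, 25 ms window, 10 ms hop, and 64 Mel bands are all assumptions, and librosa is used here purely for convenience), the feature extraction can look as follows:

```python
import librosa
import numpy as np

def log_mel_spectrogram(path, sr=16000, n_mels=64):
    """Compute a log-Mel spectrogram; all frame sizes are illustrative assumptions."""
    y, sr = librosa.load(path, sr=sr)        # load and resample to a fixed rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=400,                           # 25 ms analysis window at 16 kHz (assumed)
        hop_length=160,                      # 10 ms hop (assumed)
        n_mels=n_mels)                       # number of Mel bands (assumed)
    return librosa.power_to_db(mel, ref=np.max)  # logarithmic compression in dB
```

The nonlinear Mel filter bank is what distinguishes this representation from a plain linear-frequency spectrogram, and the final dB conversion supplies the "log" in log-Mel.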
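The DCRNN pipeline itself can be sketched in the same spirit. The following PyTorch module is one minimal reading of the architecture described above (CNN layer, two stacked LSTM layers, fully connected layer, softmax); the channel count, the 3x3 kernel, the 2x2 max pooling, the hidden size of 128, and the class count are illustrative assumptions rather than the tuned values the thesis reports:

```python
import torch
import torch.nn as nn

class DCRNN(nn.Module):
    """Sketch of the DCRNN: CNN -> stacked LSTM (2 layers) -> FC -> softmax."""
    def __init__(self, n_mels=64, n_classes=4):
        super().__init__()
        # CNN layer: kernel size and pooling are the quantities the thesis
        # tunes via regression; 3x3 kernels and 2x2 pooling here are assumptions.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Stacked (double) LSTM: num_layers=2 gives two LSTM layers, the
        # second consuming the hidden-state sequence of the first.
        self.lstm = nn.LSTM(input_size=32 * (n_mels // 2),
                            hidden_size=128, num_layers=2,
                            batch_first=True)
        self.fc = nn.Linear(128, n_classes)          # fully connected layer

    def forward(self, x):                  # x: (batch, 1, n_mels, time)
        z = self.cnn(x)                    # (batch, 32, n_mels//2, time//2)
        z = z.permute(0, 3, 1, 2)          # make time the sequence axis
        z = z.flatten(2)                   # (batch, time//2, 32 * n_mels//2)
        out, _ = self.lstm(z)              # two stacked LSTM layers
        logits = self.fc(out[:, -1])       # last time step -> class scores
        # Softmax layer: probabilities per emotion class. For training with
        # nn.CrossEntropyLoss one would return the raw logits instead.
        return torch.softmax(logits, dim=-1)
```

Setting num_layers=2 in nn.LSTM is what realizes the stacked LSTM here: the second layer takes the hidden states of the first as its input sequence, which is the mechanism the thesis credits with extracting higher-level features than a single LSTM.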