Abstract: The content of an email is often very short, but its language style is distinctive. We therefore expect that, given an ideal sample, part of the text style can be used to identify the author of the text. As feature values we use the proportion of short words in an email, the ratio of word types, the average word length, the mean and variance of lexical density, and the ratio of the maximum number of uses of a single word. Principal component analysis of these features yields two principal components, which reflect word density and vocabulary non-repetition. Taking the two principal components as the independent and dependent variables, we draw a scatter diagram for each author and find that the diagrams follow certain patterns that reflect the differences between authors. We therefore use a BP neural network for identification, with the extracted principal components as input features and a four-bit binary number as the author's number; a certain number of emails from each author are selected for training. We find that when the learning rate is 0.01 and the hidden layer is 50, the test output is the best, and the identification accuracy is 87.5%.
Key words: text feature; principal component analysis; scatter diagram; BP neural network; pattern identification
I. Problem Analysis and Model Establishment
1.1 SPSS principal component analysis
The extracted feature values are entered into SPSS, and principal component analysis is used to reduce the dimension of the feature set.
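As an illustration of the kind of feature values that enter the analysis, the following is a minimal sketch in Python. It is not the authors' extraction code: the function name `lexical_features`, the short-word threshold `short_len=3`, the tokenization rule, and the exact definition of each ratio are assumptions, and only four of the listed features are covered.

```python
import re
from collections import Counter


def lexical_features(text, short_len=3):
    """Compute a few per-email lexical features (hypothetical definitions)."""
    # Assumption: a "word" is a run of letters or apostrophes, case-folded.
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    n = len(words)
    return {
        # Share of words with at most `short_len` characters.
        "short_word_ratio": sum(len(w) <= short_len for w in words) / n,
        # Distinct words divided by total words.
        "type_token_ratio": len(counts) / n,
        # Mean number of characters per word.
        "mean_word_length": sum(len(w) for w in words) / n,
        # Share of the text taken up by the single most frequent word.
        "max_word_share": max(counts.values()) / n,
    }
```

Calling `lexical_features` on each email gives one row of the feature table; the rows for all emails would then be passed to the principal component analysis.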
Intuitively, there appears to be a correlation between the variables, but this needs to be tested. The output of the correlation test shows that the Bartlett sphericity test gives a P value < 0.001. Combining the two indexes, the variables are correlated and can be analyzed by factor analysis. The eigenvalues of components 1 and 2 are greater than 1, and together they explain 79.773% of the variance, which is a good result. We therefore extract components 1 and 2 as the principal components, which capture the main structure of the data.
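The paper performs this step in SPSS. For readers without SPSS, the same two checks can be sketched with NumPy under stated assumptions: the function names are hypothetical, the analysis is run on the correlation matrix, and components are retained when their eigenvalue exceeds 1, matching the rule used above. Only the Bartlett statistic and its degrees of freedom are computed; the P value would come from the chi-square distribution with those degrees of freedom.

```python
import numpy as np


def bartlett_sphericity_statistic(X):
    """Return Bartlett's chi-square statistic and degrees of freedom.

    X is an (n_samples, n_features) array of feature values.
    """
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # log|R| is computed with slogdet for numerical stability.
    _, logdet = np.linalg.slogdet(R)
    stat = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) // 2
    return stat, dof


def pca_kaiser(X):
    """PCA on the correlation matrix, keeping components with eigenvalue > 1."""
    # Standardize each feature so the analysis matches the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    # eigh returns ascending eigenvalues; sort them in descending order.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0
    # Fraction of total variance explained by the retained components.
    explained = eigvals[keep].sum() / eigvals.sum()
    # Project the standardized data onto the retained components.
    scores = Z @ eigvecs[:, keep]
    return eigvals, explained, scores
```

Here `explained` corresponds to the cumulative variance figure reported above, and the columns of `scores` are the component values that can be plotted per author or passed to a classifier.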
In the eight figures, the abscissa represents principal component 2, namely "the ability to identify the author through average sentence length", and the ordinate represents principal component 1, namely "the ability to identify the author through the proportion of different words in the total words". Each figure shows the relationship between these two abilities for one author. From the SPSS output we can see that these two abilities show clear relations and differences across authors. We can therefore use these two components as input parameters for BP neural network training and then identify the authors of the text.
1.2 The solution of neural network
We take the two extracted principal components as the input of the neural network and express the author's number as a four-bit binary number, so an S-shaped logarithmic (log-sigmoid) function is chosen as the transfer function of the output neurons. Through repeated testing, we set the learning rate to 0.01, the maximum number of iterations to 10000, and the hidden layer to 50.
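The configuration described above can be sketched as a small NumPy network. This is an illustrative reconstruction, not the authors' implementation: the names `encode_author`, `decode_author` and `BPNetwork` are hypothetical, "50" is read as 50 units in a single hidden layer, the hidden layer is assumed to use the same sigmoid as the output layer, and training is plain batch gradient descent on the squared error.

```python
import numpy as np


def encode_author(author_id, bits=4):
    """Author number -> fixed-width binary target, most significant bit first."""
    return np.array([(author_id >> k) & 1 for k in reversed(range(bits))],
                    dtype=float)


def decode_author(outputs):
    """Threshold the network outputs at 0.5 and read them as a binary number."""
    bits = (np.asarray(outputs) >= 0.5).astype(int)
    return int("".join(map(str, bits)), 2)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class BPNetwork:
    """Two inputs -> one sigmoid hidden layer -> four sigmoid outputs."""

    def __init__(self, n_in=2, n_hidden=50, n_out=4, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)
        self.lr = lr

    def forward(self, X):
        H = sigmoid(X @ self.W1 + self.b1)
        Y = sigmoid(H @ self.W2 + self.b2)
        return H, Y

    def train(self, X, T, max_iter=10000):
        """Batch gradient descent; returns the mean squared error per iteration."""
        losses = []
        for _ in range(max_iter):
            H, Y = self.forward(X)
            err = Y - T
            losses.append(float(np.mean(err ** 2)))
            # Gradient of 0.5 * sum(err**2), backpropagated through both layers.
            dY = err * Y * (1.0 - Y)
            dH = (dY @ self.W2.T) * H * (1.0 - H)
            self.W2 -= self.lr * H.T @ dY
            self.b2 -= self.lr * dY.sum(axis=0)
            self.W1 -= self.lr * X.T @ dH
            self.b1 -= self.lr * dH.sum(axis=0)
        return losses

    def predict(self, X):
        return [decode_author(y) for y in self.forward(X)[1]]
```

In use, `X` would hold the two principal component scores of each training email and `T` the rows produced by `encode_author` for the corresponding author numbers; `predict` then maps the four outputs of a test email back to an author number.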
After running the neural network algorithm many times, we found that seven of the eight selected authors were essentially identified, an accuracy of 87.5%. We consider that this model can identify the author of an email. Two of the scatter diagrams are shown below:
II. Conclusions
The lexical and structural features used in the model reflect the characteristics of different authors to a certain extent, so the proposed method of identifying email authors from vocabulary and structure is effective. Through principal component analysis and scatter plot analysis, we conclude that the selected lexical features can be used to distinguish different authors, with a recognition rate of up to 87.5%. In training the BP neural network, we found that the number of hidden layers has the greatest impact on the final test accuracy, so determining it accurately is the key factor in BP neural network training. The learning rate of the BP network is the next most influential factor.
(Author affiliation: Shandong University of Technology)