Abstract: The temporal distance between events conveys information essential for many time series tasks such as speech recognition and rhythm detection. While traditional models such as hidden Markov models (HMMs) and discrete symbolic grammars tend to discard such information, recurrent neural networks (RNNs) can in principle learn to make use of it. As an advanced variant of RNNs, long short-term memory (LSTM) offers an alternative (arguably better) mechanism for bridging long time lags. We propose two deep neural network-based models to detect abnormal start-ups and unusual CPU and memory consumption of the application processes running on smart phones. Experimental results show that the proposed neural networks achieve remarkable performance at reasonable computational cost. The speed advantage of neural networks makes them even more competitive for applications requiring real-time response, giving the proposed models potential for practical systems.
Keywords: deep learning; time series analysis; convolutional neural network; RNN
1 Introduction
Deep learning emerged as a successful machine learning technique a few years ago. With deep architectures, it became possible to learn high-level (compact) representations, each of which combines features at lower levels in an exponential and hierarchical way [1]-[3]. A stack of representation layers, learned from the data to optimize the given objective, gives deep neural networks advantages such as generalization to unknown examples [4], discovery of disentangled factors of variation, and sharing of learned representations among multiple tasks [5]. The recent successes of deep convolutional neural networks (CNNs) are mainly based on this ability to learn hierarchical representations of spatial data [6]. For modeling temporal data, the recent resurgence of recurrent neural networks (RNNs) has led to remarkable advances [6]-[11]. Unlike the spatial case, learning representations that are both hierarchical and temporal remains a long-standing challenge for RNNs, despite the fact that hierarchical structures naturally exist in many kinds of temporal data [12]-[15].
Forecasting future values of an observed time series plays an important role in nearly all fields of science and engineering, such as economics, finance, business intelligence, and industrial applications, and there has been extensive research on using machine learning techniques for this task. Several machine learning algorithms have been applied to time series forecasting, including multilayer perceptrons, Bayesian neural networks, k-nearest neighbor regression, support vector regression, and Gaussian processes [16]. The effectiveness of local learning techniques for dealing with temporal data has also been explored [17]. In this study, we detect abnormal start-ups and unusual CPU and memory consumption of the application processes running on smart phones using RNNs, in line with recent efforts to analyze time series data with deep learning in order to extract meaningful statistics and other characteristics of the data. The paper is organized as follows. Section 2 gives an overview of the research goals and conveys an intuition of the key ideas. Section 3 presents the RNN-based models for detecting unusual CPU and memory consumption and abnormal start-ups of application processes running on smart phones. Section 4 reports the results of a number of experiments on real-life devices and shows the effectiveness of the proposed models. The conclusion and future work are summarized in Section 5.
2 Background
Recurrent neural networks are a class of artificial neural networks that possess internal state, or short-term memory, due to recurrent feedback connections, which makes them suitable for sequential tasks such as speech recognition, prediction, and generation [18]-[20]. Traditional RNNs trained with stochastic gradient descent (SGD) have difficulty learning long-term dependencies (i.e., those spanning more than about ten time steps) encoded in the input sequences because of the vanishing gradient problem [21]. This problem has been partly addressed by a specially designed neuron structure, or cell, in long short-term memory (LSTM) networks [21], [22], which keeps a constant backward flow of the error signal; by second-order optimization methods [23], which preserve the gradients by approximating their curvature; and by informed random initialization [24], which allows the networks to be trained with momentum and stochastic gradient descent alone.
In conventional LSTM, each gate receives connections from the input units and the outputs of all cells, but there is no direct connection from the Constant Error Carrousel (CEC) it is supposed to control. All it can observe directly is the cell output, which is close to zero as long as the output gate is closed. The same problem occurs for multiple cells in a memory block: when the output gate is closed, none of the gates has access to the CECs they control. The resulting lack of essential information may harm network performance. Gers, Schraudolph, and Schmidhuber [25] suggested adding weighted “peephole” connections from the CEC to the gates of the same memory block. The peephole connections allow all gates to inspect the current cell state even when the output gate is closed; this information can be essential for finding well-working network solutions. During learning, no error signals are propagated back from the gates to the CEC via the peephole connections, and peephole connections are treated like regular connections to the gates except for the timing of their updates. Gated recurrent units (GRUs), introduced by Cho et al. [9], are a gating mechanism for recurrent neural networks and a variant of LSTM. Their performance on polyphonic music modeling and speech signal modeling was found to be similar to that of LSTM. GRUs have been shown to perform better on smaller datasets because they have fewer parameters than LSTM, as they lack an output gate. There are several variations on the fully gated unit, with gating done using the previous hidden state and the bias in different combinations.
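For reference, the GRU of [9] can be summarized by the following gating equations (one common formulation; implementations differ slightly in where the biases are placed and in the interpolation convention):

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)}\\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh\!\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) && \text{(candidate state)}\\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
```

With the reset gate fixed to 1 and the update gate fixed to 0, the unit reduces to a plain recurrent layer, which illustrates how the gates let the network decide when to keep and when to overwrite its state.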
3 Deep Neural Network-Based Models
Two deep neural network-based models are described in this section. Both tasks mentioned above involve time series forecasting, which uses a model to predict future values based on previously observed values. RNNs and their variants are used to perform these tasks because RNNs can recognize patterns that are defined by temporal distance.
3.1 Unusual CPU and Memory Consumption Detection
The proposed unusual CPU and memory consumption detection model aims at detecting unusual CPU and memory consumption of an application from its resource consumption series and runtime policy in a given time period. The model analyzes the resource consumption data of the application in the specified time period, including the series of its CPU consumption, its memory usage, and its runtime policy, and then outputs the probabilities of the existence of unusual CPU consumption and unusual memory consumption. Since this is a classification problem over time series, we propose a detection model based on an LSTM network that models the time series, followed by a 3-layer neural network that performs the classification.
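A minimal sketch of this architecture in TensorFlow/Keras follows. The paper only states that an LSTM is followed by a 3-layer classifier and that the models are implemented with TensorFlow; the layer sizes and helper names here are illustrative assumptions, not the authors' values.

```python
import tensorflow as tf

def build_detector(num_features: int, hidden_units: int = 64) -> tf.keras.Model:
    """LSTM over a resource-consumption series, followed by a 3-layer
    feed-forward classifier. Sizes are illustrative placeholders."""
    inputs = tf.keras.Input(shape=(None, num_features))          # variable-length series
    x = tf.keras.layers.Masking()(inputs)                        # skip zero-padded steps
    x = tf.keras.layers.LSTM(hidden_units)(x)                    # summarize the series
    x = tf.keras.layers.Dense(32, activation="relu")(x)
    x = tf.keras.layers.Dense(16, activation="relu")(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # P(unusual consumption)
    model = tf.keras.Model(inputs, outputs)
    # Minimizing the negative log-likelihood of binary labels is equivalent to
    # binary cross-entropy; Adam hyper-parameters are those quoted in Section 3.1.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

# CPU model: X_t = (normalized CPU usage, runtime policy); memory model: X_t = normalized memory usage
cpu_model = build_detector(num_features=2)
mem_model = build_detector(num_features=1)
```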
3.1.1 Problem Formalization
The proposed model consists of two parts: an unusual CPU consumption detection model and an unusual memory consumption detection model. For the unusual CPU consumption detection model, given an input time series X = (X_1, X_2, ..., X_t, ..., X_T), where X_t is a sampling point containing the current CPU consumption and runtime policy of an application and T is the length of the input series, the model analyzes the time series, predicts the probability of the existence of unusual resource consumption, and finally assigns the input to one of two classes: unusual resource consumption or non-unusual resource consumption. The unusual memory consumption detection model has the same structure as the unusual CPU consumption detection model, but the sampling points X_t in its input series only include the current memory consumption data.
3.1.2 Normalization of Input Series
At each sampling point, the current CPU consumption is expressed as a percentage, ranging from 0 to 100, and the current memory consumption ranges from 0 to 10^6 KB. These values are all positive, and their ranges are too large for a neural network-based model to handle directly. Therefore, we introduce a normalization step before the proposed model processes the input data. We calculate the average and standard deviation of the CPU consumption (avg_cpu, stdv_cpu) and of the memory consumption (avg_mem, stdv_mem) over all the sampling points in the training set, and then normalize the resource consumption values in each sampling point by subtracting the average from the original value and dividing the result by the standard deviation:
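In symbols, this z-score normalization can be written as follows (a reconstruction of the step described above; the notation of the original equation may differ):

```latex
\hat{x}^{\mathrm{cpu}}_t = \frac{x^{\mathrm{cpu}}_t - \mathit{avg}_{\mathrm{cpu}}}{\mathit{stdv}_{\mathrm{cpu}}},
\qquad
\hat{x}^{\mathrm{mem}}_t = \frac{x^{\mathrm{mem}}_t - \mathit{avg}_{\mathrm{mem}}}{\mathit{stdv}_{\mathrm{mem}}}
```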
We trained our model by minimizing the negative log-likelihood over all the samples of the training set, and the model parameters were optimized by Adam [26] with the hyper-parameters recommended by its authors (i.e., learning rate = 0.001, β1 = 0.9, β2 = 0.999).
3.2 Abnormal Start-Up Prediction
To conserve the limited resources of a smart phone, one approach is to stop application processes that are started abnormally. By predicting whether a start-up of an application process is abnormal, memory and computational resources can be allocated more efficiently, which further improves the performance of the smart phone. The start-up prediction problem can be viewed as a binary classification problem, and we propose a hybrid system consisting of a rule-based model and a deep learning-based model to solve it.
In this section, we first formalize the prediction problem and introduce a set of rules that label every start-up with either a NORMAL or an ABNORMAL tag by leveraging the full information contained in the given data set. A hybrid system is then constructed to solve the prediction problem. Finally, the deep learning model is described in more detail.
3.2.1 Problem Formalization
The data can be labeled by applying six rules, R1 to R6, in order (they are omitted here for security reasons). Following these rules, the raw data are tagged with binary labels. The rules can be divided into two parts according to whether a rule can be used at inference time. The first part consists of the rules {R1, R2, R3}, which infer the label from previous logs without any future information. Rule R4 labels the data using future information, which is not available at inference time; therefore R4, together with the rules of lower priority, forms the second part. The first part can be handled directly by the rules {R1, R2, R3}, while the second part must be determined by a probabilistic machine learning model.
3.2.2 Hybrid Model
As shown in the previous section, abnormal start-up prediction is defined as a binary classification problem, aiming at predicting whether a start-up of an application process is abnormal given a set of logs. We propose a hybrid model to solve this problem.
The model consists of two parts: a rule-based part and a deep learning-based part. The targeted start-up, together with the previous logs, is first fed into the rule-based model, which is defined by the rules {R1, R2, R3} and generates one of three possible results: NORMAL, ABNORMAL, or UNDETERMINED. The first two are deterministic and are output directly by the hybrid model. The last result indicates that the rule-based model is incapable of predicting the outcome because it is governed by {R4, R5, R6}. Such UNDETERMINED data are therefore fed into the deep learning model, which outputs a probability reflecting the likelihood that the start-up is NORMAL.
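A minimal sketch of this dispatch logic is shown below; because the rules themselves are withheld in the paper, `apply_rules` and `dl_model` are hypothetical placeholders, and the decision threshold is an assumption.

```python
from enum import Enum

class RuleResult(Enum):
    NORMAL = "NORMAL"
    ABNORMAL = "ABNORMAL"
    UNDETERMINED = "UNDETERMINED"

def predict_startup(startup, previous_logs, apply_rules, dl_model, threshold=0.5):
    """Hybrid prediction: try the rules {R1, R2, R3} first, and fall back to
    the deep learning model only for UNDETERMINED cases."""
    result = apply_rules(startup, previous_logs)    # returns a RuleResult
    if result is not RuleResult.UNDETERMINED:
        return result                               # deterministic answer from the rules
    p_normal = dl_model(startup, previous_logs)     # probability that the start-up is NORMAL
    return RuleResult.NORMAL if p_normal >= threshold else RuleResult.ABNORMAL
```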
3.2.3 Deep Learning Model
The deep learning model provides the probability that a start-up which cannot be determined by the rules from previous logs is NORMAL. In this situation, the deep learning model is designed to predict whether the process about to start up will die in the near future, namely within one second.
Unlike the rule-based model, the deep learning model cannot map a log to a predicted class directly. Although a deep learning model is capable of fitting raw data, logs without pre-processing can be full of noise, which may result in poor performance because the noise can also be learned by the model. To reduce such noise in the input of the deep learning model, a feature extraction component is introduced to compute a set of features from the original logs by leveraging human prior knowledge. The necessity of these features is demonstrated by preliminary experiments. The five extracted features are listed in Table 2.
There are three layers in the neural network. The first consists of five parallel embedding layers, each of which transforms a respective feature into a dense vector representation. These vectors are then combined with an addition operation; this layer is designed to merge the parallel features into a joint feature. Finally, a logistic regression takes the joint feature as input to estimate the probability that the start-up is NORMAL. The trainable parameters are those of all the embedding layers and of the logistic regression. The cross-entropy is used as the loss function, and the Adam optimizer is used to learn the parameters of the deep learning model.
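A minimal TensorFlow/Keras sketch of this architecture follows, assuming the five extracted features are categorical IDs; only the embedding dimensionality of 20 (Section 4.2) is taken from the paper, while the vocabulary sizes and helper names are illustrative assumptions.

```python
import tensorflow as tf

def build_startup_model(vocab_sizes, embedding_dim=20):
    """Five parallel embeddings -> element-wise addition -> logistic regression.
    `vocab_sizes` lists one vocabulary size per extracted feature."""
    inputs, embedded = [], []
    for i, vocab in enumerate(vocab_sizes):
        inp = tf.keras.Input(shape=(1,), dtype="int32", name=f"feature_{i}")
        emb = tf.keras.layers.Embedding(vocab, embedding_dim)(inp)    # (batch, 1, dim)
        embedded.append(tf.keras.layers.Flatten()(emb))               # (batch, dim)
        inputs.append(inp)
    joint = tf.keras.layers.Add()(embedded)                           # combine the parallel features
    p_normal = tf.keras.layers.Dense(1, activation="sigmoid")(joint)  # logistic regression
    model = tf.keras.Model(inputs, p_normal)
    # Cross-entropy loss and the Adam optimizer, as described in Section 3.2.3.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```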
4 Experiments
We conducted experiments on both tasks to evaluate our models. In the following sections, we first describe the datasets we used and then show the performance of the proposed models on both tasks.
4.1 Data Sets and Preprocessing
For the unusual CPU and memory consumption detection task, we prepared a dataset containing 992 896 resource consumption series in the training set and 653 233 series in the testing set. The length of these series ranges from 1 to 12, and the average length is about 9.
Unusual CPU consumption occurs in 2.0% of the series in the training set, and unusual memory consumption occurs in 1.6%. The ratio between negative and positive samples is thus highly imbalanced. To prevent our model from being biased toward the majority class, we used a sampling strategy during training to ensure that the model learns from an equal number of positive and negative samples, as sketched below. During testing, we normalized the input series with the average and standard deviation calculated from the training set.
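One simple way to realize such a balanced sampling strategy is sketched here (an illustration of the general idea, not the authors' exact procedure; the helper function is an assumption, with the batch size of 64 taken from Section 4.2):

```python
import random

def balanced_batches(positive_series, negative_series, batch_size=64):
    """Yield mini-batches with an equal number of positive and negative series,
    resampling the (rare) positive class with replacement."""
    half = batch_size // 2
    while True:
        batch = random.sample(negative_series, half) + random.choices(positive_series, k=half)
        random.shuffle(batch)
        yield batch
```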
4.2 Unusual CPU and Memory Consumption Detection
The deep learning model is implemented with TensorFlow. We set the embedding dimensionality of all features to 20 and the mini-batch size to 64. The hyper-parameters of the Adam optimizer were set to their default values. The results of our experiment are listed in Table 6, which shows that our model performed extremely well on the machine-labelled data set.
5 Conclusions
In this paper, we have described deep neural network-based models for detecting abnormal application start-ups and unusual CPU and memory consumption of the application processes running on Android systems. A variant of the recurrent neural network architecture with multiple layers was implemented and tested systematically. The experimental results showed that the proposed neural networks performed reasonably well on the two tasks, indicating the potential of the proposed networks for practical time series analysis and other similar tasks. The number of parameters in neural network-based models is usually much smaller than in competing models such as conditional random fields (CRFs). Moreover, once trained, the neural network-based models require only the four basic arithmetic operations to run, so they run considerably faster and require much less memory than other models. This speed advantage makes them even more competitive for applications requiring real-time response, especially applications deployed on smart phones.
References
[1] BENGIO Y. Learning Deep Architectures for AI [J]. Foundations and Trends in Machine Learning, 2009, 2(1): 1-127. DOI: 10.1561/2200000006
[2] LECUN Y, BENGIO Y, HINTON G. Deep Learning [J]. Nature, 2015, 521(7553): 436-444. DOI: 10.1038/nature14539
[3] SCHMIDHUBER J. Deep Learning in Neural Networks: An Overview [J]. Neural Networks, 2015, 61: 85-117. DOI: 10.1016/j.neunet.2014.09.003
[4] HOFFMAN J, TZENG E, DONAHUE J, et al. One-Shot Adaptation of Supervised Deep Convolutional Models [EB/OL]. (2013-12-21)[2018-04-15]. http://arxiv.org/abs/1312.6204
[5] KINGMA D P, WELLING M. Auto-Encoding Variational Bayes [EB/OL]. (2013-12-20)[2018-04-15]. https://arxiv.org/abs/1312.6114
[6] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet Classification with Deep Convolutional Neural Networks [J]. Communications of the ACM, 2017, 60(6): 84-90. DOI: 10.1145/3065386
[7] MIKOLOV T, SUTSKEVER I, DEORAS A, et al. Subword Language Modelling with Neural Networks [EB/OL]. (2012)[2018-04-15]. http://www.fit.vutbr.cz/~imikolov/rnnlm/char.pdf
[8] GRAVES A. Generating Sequences With Recurrent Neural Networks [EB/OL]. (2013-08-04)[2018-04-15]. https://arxiv.org/abs/1308.0850
[9] CHO K, MERRIENBOER B van, GULCEHRE C, et al. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation [C]//Conference on Empirical Methods in Natural Language Processing. Doha, Qatar, 2014. DOI: 10.3115/v1/D14-1179
[10] SUTSKEVER I, VINYALS O, LE Q V. Sequence to Sequence Learning with Neural Networks [M]//Advances in Neural Information Processing Systems. Cambridge, USA: The MIT Press, 2014: 3104-3112
[11] VINYALS O, TOSHEV A, BENGIO S, et al. Show and Tell: A Neural Image Caption Generator [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA, 2015: 3156-3164. DOI: 10.1109/CVPR.2015.7298935
[12] MOZER M C. Induction of Multiscale Temporal Structure [C]//Proc. 4th International Conference on Neural Information Processing Systems. San Francisco, USA: Morgan Kaufmann Publishers Inc., 1991: 275-282.
[13] HIHI S E, BENGIO Y. Hierarchical Recurrent Neural Networks for Long?Term Dependencies [C]//Proc. 8th International Conference on Neural Information Processing Systems. Cambridge, USA: MIT Press, 1995: 493-499
[14] LIN T, HORNE B G, TINO P, et al. Learning Long-Term Dependencies in NARX Recurrent Neural Networks [J]. IEEE Transactions on Neural Networks, 1996, 7(6): 1329-1338. DOI: 10.1109/72.548162
[15] KOUTNÍK J, GREFF K, GOMEZ F, et al. A Clockwork RNN [C]//31st International Conference on Machine Learning. Beijing, China, 2014: 1863-1871
[16] AHMED N K, ATIYA A F, GAYAR N E, et al. An Empirical Comparison of Machine Learning Models for Time Series Forecasting [J]. Econometric Reviews, 2010, 29(5/6): 594-621. DOI: 10.1080/07474938.2010.481556
[17] BONTEMPI G, BEN TAIEB S, LE BORGNE Y A. Machine Learning Strategies for Time Series Forecasting [M]//BONTEMPI G, BEN TAIEB S, LE BORGNE Y A, eds. Business Intelligence. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013: 62-77. DOI: 10.1007/978-3-642-36318-4_3
[18] ROBINSON A J, FALLSIDE F. The Utility Driven Dynamic Error Propagation Network: CUED/F-INFENG/TR.1 [R]. Cambridge, UK: Cambridge University, Engineering Department, 1987
[19] WERBOS P J. Generalization of Backpropagation with Application to a Recurrent Gas Market Model [J]. Neural Networks, 1988, 1(4): 339-356. DOI: 10.1016/0893-6080(88)90007-x
[20] WILLIAMS R J. Complexity of Exact Gradient Computation Algorithms for Recurrent Neural Networks: NUCCS-89-27 [R]. Boston, USA: Northeastern University, College of Computer Science, 1989
[21] HOCHREITER S, BENGIO Y, FRASCONI P, et al. Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies [M]//KREMER S C, KOLEN J F, eds. A Field Guide to Dynamical Recurrent Networks. Hoboken, USA: IEEE Press, 2001
[22] HOCHREITER S, SCHMIDHUBER J. Long Short-Term Memory [J]. Neural Computation, 1997, 9(8): 1735-1780. DOI: 10.1162/neco.1997.9.8.1735
[23] MARTENS J, SUTSKEVER I. Learning Recurrent Neural Networks with Hessian-Free Optimization [C]//28th International Conference on Machine Learning. Bellevue, USA, 2011: 1033-1040
[24] SUTSKEVER I, MARTENS J, DAHL G E, et al. On the Importance of Initialization and Momentum in Deep Learning [C]//30th International Conference on Machine Learning. Atlanta, USA, 2013: 1139-1147
[25] GERS F A, SCHRAUDOLPH N N, SCHMIDHUBER J. Learning Precise Timing with LSTM Recurrent Networks [J]. Journal of Machine Learning Research, 2002, 3:115-143
[26] KINGMA D P, BA J. Adam: A Method for Stochastic Optimization [EB/OL]. (2014-12-22)[2018-04-15]. https://arxiv.org/abs/1412.6980