A Novel Anti-Obfuscation Model for Detecting Malicious Code

Yuehan Wang Tong Li YongQuan Cai Zhenghu Ning Fei Xue

Abstract

In this article, the authors present a new malicious code detection model. The model improves on typical n-gram feature extraction algorithms, which are easily defeated by obfuscation. Specifically, the proposed model dynamically identifies obfuscation features and then adjusts the selection of meaningful features to improve the subsequent machine learning analysis. The experimental results show that the feature database, which is built with the proposed feature selection and cleaning methods, contains a stable number of features and automatically discards obfuscation features. Overall, the proposed detection model offers long-lasting effectiveness, broad applicability, and high identification accuracy.

Keywords:

Malicious Code, Machine Learning, Feature Extraction, Obfuscation

1. INTRODUCTION

Malicious code is software intended to damage or disable computers and computer systems; it includes Trojans, ransomware, spyware, and so on. According to Symantec (2015), more than 44.5 million new pieces of malware were created in May 2015. One of the main reasons for this high volume of malware samples is the extensive use of obfuscation and metamorphic techniques by malware developers. As a result, most new malicious code can be grouped into several families according to the original code from which it derives. Malicious code detection technologies are usually based on features that represent the original software code, so members of the same malware family should share the same features (e.g., Wołkowicz & Kešelj (2013) and Preda & Giacobazzi (2005)). By extracting the family features of each malware family, a defense system can construct a feature database for detecting variants. However, obfuscation techniques help variants escape detection by interfering with feature extraction. For example, in the defense system of Lu, Wang, Zhao, Wang, and Su (2013), which extracts key strings as features, variants escape detection by replacing a key string with an equivalent one or by adding invalid strings. Many scholars (Shafiq, Tabish, Mirza, & Farooq, 2009; Sung, Xu, Chavez, & Mukkamala, 2004; Gaudesi, Marcelli, Sanchez, Squillero, & Tonda, 2016; Tabish, Shafiq, & Farooq, 2009) have proposed various feature extraction methods to defend against this kind of obfuscation, but such extraction methods can in turn be broken by emerging obfuscation techniques. Moreover, more effective extraction methods tend to consume excessive computing resources and degrade real-time performance. Machine learning models (Tahan, Rokach, & Shahar, 2012; Narouei, Ahmadi, Giacinto, Takabi, & Sami, 2015; O.W.D.C., 1992) have also been applied to malicious code detection and have achieved good results.
Using the feature database and the labels, the model trains a set of classifiers to identify variants. However, the accuracy of a machine learning model depends on the quality of the feature database, so the feature extraction method determines the accuracy of the model. When the extraction method is broken, obfuscation techniques (Nataraj, Karthikeyan, Jacob, & Manjunath, 2011; Fredrikson, Jha, Christodorescu, Sailer, & Yan, 2010; Svetnik et al., 2003) fill the feature database with obfuscation features and the accuracy is seriously degraded. For machine learning models used in malicious code detection, ensuring the effectiveness of the feature database is therefore an essential research task. In particular, because of the rapid growth of malicious code, the useful lifetime of a feature extraction method is becoming shorter and shorter, and it is increasingly difficult to maintain system security by repeatedly replacing the extraction method. In this article, we propose a method that ensures the effectiveness of the feature database by cleaning the database rather than changing the extraction method. The method consists of obfuscation feature cleaning and feature selection, and the final database is used to train a random forest. The main contributions of this paper are summarized as follows:

1. An algorithm based on multi-sample analysis is proposed to identify obfuscation features dynamically. A number of samples are analyzed in detail to build a linear regression model, which is then used to compute the obfuscation feature threshold dynamically for each sample.

2. A feature selection algorithm is proposed to select family features. The method first normalizes the feature vector of each sample and then identifies the family features according to the size of the input data set.

3. A malicious code detection model is implemented. The model uses the random forest algorithm to further reduce the effect of obfuscation techniques and to improve data utilization. The detection result is decided by a vote among the classifiers.

2. METHODOLOGY

This paper focuses on cleaning the feature database of a machine-learning-based malicious code detection system. The n-gram algorithm is one of the earliest feature extraction algorithms for malicious code, and the features it extracts are comparatively readable and interpretable. Since no feature extraction algorithm can be guaranteed never to be broken, this paper uses n-gram feature extraction to build the feature database. Modern obfuscation techniques invalidate the n-gram algorithm and fill the feature database with obfuscation features and noisy features, so the final feature database is produced by an obfuscation feature cleaning method and a feature selection method. Together, these two cleaning methods make the feature database resistant to interference and automatically replace bad features as the number of training samples grows. First, a linear regression model is constructed to identify and clean obfuscation features dynamically. Second, a normalization step makes the range of feature values uniform so that feature selection is not affected by that range; the feature selection method then guides the training database toward the family features. Finally, the random forest algorithm is used to construct a set of classifiers, and for each test sample the final result is decided by a vote among them. The overall flow chart of the model is shown in Figure 1:

2.1. Initial Feature Database Construction

The n-gram algorithm is one of the most basic feature extraction methods for malicious code. It places modest demands on sample acquisition and on computational resources, and the features it extracts can be interpreted in terms of the real semantics of the code. The model characterizes the importance of each feature by counting the number of times it occurs in a single sample. Obfuscation disturbs the opcode sequence and produces many obfuscation features, greatly changing the feature distribution of the sample. The values of obfuscation features are very large, so these features have a disproportionate influence on the model.

A disassembled “.asm” file of malicious code consists mainly of the segment identifier, memory address, bytecode, opcode, and operands. A disassembly fragment is shown in Figure 2: segment is the segment to which the current instruction belongs, address is the memory address, bytes is the corresponding hexadecimal code, opcode is the operation code, and operands are the parameters passed to the instruction. For a disassembled file, we can extract the opcodes by locating the “.text” segment and then obtain the n-gram features of the malicious code.
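The extraction step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the line layout (`.text:ADDRESS BYTES opcode operands`) is an assumed form, real disassembler output varies, and the function names `extract_opcodes` and `opcode_ngrams` are hypothetical, not taken from the paper.

```python
import re
from collections import Counter

# Assumed line layout (real listings differ between disassemblers):
#   .text:00401001 8B EC             mov     ebp, esp
LINE_RE = re.compile(
    r"^\.text:[0-9A-Fa-f]+\s+"   # segment name and memory address
    r"(?:[0-9A-Fa-f]{2}\s+)+"    # one or more hexadecimal byte pairs
    r"([a-z][a-z0-9]*)\b"        # opcode mnemonic (captured)
)

def extract_opcodes(lines):
    """Return the opcode sequence found in the .text segment lines."""
    opcodes = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            opcodes.append(match.group(1))
    return opcodes

def opcode_ngrams(opcodes, n=3):
    """Count every run of n consecutive opcodes (sliding window of size n)."""
    return Counter(
        tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)
    )

if __name__ == "__main__":
    sample = [
        ".text:00401000 55                push    ebp",
        ".text:00401001 8B EC             mov     ebp, esp",
        ".text:00401003 83 EC 08          sub     esp, 8",
        ".text:00401006 50                push    eax",
    ]
    for gram, count in opcode_ngrams(extract_opcodes(sample)).items():
        print(gram, count)
```

Each key of the returned counter is one n-gram feature and its count is the feature value for the sample.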

2.2. Obfuscation Features Cleaning Method

The feature database is full of obfuscation and noise features. If this database is used for training directly, the model is likely to overfit severely and fail to detect future variants. It is therefore necessary to clean the obfuscation features from the database. Because the training samples are complex, the obfuscation techniques used differ from sample to sample, and so does the feature distribution. The obfuscation threshold must therefore be calculated dynamically for each sample: the goal is to find the smallest value that marks a feature as obfuscated and to clean every feature whose value exceeds it.

The obfuscation threshold (ξ) changes dynamically from sample to sample. To measure and characterize it effectively, we define two indicators: the feature expected value (Feature_averages) and the feature standard value (Feature_median). Both are computed dynamically from a single sample and describe the distribution of features in that sample. α and β are the weights of the feature expected value and the feature standard value, respectively, in the obfuscation threshold. The threshold is calculated as shown in Equation (1), which expresses it as a weighted combination of the two indicators:

$$\xi = \alpha \cdot \mathrm{Feature}_{\mathrm{averages}} + \beta \cdot \mathrm{Feature}_{\mathrm{median}} \tag{1}$$

For a malicious code sample, the feature expected value (Feature_averages) represents the ideal value of a feature in the original, unobfuscated state of the sample. However, the n-gram algorithm naturally extracts many noisy features that occur only a few times. If all features were used directly, the expected value would be seriously underestimated. The noisy features themselves are removed later by the feature selection method; the method in this section removes only the large feature values. Therefore, when calculating the expected value (Feature_averages), duplicate feature values are removed first and the remaining values are then averaged. The expected value is calculated as shown in Equation (2), where m is the number of feature values remaining after duplicates are removed and feature_i is the i-th of these values.

$$\mathrm{Feature}_{\mathrm{averages}} = \frac{1}{m}\sum_{i=1}^{m} \mathrm{feature}_i \tag{2}$$

The feature expected value can only describe the feature distribution of the current sample. To describe the actual semantics of the sample, this paper introduces the feature standard value (Feature_median), which is the median of the feature values in the sample. It reflects the ideal value of a feature when the sample is undisturbed. The distribution of features in a malicious code sample tends to be Gaussian, and obfuscation features make up only a very small proportion of it, so the range of the ideal feature value can be obtained from the median of the distribution. The feature standard value is calculated as shown in Equation (3), where m is the number of feature values remaining after duplicates are removed, feature_i is the i-th of these values, and mid returns the median of a sequence.

$$\mathrm{Feature}_{\mathrm{median}} = \mathrm{mid}(\mathrm{feature}_1, \mathrm{feature}_2, \ldots, \mathrm{feature}_m) \tag{3}$$
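As a small worked illustration of Equations (1) to (3), the sketch below computes the two indicators and the threshold for one sample. It assumes that both indicators are taken over the distinct feature values, as the definition of m suggests; the function name and the values of α and β in the example are hypothetical and not taken from the paper.

```python
from statistics import mean, median

def obfuscation_threshold(counts, alpha, beta):
    """Compute xi = alpha * Feature_averages + beta * Feature_median.

    counts: n-gram occurrence counts of one sample.
    alpha, beta: weights; in the paper they are learned by linear
    regression, here the caller simply supplies them.
    """
    distinct = sorted(set(counts))        # remove duplicate feature values
    feature_averages = mean(distinct)     # Equation (2)
    feature_median = median(distinct)     # Equation (3)
    return alpha * feature_averages + beta * feature_median  # Equation (1)

if __name__ == "__main__":
    # Distinct values are 1, 2, 3, 4, 500: average 102, median 3.
    counts = [1, 1, 2, 2, 3, 4, 500]
    # alpha and beta below are illustrative values only.
    print(obfuscation_threshold(counts, alpha=0.5, beta=1.0))
```

With these illustrative weights the single very large count (500) lies above the threshold while the ordinary counts lie below it, which is the behavior the threshold is meant to capture.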

With the two indicators defined above, the obfuscation threshold of each malicious sample can be described dynamically. However, the sample set is usually very large, and it is very difficult to analyze all the samples manually to find appropriate weights (α, β). Linear regression is a statistical method that determines the quantitative relationship between two or more variables. This paper therefore uses linear regression to learn the parameters (α, β) and to calculate the obfuscation threshold of each sample. Each sample in the set carries only a family label, which is of no use for identifying obfuscation features. Thus, a small number of samples must first be studied and labeled with their obfuscation thresholds, and the linear regression equation is then trained on these labels. The final equation can be used to predict the obfuscation threshold of unknown samples. The main steps are as follows:

1. Randomly select a number of malicious code sample files.
2. Manually analyze the sample files to determine the obfuscation threshold of each.
3. Extract the feature expected value and the feature standard value of each sample.
4. Use 70% of the samples as training samples and 30% as test samples.
5. Use the feature expected value and the feature standard value as the model inputs and the obfuscation threshold as the prediction target, and train the linear regression model.
6. Use the resulting linear regression equation to describe the obfuscation threshold of each sample dynamically.
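The steps above can be sketched as a small least-squares fit. This is a minimal illustration, not the authors' implementation: it assumes a model without an intercept, matching the form of Equation (1), and the helper names and the labeled rows in the example are made up for demonstration.

```python
import random

def split_70_30(rows, seed=0):
    """Shuffle labeled rows and split them 70% / 30% (training / test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def fit_alpha_beta(rows):
    """Least-squares fit of xi = alpha * avg + beta * med (no intercept).

    rows: (feature_averages, feature_median, labeled_threshold) triples.
    Solves the 2x2 normal equations directly with Cramer's rule.
    """
    saa = sum(a * a for a, _, _ in rows)
    smm = sum(m * m for _, m, _ in rows)
    sam = sum(a * m for a, m, _ in rows)
    sax = sum(a * x for a, _, x in rows)
    smx = sum(m * x for _, m, x in rows)
    det = saa * smm - sam * sam
    if abs(det) < 1e-12:
        raise ValueError("indicators are collinear; cannot fit alpha and beta")
    alpha = (sax * smm - smx * sam) / det
    beta = (smx * saa - sax * sam) / det
    return alpha, beta

def predict_threshold(alpha, beta, feature_averages, feature_median):
    """Apply the fitted equation to an unseen sample."""
    return alpha * feature_averages + beta * feature_median

if __name__ == "__main__":
    # Hypothetical manually labeled samples: (average, median, threshold).
    labeled = [(10, 3, 11), (20, 4, 18), (30, 9, 33), (40, 5, 30),
               (50, 7, 39), (60, 2, 34), (70, 8, 51), (80, 6, 52),
               (90, 1, 47), (100, 10, 70)]
    train, test = split_70_30(labeled)
    alpha, beta = fit_alpha_beta(train)
    for avg, med, xi in test:
        print(xi, predict_threshold(alpha, beta, avg, med))
```

If an intercept term were wanted, the same idea extends to a 3x3 system; the no-intercept form is used here only because Equation (1) has none.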

2.3. Feature Selection Method

The n-gram algorithm extracts many noisy features, which blur the data characteristics and increase the computational overhead of the model. Although most small feature values are noise, some of them are important family features of the malicious code. If all small features were removed, the accuracy of the model would inevitably suffer. This paper therefore quantifies the importance of a feature relative to the size of the input training set. If a small feature is noise, it appears in only a few samples of the training set; if it is a family feature, it recurs across the samples of the same family. Family features can therefore be selected by summing each feature over all samples. Before the selection scheme is constructed, the features are normalized. Because malicious code samples are diverse, the range of feature values differs from sample to sample, so the same feature value can carry different importance in different samples. To remove the effect of these differing ranges on the final measure, this paper normalizes each feature by its share of the sample: for a single sample, the importance of each feature is its value divided by the sum of all feature values in that sample. The normalization is given in Equation (4), where m is the number of features in the current sample and feature'_i is the value of the i-th feature after normalization.

$$\mathrm{feature}'_i = \frac{\mathrm{feature}_i}{\sum_{j=1}^{m} \mathrm{feature}_j} \tag{4}$$
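The normalization in Equation (4) amounts to dividing each count by the total count of the sample. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def normalize_sample(counts):
    """Divide each feature value by the sum of all values in the sample.

    counts: mapping from feature (for example an opcode n-gram) to its count.
    Returns a new mapping whose values sum to 1 (empty input stays empty).
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {feature: value / total for feature, value in counts.items()}

if __name__ == "__main__":
    # Illustrative counts only.
    print(normalize_sample({"push mov sub": 2, "mov sub push": 6, "sub push call": 2}))
```

After this step a feature value is a share of the sample rather than a raw count, so samples of very different sizes become comparable.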

From Equation (4), the features of a single sample sum to 1 in the normalized training feature database, so for S samples the sum of all features is S. This paper proposes a feature selection method based on the number of input samples S and the number of malicious code families n in the training set. Because family features recur within a family, their accumulated values grow large. As shown in Equation (5), the final value of a feature is the sum of its normalized values over all sample files: Feature_i is the final value of the i-th feature, S is the number of training samples, and feature'_{i,s} is the normalized value of the i-th feature in sample s.

$$\mathrm{Feature}_i = \sum_{s=1}^{S} \mathrm{feature}'_{i,s} \tag{5}$$

By studying a small number of samples, we find that the number of features extracted from a single sample is usually much larger than 100 and fluctuates around 1000. A valid feature should therefore account for more than 0.001 of the feature sequence of a sample. Each final feature value is the sum of the corresponding values over all samples, and a feature shared by one family appears in about S/n samples when the families are of similar size. The features finally selected for training should therefore satisfy Equation (6), where Feature_select is the value of a feature selected for training, S is the total number of training samples, and n is the number of malicious code families in the training set.

$$\mathrm{Feature}_{\mathrm{select}} > \frac{0.001 \cdot S}{n} \tag{6}$$
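Equations (5) and (6) can be combined into one short selection routine. The sketch below is illustrative only; the function and parameter names are hypothetical, and the input is assumed to be already normalized per sample as in Equation (4).

```python
from collections import defaultdict

def select_family_features(normalized_samples, n_families, min_share=0.001):
    """Keep features whose accumulated value exceeds S * min_share / n.

    normalized_samples: list of mappings (feature -> normalized value),
        one mapping per training sample.
    n_families: number of malware families n in the training set.
    min_share: per-sample share of a valid feature (0.001 in the paper).
    """
    totals = defaultdict(float)
    for sample in normalized_samples:          # Equation (5): sum over samples
        for feature, value in sample.items():
            totals[feature] += value
    cutoff = len(normalized_samples) * min_share / n_families   # Equation (6)
    return {feature for feature, total in totals.items() if total > cutoff}

if __name__ == "__main__":
    # Four tiny hypothetical samples from two families.
    samples = [
        {"fam": 0.0015, "bulk": 0.9985},
        {"fam": 0.0015, "bulk": 0.9985},
        {"noise": 0.001, "bulk": 0.999},
        {"bulk": 1.0},
    ]
    print(sorted(select_family_features(samples, n_families=2)))
```

Because the cutoff scales with S, a feature that appears in only one sample falls further below it as the training set grows, while a recurring family feature keeps pace with it.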

2.4. Classifier Cluster Construction

Because malicious code variants are complex, it is very difficult to remove every invalid feature from the final feature database. The training set is therefore sampled with replacement, which improves the generalization and diversity of the data sets and makes the final detection system more robust to interference. The random forest is an improved model based on the decision tree. It constructs multiple decision trees, each from a randomly selected combination of samples and features. The samples are drawn from the malicious code database by random sampling with replacement. The resulting different sample sets maximize the utilization of the database, which improves the ability of the model to detect future variants of malicious code. The random forest model is shown in Figure 3:

When the random forest is trained, it randomly chooses different features to build each tree. Some bad features cannot be cleaned from the training feature database, so a few obfuscation features remain. For a single tree in the model, the input features may include obfuscation features, so some of the resulting classifiers may perform poorly. However, after the preceding cleaning step, the vast majority of obfuscation features have been removed from the training database, and the remaining bad features make up only a small proportion of it. Random feature selection further reduces the effect of these bad features on the final classification.
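The three ingredients described in this section (sampling with replacement, random feature subsets, and voting) can be illustrated without any machine learning library. The sketch below deliberately omits the decision trees themselves; the function names are hypothetical, and in practice a library implementation of random forests would normally be used instead.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (some repeat, some are left out)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features for one tree."""
    return rng.sample(feature_names, k)

def majority_vote(predictions):
    """Return the most common class label among the trees' predictions.

    On a tie, the label that was seen first wins.
    """
    return Counter(predictions).most_common(1)[0][0]

if __name__ == "__main__":
    rng = random.Random(0)
    rows = list(range(10))                       # stand-ins for training samples
    features = ["f1", "f2", "f3", "f4"]          # stand-ins for n-gram features
    print(bootstrap_sample(rows, rng))
    print(random_feature_subset(features, 2, rng))
    print(majority_vote(["Ramnit", "Gatak", "Ramnit"]))
```

A bad feature that survives cleaning is only visible to the trees whose random subset happens to include it, so its influence on the vote is diluted.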

3. EXPERIMENTAL DATA AND RESULTS ANALYSIS

3.1. Experimental Environment and Test Data

The anti-obfuscation model is evaluated on a large malware data set provided by Microsoft. The set contains known malware files representing a mix of 9 different families. Each malware file has an Id, a 20-character hash value that uniquely identifies the file, and a Class, an integer representing one of the 9 family names to which the malware may belong: 1. Ramnit, 2. Lollipop, 3. Kelihos_ver3, 4. Vundo, 5. Simda, 6. Tracur, 7. Kelihos_ver1, 8. Obfuscator.ACY, 9. Gatak. For each file, the raw data contains the hexadecimal representation of the file's binary content, without the PE header (to ensure sterility). The total number of malicious code samples is 10868. The test set of malicious code samples is described in Figure 4:

3.2. Experimental Design

This experiment mainly verifies the validity of the two feature processing methods described above. Following previous papers (Svetnik et al., 2003; Latinne, Debeir, & Decaestecker, 2001), the sliding window n of the n-gram algorithm is set to 3 and the number of trees in the random forest is set to 100. The control variables in this experiment are the size of the input set, the method of cleaning obfuscated features, the method of feature selection, and the method of normalization.

The method of cleaning obfuscated features affects the selection method. Therefore, the first part of the experiment is carried out without obfuscation feature cleaning and tests the impact of different feature selection methods on the model. Sample set sizes of 1000 and 2000 are used to test the effect of set size and scheme on accuracy when samples are scarce, and a size of 5000 is used to find the upper limit of accuracy for each scheme when samples are abundant. The experiment also verifies the validity of normalization by comparing the A and B series of schemes. The selection criteria are adjusted manually according to an analysis of the data set. The feature selection schemes and the corresponding parameters are shown in Table 1:

The A1 and A2 schemes are used to verify the effectiveness of manual analysis and serve as a baseline for the other schemes. The B1 and B2 schemes are used to verify the effectiveness of normalization. The C1 scheme is the feature selection method proposed in this paper.

To evaluate the obfuscation feature cleaning method, this paper fixes the feature selection method to C1 and tests the effect of different cleaning schemes on accuracy. The A1 and A2 schemes select features whose summed values are larger than 300 or 500; accordingly, the E1 and E2 schemes clean feature values larger than 300 or 500. The obfuscation feature cleaning schemes are shown in Table 2:

The E1 and E2 schemes are mainly used to compare the impact of different cleaning conditions on the final result. The F1 and F2 schemes share the same condition: both use ξ as the obfuscation feature threshold. Besides being compared with the conditions of E1 and E2, they are used to test the effect of different cleaning actions on the final result.

3.3. Experimental Results and Analysis

First, this paper tests the accuracy of each scheme on different data sets to verify the effectiveness of the random forest algorithm. Because the random forest constructs each training subset by random sampling, the accuracy differs from one trained model to the next. Figure 5 shows the relationship between the number of test runs and the accuracy for models built with the A1 scheme on input data sets of different sizes.

The abscissa is the index of the model construction run and the ordinate is the accuracy of the model. Figure 5 shows that as the input data set grows, the accuracy increases and the volatility gradually decreases. Once the input data set reaches a certain size, further enlarging it hardly improves the accuracy. The reason is that when data are scarce, the feature database is not comprehensive enough and the classifiers can be affected by bad features, which leads to lower and less stable accuracy. When data are adequate, the accuracy of each individual classifier improves and the predictions of the classifiers tend to agree.

3.3.1. Comparison of Feature Selection Schemes

In the comparison experiment on feature selection, the A1, A2, B1, B2, and C1 schemes are used to verify the validity of the feature selection method. The effect of the different schemes on accuracy is shown in Figure 6:

As Figure 6 shows, for every scheme except A1 the accuracy increases as the input data set grows, and the A2 scheme achieves the best accuracy, 0.976. A1 is the only scheme in which the accuracy decreases as the input sample set grows; the reason is analyzed below. For the A2, B1, B2, and C1 schemes, the differences are hard to measure because their accuracies are close, so this paper compares the relationship between accuracy and the number of features in these schemes. Figure 7 shows the relationship between accuracy and the number of features for the feature selection schemes:

For A1, Figure 7 shows that when the input reaches 5000 samples, the training feature database contains more than 8,000 features. An excessive number of features blurs the characteristics of the model and increases the interference from obfuscation features. The parameters used in the B1 and B2 schemes differ between input data sets, and the number of features also changes greatly. For example, with 1000 input samples the number of features is much lower than in the other schemes and the accuracy is also weak, whereas with 2000 input samples these schemes give the best detection results.

Compared with B1 and B2, the C1 scheme is more adaptable: the number of features stays stable at around 1000. To verify the proposed feature selection method and confirm the stability of the number of extracted features, this paper tests the relationship between accuracy and the number of features in all schemes that use this selection method (E1, E2, F1, F2, C1). For comparison with schemes that do not use it, the A1 and A2 schemes are included as well. Figure 8 shows the validation of the feature selection method:

Figure 8 shows that for the schemes that use the C1 feature selection method, the number of selected features is more stable. While the number of features varies little, the final accuracy of the model increases steadily with the size of the input sample set. Because features are selected dynamically, the selection adjusts to the size of the input sample set: as the set grows, the feature selection method refines the selected features and eliminates poor ones.

3.3.2. Obfuscation Value Cleaning Scheme Comparison

Although the number of features used in the model can be kept stable by adjusting the selection method dynamically, the method of cleaning obfuscated features greatly influences the accuracy. Figure 6 shows that the accuracy of the C1 scheme is the lowest, so the final detection accuracy has considerable room for improvement. To evaluate the impact of obfuscation feature cleaning on the model and the differences between cleaning methods, this paper compares five schemes: C1, E1, E2, F1, and F2. In particular, the C1 scheme does not clean obfuscation features, while the other schemes each use a different cleaning method. Figure 9 shows the relationship between model accuracy and input sample set size for each scheme:

As shown in Figure 9, the E1, E2, F1, and F2 schemes achieve higher final model accuracy than the C1 scheme, so the obfuscation feature cleaning method improves accuracy. The F1 and F2 schemes find the obfuscation threshold in the same way, but F1 reduces the value of each obfuscation feature to the threshold, whereas F2 reduces it to zero. In the F1 scheme, the accuracy decreases as the number of samples increases. The reason is that F1 changes the feature distribution of the sample by changing the values of its obfuscated features. Because these values are still greater than zero after the change, the model can learn the altered distribution. When the input sample set is small, this improves the accuracy of the model. As the input sample set continues to grow, however, the altered distribution conflicts with the actual feature distribution and the accuracy decreases. The F2 scheme removes the obfuscation features from the sample distribution entirely. The model does not learn this change, so no artificial effect is introduced and the measured accuracy is more realistic.
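The difference between the two cleaning actions is easy to see in code. The sketch below is an illustration under assumptions: the function names are hypothetical, and the sample counts and threshold are made up.

```python
def clean_f1(counts, xi):
    """F1-style cleaning: cap every value above the threshold at xi."""
    return {feature: min(value, xi) for feature, value in counts.items()}

def clean_f2(counts, xi):
    """F2-style cleaning: set every value above the threshold to zero."""
    return {feature: (0 if value > xi else value)
            for feature, value in counts.items()}

if __name__ == "__main__":
    # Illustrative sample: one ordinary feature and one obfuscation feature.
    counts = {"push mov sub": 3, "nop nop nop": 500}
    xi = 54
    print(clean_f1(counts, xi))   # the large value is capped at the threshold
    print(clean_f2(counts, xi))   # the large value is removed
```

Under F1 the obfuscation feature remains one of the largest values in the sample, which is the altered distribution the text describes; under F2 it no longer contributes at all.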

Because the E1, E2, and F2 schemes all adopt the C1 feature selection method, their detection accuracies and feature counts are close to each other. The random forest selects features randomly, so the more obfuscation features remain in the database, the more unstable the final result. Over repeated experiments, large fluctuations in detection accuracy therefore indicate a worse cleaning scheme. With only a few repetitions the fluctuations are hard to observe, so each of the schemes E1, E2, and F2 is run 100 times, which is enough to observe the result. The cleaning effect of each scheme is measured by comparing the volatility of its final model accuracy. The accuracy over repeated runs for each obfuscation cleaning scheme is shown in Figures 10, 11, and 12:

To quantify the fluctuation in detection accuracy of the schemes in Figures 10-12 on different data sets, this paper uses the standard deviation of the final model accuracy of each scheme. The standard deviation is the square root of the arithmetic mean of the squared deviations from the mean and reflects the dispersion of a data set; two data sets with the same mean may have different standard deviations. The standard deviations of the cleaning schemes for different input sample sizes are listed in Table 3.
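The volatility measure described above corresponds to the population standard deviation. A minimal sketch using the Python standard library follows; the function name is hypothetical and the accuracy lists are made-up values, not results from the paper.

```python
from statistics import mean, pstdev

def accuracy_volatility(accuracies):
    """Population standard deviation of the accuracies of repeated runs.

    This is the square root of the mean squared deviation from the mean.
    """
    return pstdev(accuracies)

if __name__ == "__main__":
    # Two hypothetical series with the same mean but different spread.
    steady = [0.96, 0.96, 0.96, 0.96]
    noisy = [0.94, 0.98, 0.95, 0.97]
    print(mean(steady), accuracy_volatility(steady))
    print(mean(noisy), accuracy_volatility(noisy))
```

A cleaning scheme whose repeated runs behave like the second series would be judged worse, even though its average accuracy is the same.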

4. RELATED WORK

At present, research on malicious code detection focuses mainly on feature extraction. Feature extraction for malicious code (Rieck, Trinius, & Willems, 2011; Li, Santorelli, Laforest, & Coates, 2015; Yong & Zuo, 2007) has been studied thoroughly by previous researchers. To defend against obfuscation techniques, researchers have proposed a variety of feature extraction methods, which approach the problem from security attributes, dependencies, real semantics, and so on. Modern extraction methods focus on the actual semantics of malicious code. Such semantics-based methods can express the actual behavior of malicious code; they are highly interpretable, which makes it easier for analysts to maintain them and to develop detection strategies. The related work is introduced below from two aspects: malicious code detection technology and feature extraction methods.

4.1. Detection Technology

In general, malicious code detection techniques can be divided into two types (Shahzad & Lavesson, 2013): heuristic-based detection and signature-based detection. Early detection methods were mostly heuristic and placed high demands on the experience and judgment of the researchers. For example, the Rootkit Revealer (Willems, Holz, & Freiling, 2007) detection system identifies hidden processes, files, and associated registry information by comparing upper-level system information with the file system status seen from the kernel. The effectiveness of this type of detection system depends on how thoroughly the target system has been studied, and it lacks generality and versatility. Therefore, most detection systems are based on a feature database.

Machine learning models can predict variants well from a feature database, and modern detection systems use machine learning to train classifiers. The detection technique can be divided into three steps (Yin, Song, Egele, Kruegel, & Kirda, 2007; Narudin, Feizollah, Anuar, & Gani, 2014). The first step is to extract the features of the malicious code. The second is to remove obfuscation features, fuse the features, and build a low-dimensional feature database. Finally, machine learning algorithms use the feature database to train a classifier and identify the class of the malicious code.

A signature-based malicious code detection system (Abawajy, Kelarev, & Chowdhury, 2014) constructs the feature database of a kind of malicious code by extracting the morphological features of the executable binary file. The same features are extracted from the code to be detected, which is then checked by pattern matching against the feature database. However, this method can only match a single piece of malicious code. As the number of samples increases, the feature set often becomes very large, which hinders the automatic detection of malicious code in the future. With the progress of disassembly technology, hidden sensitive strings, special methods, and resource calls in malicious code samples can be found by combining disassembly with dynamic analysis. Researchers have therefore proposed detection methods based on these behavioral features. Behavior-based detection methods pay more attention to the actual behavior of malicious code and are better able to withstand obfuscation.

4.2. Malicious Code Feature Extraction

A single feature extraction method offers poor resistance to interference and poor stability. Once the extraction method is compromised, the whole detection system generally fails. For this reason, some scholars (Kirda, Kruegel, Banks, Vigna, & Kemmerer, 2006) proposed extracting multiple features from malicious code and fusing them, which yields more robust, interference-resistant core features for a malicious code family. Because malicious code cannot fully control the fused features, it is difficult for it to influence the final result.

Kirda and others (Rui, Feng, Yi, & Pu-Rui, 2012) detect spyware by its behavior of obtaining user-sensitive data and then leaking it. However, this method is limited to spyware, and malicious code that does not cause data disclosure cannot be detected. Starting from the actual semantics of malicious code, Rui et al. (2012) and others construct a behavioral feature map based on malicious code semantics and use it to identify malicious code, achieving very good detection results. However, this method analyzes only the program itself. Because it does not take the program's resource calls into account, it cannot recognize some special variants of malicious code.

Mao et al. (2017) and others (Zhang, Ren, Jiang, & Zhang, 2015) proposed an active learning method that addresses malicious code detection when few labeled samples are available. Its feature extraction mainly constructs a system data flow dependency graph from the resource scheduling relations of the sample, that is, from the perspective of the malicious code's resource calls. The method uses a large amount of normal software behavior data and a small amount of labeled malicious code. Active learning then updates the malicious code classifier so that normal software is distinguished from malicious software, which allows new malicious code to be identified in time even when only a small number of samples is available. This kind of method mainly describes features in terms of security attributes and statistical features. However, this form of feature expression is not comprehensive enough for the future analysis of malicious code.

Nataraj and Karthikeyan (2011) first proposed the concept of malicious images. The executable binary file of a malicious code sample is mapped to a grayscale image. The method relies on the observation that programs in the same malicious code family reuse some of the family's historical resource files, so the resulting images are similar. It requires little computation and has strong resistance to interference. However, as malicious code grows more complex, image-based detection can also be bypassed or disturbed, for example by obfuscating resource calls. Han, Qu, Yao, Guo, and Zhou (2014) and Han, Yao, Wu, and Guo (2014) applied image analysis techniques to extract fingerprint features from the image, using the gray-level co-occurrence matrix algorithm to extract the texture fingerprint of the malicious code.
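The byte-to-image mapping and a gray-level co-occurrence matrix can be sketched in a few lines. This is a simplified illustration and not the cited authors' implementation: the fixed image width, the quantization to four gray levels, the single horizontal offset, and the use of contrast as the only texture statistic are assumptions.

```python
def bytes_to_image(data: bytes, width: int = 4):
    """Map each byte to one gray pixel (0-255), row by row.

    A trailing partial row is dropped so that the image is rectangular.
    """
    rows = len(data) // width
    return [list(data[r * width:(r + 1) * width]) for r in range(rows)]

def glcm(image, levels: int = 4, dx: int = 1, dy: int = 0):
    """Gray-level co-occurrence matrix for the pixel offset (dx, dy).

    Pixels are first quantized from 256 gray values down to `levels` bins.
    Entry [i][j] counts pixel pairs whose first pixel has level i and whose
    neighbor at the given offset has level j.
    """
    quantized = [[p * levels // 256 for p in row] for row in image]
    matrix = [[0] * levels for _ in range(levels)]
    height = len(quantized)
    width = len(quantized[0]) if height else 0
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                matrix[quantized[y][x]][quantized[ny][nx]] += 1
    return matrix

def contrast(matrix) -> float:
    """Contrast texture statistic: mean of (i - j)^2 over all counted pairs."""
    total = sum(map(sum, matrix))
    if total == 0:
        return 0.0
    size = len(matrix)
    return sum((i - j) ** 2 * matrix[i][j]
               for i in range(size) for j in range(size)) / total

# Hypothetical 8-byte "binary" rendered as a 2 x 4 image.
image = bytes_to_image(bytes([0, 0, 255, 255, 0, 0, 255, 255]), width=4)
texture = contrast(glcm(image))
```

Statistics such as this contrast value, computed for several offsets, can serve as a compact texture fingerprint that is compared across samples.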

5. SUMMARY

When a feature extraction method is broken, the feature database fills with obfuscation features and malware variants escape detection. This paper proposes a new malicious code detection model. Our feature database is built with the n-gram feature extraction method. At the same time, obfuscation feature cleaning and feature selection eliminate the effects of obfuscation techniques on the database. In particular, our method can dynamically determine obfuscation features and adjust the selection of family features. In this way, the method ensures the effectiveness of the feature database and improves the accuracy of variant detection.

When the value of an obfuscation feature is close to that of the family features, the cleaning method cannot remove it effectively. Nevertheless, most obfuscation features are removed by the cleaning method, and the random forest algorithm further reduces the influence of those that remain. To select features more rationally, this paper proposes a dynamic selection method based on the sample set. This method stabilizes the number of features in the database, while the features themselves still change as the number of input samples increases.

Although the new malicious code detection model proposed in this paper can eliminate most obfuscation features, it still has some shortcomings. For example, it runs slowly and depends heavily on a large amount of labeled data. In addition, the database is very large, and it is difficult for the model to learn and detect new malware families that do not have much labeled data. Therefore, how to handle malware families with few samples will be addressed in follow-up work. The Spark platform can process data in parallel, so it is worthwhile to combine the detection model with Spark. Because our method takes considerable time to process a single sample, model detection and sample feature extraction need to be separated. Online detection also has strict real-time requirements and must handle multiple sample files simultaneously. Therefore, the future model will use the Spark platform to process the sample set, extract features, and return the feature vector. The classifier will then evaluate the feature vector, which improves the response speed of the model. We will also test the performance of the model on more malware sets.