§ Browsing ETD Metadata
System No. U0002-1802202211564000
Title (in Chinese) 使用變形棧式降噪自動編碼器之邏輯斯迴歸抵擋混淆的惡意JavaScript程式偵測
Title (in English) Malicious JavaScript Programs Detection Against Obfuscations Using Variant Stacked Denoising Autoencoders with Logistic Regression
Other Title
Institution Tamkang University (淡江大學)
Department (in Chinese) 資訊工程學系碩士班
Department (in English) Department of Computer Science and Information Engineering
Other Division Cyber Security and Network
Other Division Name Queensland University of Technology
Other Department/Institution Master of Information Technology
Academic Year 110
Semester 1
Publication Year 111
Author's name (in Chinese) 陳子平
Author's name (in English) Tzu-Ping Chen
Student ID 607410502
Degree Master's (碩士)
Language English
Other Language
Date of Oral Defense 2022-01-11
Pagination 34 pages
Committee Member advisor - Shin-Jia Hwang
co-chair - Sung-Ming Yen
co-chair - Ren Junn Hwang
Keyword (in Chinese) 惡意程式偵測
Keyword (in English) Malware Detection
Stacked Denoising Autoencoder
Feature Learning
Other Keywords
Abstract (in Chinese)
JavaScript 的惡意程式通常利用混淆的手法,以躲避惡意軟體的偵測。經過混淆的JavaScript 惡意程式,其內容通常會變得不同,導致功能相似的惡意程式的文字的特徵變成不一樣。但是這些混淆過與原始的程式在語意上的特徵仍可能具有相似性。為了使混淆過與原始的程式,即使經過混淆後,也能夠產生相似的特徵,以增進偵測方法的效能,本論文提出了初步解混淆的前處理方法,以及整合變形棧式降噪自動編碼器之邏輯斯迴歸(Stacked denoising Autoencoder–Logistic Regression ,簡稱SdA-LR) 以偵測可能被混淆惡意程式。為了使得我們變形的降噪自動編碼器(簡稱SdA) 能夠提取有相似性的特徵,改變加入噪音的方式,變成隨機的混淆原始的程式碼,再以非監督式的方式訓練我們的變形SdA模型。接著整合變形的棧式降噪自動編碼器與使用softmax函數之邏輯斯迴歸成SdA-LR,用以偵測惡意的JavaScript。我們的方法能夠有高準確率並足以抵抗混淆的手法。同時,我們的變形SdA能夠提取有相似性的特徵,以提高偵測方法的準確率,並抵擋未知的混淆手法的搗亂。另外,透過縮減我們模型的隱藏層神經元數量,保持相近的準確率前提下,減少需要消耗的計算資源。
Abstract (in English)
Malicious JavaScript programs often use obfuscation to evade malware detection. Obfuscation usually changes the content of a program, so malicious programs with similar functions end up with distinct lexical features. However, their semantic features may remain similar. To enhance detection, this thesis proposes a deobfuscation preprocessing step and a variant of the Stacked denoising Autoencoder-Logistic Regression (SdA-LR) model. To extract similar features from obfuscated malware and its original form, the noise-adding step of the denoising autoencoder is replaced with random obfuscation of the original JavaScript code, and the variant SdA model is then trained in an unsupervised manner. The SdA is combined with a logistic regression layer that uses the softmax function to form the variant SdA-LR model, which detects whether a JavaScript program is benign or malicious. The results show that the variant SdA-LR achieves high accuracy against obfuscation. The variant SdA extracts features with similarities, which improves detection performance and also lets the model resist unknown obfuscation techniques. To reduce the computational resources required, the number of neurons in the hidden layers can be shrunk layer by layer while the accuracy remains acceptable.
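The training idea in the abstract, replacing the usual random noise of a denoising autoencoder with an obfuscated copy of the input, can be illustrated with a minimal sketch. This is not the thesis implementation: it uses a single autoencoder layer rather than a stack, omits the logistic regression stage, and works on toy binary vectors instead of AST feature vectors. The class name PairedDenoisingAutoencoder and the helper toy_obfuscate (which merely flips one random bit as a stand-in for running a real obfuscator and re-extracting features) are hypothetical names introduced only for this illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class PairedDenoisingAutoencoder:
    """One autoencoder layer trained to map a corrupted input back to its clean target."""

    def __init__(self, n_visible, n_hidden, seed=0):
        rng = random.Random(seed)
        # Tied weights: the decoder reuses W (transposed) instead of its own matrix.
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b_hidden = [0.0] * n_hidden
        self.b_visible = [0.0] * n_visible

    def encode(self, x):
        return [sigmoid(sum(x[i] * self.W[i][j] for i in range(len(x))) + self.b_hidden[j])
                for j in range(len(self.b_hidden))]

    def decode(self, h):
        return [sigmoid(sum(h[j] * self.W[i][j] for j in range(len(h))) + self.b_visible[i])
                for i in range(len(self.b_visible))]

    def train_step(self, corrupted, clean, lr=0.5):
        """One SGD step on the squared error between decode(encode(corrupted)) and clean."""
        h = self.encode(corrupted)
        z = self.decode(h)
        # Gradient with respect to the decoder pre-activations.
        dz = [(z[i] - clean[i]) * z[i] * (1.0 - z[i]) for i in range(len(z))]
        # Backpropagate through the tied weights to the encoder pre-activations.
        dh = [sum(dz[i] * self.W[i][j] for i in range(len(z))) * h[j] * (1.0 - h[j])
              for j in range(len(h))]
        for i in range(len(z)):
            for j in range(len(h)):
                # A tied weight receives gradient from both the decoder and the encoder.
                self.W[i][j] -= lr * (dz[i] * h[j] + corrupted[i] * dh[j])
            self.b_visible[i] -= lr * dz[i]
        for j in range(len(h)):
            self.b_hidden[j] -= lr * dh[j]
        return 0.5 * sum((z[i] - clean[i]) ** 2 for i in range(len(z)))


def toy_obfuscate(vec, rng, flips=1):
    """Hypothetical stand-in for obfuscation: flip a few randomly chosen bits."""
    out = list(vec)
    for i in rng.sample(range(len(out)), flips):
        out[i] = 1 - out[i]
    return out


rng = random.Random(1)
# Toy "feature vectors" of original (unobfuscated) programs.
clean_vectors = [[1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1], [1, 1, 0, 0, 1, 1]]

dae = PairedDenoisingAutoencoder(n_visible=6, n_hidden=4)
losses = []
for epoch in range(500):
    total = 0.0
    for clean in clean_vectors:
        # The corrupted input is an "obfuscated" copy; the target is the original.
        total += dae.train_step(toy_obfuscate(clean, rng), clean)
    losses.append(total)

# Hidden activations serve as the learned features for a downstream classifier.
features = dae.encode(clean_vectors[0])
```

In a stacked setup, the hidden activations of one trained layer would become the input of the next, and a softmax classifier would be placed on top for supervised fine-tuning. Those stages are left out here to keep the sketch short.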
Other Abstract
Table of Contents (with Page Numbers)
Chapter 1  Introduction	1
Chapter 2  Related Work	5
	2.1 Four Obfuscation Categories	5
	2.2 Previous Works Against Obfuscation	6
	2.3 Autoencoders and the Variants	7
Chapter 3  Our Detection Model	10
	3.1 Our Model: Overview and Architectures	10
	3.2 Data Preprocessing	11
	3.3 AST Feature Vectors Generation	12
	3.4 Unsupervised Pre-Training SdA	13
	3.5 Supervised Fine-Tuning SdA-LR Detectors	15
Chapter 4  Our Experiments	17
	4.1 Experiment Setup	17
	4.2 The Size of Randomly Projected Vectors	18
	4.3 The Number of Neurons of Hidden Layers	20
	4.4 The Numbers of Hidden Layers	21
	4.5 The Epochs to Train SdA	23
	4.6 The Epochs to Fine-Tune SdA-LR	24
	4.7 Comparing with a Different Hidden Layer Structure of Our SdA	26
	4.8 An Evaluation of Our Variant Model	27
	4.9 Evaluating the Improvement of Our SdA Feature Extraction	27
	4.10 Comparing to the Work of Wang et al. [16]	28
Chapter 5  Conclusion	30
References	31

List of Figures
Fig. 1 : An Example of an Obfuscated Program Using Random Obfuscation Techniques.	3
Fig. 2 : The Structure of an Autoencoder.	8
Fig. 3 : The Structure of a Denoising Autoencoder.	9
Fig. 4 : The Architecture of Our Model.	11
Fig. 5 : The Structure of Our SdA.	14
Fig. 6 : The Structure of Our SdA-LR.	16
Fig. 7 : The Performance of Different Projected Vector Sizes.	20
Fig. 8 : The Performance of Different Hidden Layer Sizes.	21
Fig. 9 : The Performance of Different Numbers of Hidden Layers.	22
Fig. 10 : The Performance of SdA Trained for Different Epochs.	24
Fig. 11 : The Performance of SdA-LR Using Fine-tuning for Different Epochs.	26

List of Tables
Table 1 : The Performance of Different Randomly Projected Vector Sizes.	19
Table 2 : The Performance of Different Hidden Layer Sizes.	21
Table 3 : The Performance of Different Numbers of Hidden Layers.	22
Table 4 : The Performance of Different SdA Training Epochs of Hidden Layer.	24
Table 5 : The Performance of Different Fine-tuning Epochs of SdA-LR.	25
Table 6 : The Results of Our SdA-LR and SdA-LR with Reduced Neuron Numbers.	27
Table 7 : The Performance Comparing with Different Methods.	27
Table 8 : The Performance of Our SVM, Our SdA-SVM, and Our LR.	28
Table 9 : A Comparison with the Work of [16].	29

[1]	Global Data at Risk 2020 State of the Web Report. Accessed: Nov. 13, 2021. [Online] Available: https://go.talasecurity.io/hubfs/Content/White%20Papers%20and%20Reports/_Global%20Data%20at%20Risk_2020%20State%20of%20the%20Web%20Report_.pdf
[2]	JavaScript. Accessed: Nov. 2021. [Online] Available: https://developer.mozilla.org/en-US/docs/Web/JavaScript
[3]	M. F. Sohan and A. Basalamah, "A Systematic Literature Review and Quality Analysis of Javascript Malware Detection," IEEE Access, vol. 8, pp. 190539-190552, 2020.
[4]	Y. Fang, C. Huang, Y. Su, and Y. Qiu, "Detecting malicious JavaScript code based on semantic analysis," Computers & Security, vol. 93, p. 101764, 2020.
[5]	P. Skolka, C.-A. Staicu, and M. Pradel, "Anything to hide? Studying minified and obfuscated code in the web," in The World Wide Web Conference, 2019, pp. 1735-1746.
[6]	M. Yousefi-Azar, V. Varadharajan, L. Hamey, and U. Tupakula, "Autoencoder-based feature learning for cyber security applications," in 2017 International joint conference on neural networks (IJCNN), 2017, pp. 3854-3861.
[7]	W. Xu, F. Zhang, and S. Zhu, "Jstill: mostly static detection of obfuscated malicious javascript code," in Proceedings of the third ACM conference on Data and application security and privacy, 2013, pp. 117-128.
[8]	D. R. Patil and J. Patil, "Detection of malicious javascript code in web pages," Indian Journal of Science and Technology, vol. 10, pp. 1-12, 2017.
[9]	S. Ndichu, S. Ozawa, T. Misu, and K. Okada, "A machine learning approach to malicious JavaScript detection using fixed length vector representation," in 2018 International Joint Conference on Neural Networks (IJCNN), 2018, pp. 1-8.
[10]	J. W. Stokes, R. Agrawal, G. McDonald, and M. Hausknecht, "Scriptnet: Neural static analysis for malicious javascript detection," in MILCOM 2019-2019 IEEE Military Communications Conference (MILCOM), 2019, pp. 1-8.
[11]	X. Song, C. Chen, B. Cui, and J. Fu, "Malicious JavaScript Detection Based on Bidirectional LSTM Model," Applied Sciences, vol. 10, p. 3440, 2020.
[12]	A. Fass, M. Backes, and B. Stock, "Jstap: A static pre-filter for malicious javascript detection," in Proceedings of the 35th Annual Computer Security Applications Conference, 2019, pp. 257-269.
[13]	S. Ndichu, S. Kim, S. Ozawa, T. Misu, and K. Makishima, "A machine learning approach to detection of JavaScript-based attacks using AST features and paragraph vectors," Applied Soft Computing, vol. 84, p. 105721, 2019.
[14]	W. Xu, F. Zhang, and S. Zhu, "The power of obfuscation techniques in malicious JavaScript code: A measurement study," in 2012 7th International Conference on Malicious and Unwanted Software, 2012, pp. 9-16.
[15]	S. Ndichu, S. Kim, and S. Ozawa, "Deobfuscation, unpacking, and decoding of obfuscated malicious JavaScript for machine learning models detection performance improvement," CAAI Transactions on Intelligence Technology, vol. 5, pp. 184-192, 2020.
[16]	Y. Wang, W.-d. Cai, and P.-c. Wei, "A deep learning approach for detecting malicious JavaScript code," Security and Communication Networks, vol. 9, pp. 1520-1534, 2016.
[17]	js-beautifier. Accessed: Nov. 2021. [Online] Available: https://github.com/beautify-web/js-beautify
[18]	Esprima. Accessed: Nov. 2021. [Online] Available: https://esprima.org/
[19]	P. Li, T. J. Hastie, and K. W. Church, "Very sparse random projections," in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006, pp. 287-296.
[20]	JavaScript obfuscator. Accessed: Nov. 2021. [Online] Available: https://github.com/javascript-obfuscator/javascript-obfuscator
[21]	K. Kayabol, "Approximate sparse multinomial logistic regression for classification," IEEE transactions on pattern analysis and machine intelligence, vol. 42, pp. 490-493, 2019.
[22]	J. L. L. Herrera, H. V. R. Figueroa, and E. J. R. Ramírez, "Deep fraud. A fraud intention recognition framework in public transport context using a deep-learning approach," in 2018 International Conference on Electronics, Communications and Computers (CONIELECOMP), 2018, pp. 118-125.
[23]	V. Raychev, P. Bielik, M. Vechev, and A. Krause, "Learning programs from noisy data," ACM Sigplan Notices, vol. 51, pp. 761-774, 2016.

Terms of Use
National Central Library
Agreed to grant my copyright without royalty to NCL, and the electronic file of the bibliography and full text will be published on the Internet on 2022-02-24.
Within Campus
On-campus access to my hard copy thesis/dissertation is open immediately
Agree to authorize disclosure on campus
Release immediately
Outside the Campus
I grant the authorization for the public to view/print my electronic full text with royalty fee and I donate the fee to my school library as a development fund.
Duration for delaying release from 2024-02-18. Synchronize delayed electronic full text
