§ Thesis Bibliographic Record
  
System ID U0002-1802202211564000
DOI 10.6846/TKU.2022.00434
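Title (Chinese) 使用變形棧式降噪自動編碼器之邏輯斯迴歸抵擋混淆的惡意JavaScript程式偵測
Title (English) Malicious JavaScript Programs Detection Against Obfuscations Using Variant Stacked Denoising Autoencoders with Logistic Regression
Title (third language)
University Tamkang University
Department (Chinese) 資訊工程學系碩士班
Department (English) Department of Computer Science and Information Engineering
Foreign degree university Cyber Security and Network
Foreign degree college Queensland University of Technology
Foreign degree graduate institute Master of Information Technology
Academic year 110 (ROC calendar)
Semester 1
Publication year 111 (ROC calendar; 2022)
Graduate student (Chinese) 陳子平
Graduate student (English) Tzu-Ping Chen
Student ID 607410502
Degree Master's
Language English
Second language
Oral defense date 2022-01-11
Number of pages 34
Oral defense committee Advisor - 黃心嘉 (sjhwang@gms.tku.edu.tw)
Committee member - 顏嵩銘
Committee member - 黃仁俊
Keywords (Chinese) 惡意程式偵測 (malware detection)
棧式降噪自動編碼器 (stacked denoising autoencoder)
特徵學習 (feature learning)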
JavaScript
Keywords (English) Malware Detection
Stacked denoised Autoencoder
Feature Learning
JavaScript
Keywords (third language)
Subject classification
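Abstract (Chinese)
Malicious JavaScript programs usually rely on obfuscation to evade malware detection. After obfuscation, the content of a malicious JavaScript program usually changes, so the textual features of malicious programs with similar functions become different. However, the semantic features of the obfuscated and original programs may still be similar. So that obfuscated and original programs yield similar features even after obfuscation, and detection performance improves, this thesis proposes a preliminary deobfuscation preprocessing method and integrates a variant stacked denoising autoencoder with logistic regression (Stacked denoising Autoencoder-Logistic Regression, SdA-LR for short) to detect possibly obfuscated malicious programs. So that our variant stacked denoising autoencoder (SdA for short) can extract similar features, the way noise is added is changed to randomly obfuscating the original code, and our variant SdA model is then trained in an unsupervised manner. Next, the variant SdA is integrated with logistic regression using the softmax function to form SdA-LR, which is used to detect malicious JavaScript. Our method achieves high accuracy and is able to resist obfuscation techniques. Our variant SdA can also extract similar features, which improves detection accuracy and resists disruption by unknown obfuscation techniques. In addition, reducing the number of neurons in the hidden layers of our model lowers the required computational resources while keeping accuracy similar.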
Abstract (English)
JavaScript malware often uses obfuscation to evade malware detection. Obfuscation changes the content of a malicious script, so the lexical features of JavaScript malware with similar functions become distinct. However, their semantic features may remain similar. In this thesis, a deobfuscation preprocessing step and our variant of the Stacked denoising Autoencoder-Logistic Regression (SdA-LR, for short) model are proposed to enhance detection. To extract features that are similar between obfuscated malware and the original, the noise-adding process is replaced with random obfuscation of the original JavaScript, and our variant SdA model is trained in an unsupervised way. The SdA is then combined with a Logistic Regression layer to form our variant SdA-LR model, which detects whether a JavaScript program is benign or malicious. As a result, our variant SdA-LR achieves high accuracy against obfuscation. Our variant SdA is useful for extracting similar features, which enhances detection performance. Our model also resists unknown obfuscations thanks to our variant SdA. To reduce computational resources, our model can also shrink the number of neurons in the hidden layers, layer by layer, while retaining acceptable accuracy.
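The training idea in the abstract can be sketched in a few lines of Python. This is only a minimal illustration under loud assumptions, not the thesis implementation: it uses a single tied-weight autoencoder layer instead of a stack, the 16-dimensional binary vectors stand in for AST feature vectors, random bit flips stand in for the feature vectors of obfuscated variants, and the softmax logistic regression is fitted on frozen hidden features rather than fine-tuned end to end. All function names, sizes, and hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_dae_layer(clean, corrupted, n_hidden, epochs=300, lr=0.2, seed=0):
    """One denoising-autoencoder layer with tied weights (hypothetical helper).
    The encoder sees `corrupted` rows; the loss compares the reconstruction
    with the matching `clean` rows."""
    rng = np.random.default_rng(seed)
    n, d = clean.shape
    W = rng.normal(0.0, 0.1, size=(d, n_hidden))
    b_h, b_o = np.zeros(n_hidden), np.zeros(d)
    losses = []
    for _ in range(epochs):
        h = sigmoid(corrupted @ W + b_h)      # encode the corrupted input
        r = sigmoid(h @ W.T + b_o)            # decode with the transposed weights
        losses.append(float(np.mean((r - clean) ** 2)))
        # Backpropagate the squared error (gradient averaged over rows only).
        d_r = 2.0 * (r - clean) * r * (1.0 - r) / n
        d_h = (d_r @ W) * h * (1.0 - h)
        W -= lr * (corrupted.T @ d_h + d_r.T @ h)
        b_h -= lr * d_h.sum(axis=0)
        b_o -= lr * d_r.sum(axis=0)
    return W, b_h, losses

def train_softmax_lr(features, labels, n_classes=2, epochs=300, lr=0.1):
    """Softmax logistic regression trained by full-batch gradient descent."""
    n, k = features.shape
    V, c = np.zeros((k, n_classes)), np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    losses = []
    for _ in range(epochs):
        p = softmax(features @ V + c)
        losses.append(float(-np.mean(np.log(p[np.arange(n), labels]))))
        g = (p - onehot) / n                  # gradient of mean cross-entropy
        V -= lr * features.T @ g
        c -= lr * g.sum(axis=0)
    return V, c, losses

# Toy data: two prototype vectors stand in for benign (0) and malicious (1) scripts.
rng = np.random.default_rng(0)
prototypes = np.array([[1.0] * 8 + [0.0] * 8, [0.0] * 8 + [1.0] * 8])
labels = np.repeat([0, 1], 20)
clean = prototypes[labels]
# Placeholder for "obfuscated" variants: flip each entry with probability 0.2.
flip = rng.random(clean.shape) < 0.2
corrupted = np.where(flip, 1.0 - clean, clean)

W, b_h, dae_losses = train_dae_layer(clean, corrupted, n_hidden=8)
hidden = sigmoid(corrupted @ W + b_h)         # features passed to the classifier
V, c, lr_losses = train_softmax_lr(hidden, labels)
probs = softmax(hidden @ V + c)               # per-class probabilities per script
```

The point of the sketch is the pairing in `train_dae_layer`: the input is the altered variant and the reconstruction target is the original, which is what pushes variants of the same program toward similar hidden features.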
Abstract (third language)
Table of Contents
Chapter 1 Introduction
Chapter 2 Related Work
	2.1 Four Obfuscation Categories
	2.2 Previous Works Against Obfuscation
	2.3 Autoencoders and the Variants
Chapter 3 Our Detection Model
	3.1 Our Model: Overview and Architectures
	3.2 Data Preprocessing
	3.3 AST Feature Vectors Generation
	3.4 Unsupervised Pre-Training SdA
	3.5 Supervised Fine-Tuning SdA-LR Detectors
Chapter 4 Our Experiments
	4.1 Experiment Setup
	4.2 The Size of Randomly Projected Vectors
	4.3 The Number of Neurons of Hidden Layers
	4.4 The Number of Hidden Layers
	4.5 The Epochs to Train SdA
	4.6 The Epochs to Fine-Tune SdA-LR
	4.7 Comparing with a Different Hidden Layer Structure of Our SdA
	4.8 An Evaluation of Our Variant Model
	4.9 Evaluating the Improvement of Our SdA Feature Extraction
	4.10 Comparing with the Work of Wang et al. [16]
Chapter 5 Conclusion
References

List of Figures
Fig. 1: An Example of an Obfuscated Program Using Random Obfuscation Techniques.
Fig. 2: The Structure of an Autoencoder.
Fig. 3: The Structure of a Denoising Autoencoder.
Fig. 4: The Architecture of Our Model.
Fig. 5: The Structure of Our SdA.
Fig. 6: The Structure of Our SdA-LR.
Fig. 7: The Performance of Different Projected Vector Sizes.
Fig. 8: The Performance of Different Hidden Layer Sizes.
Fig. 9: The Performance of Different Numbers of Hidden Layers.
Fig. 10: The Performance of SdA Trained for Different Epochs.
Fig. 11: The Performance of SdA-LR Using Fine-tuning for Different Epochs.

List of Tables
Table 1: The Performance of Different Randomly Projected Vector Sizes.
Table 2: The Performance of Different Hidden Layer Sizes.
Table 3: The Performance of Different Numbers of Hidden Layers.
Table 4: The Performance of Different SdA Training Epochs of Hidden Layer.
Table 5: The Performance of Different Fine-tuning Epochs of SdA-LR.
Table 6: The Results of Our SdA-LR and SdA-LR with Reduced Neuron Numbers.
Table 7: The Performance Comparing with Different Methods.
Table 8: The Performance of Our SVM, Our SdA-SVM, and Our LR.
Table 9: A Comparison with the Work in [16].




References
[1]	Global Data at Risk 2020 State of the Web Report. Accessed: Nov. 13, 2021. [Online] Available: https://go.talasecurity.io/hubfs/Content/White%20Papers%20and%20Reports/_Global%20Data%20at%20Risk_2020%20State%20of%20the%20Web%20Report_.pdf
[2]	JavaScript. Accessed: Nov. 2021. [Online] Available: https://developer.mozilla.org/en-US/docs/Web/JavaScript
[3]	M. F. Sohan and A. Basalamah, "A Systematic Literature Review and Quality Analysis of JavaScript Malware Detection," IEEE Access, vol. 8, pp. 190539-190552, 2020.
[4]	Y. Fang, C. Huang, Y. Su, and Y. Qiu, "Detecting malicious JavaScript code based on semantic analysis," Computers & Security, vol. 93, p. 101764, 2020.
[5]	P. Skolka, C.-A. Staicu, and M. Pradel, "Anything to hide? Studying minified and obfuscated code in the web," in The World Wide Web Conference, 2019, pp. 1735-1746.
[6]	M. Yousefi-Azar, V. Varadharajan, L. Hamey, and U. Tupakula, "Autoencoder-based feature learning for cyber security applications," in 2017 International joint conference on neural networks (IJCNN), 2017, pp. 3854-3861.
[7]	W. Xu, F. Zhang, and S. Zhu, "JStill: mostly static detection of obfuscated malicious JavaScript code," in Proceedings of the third ACM conference on Data and application security and privacy, 2013, pp. 117-128.
[8]	D. R. Patil and J. Patil, "Detection of malicious JavaScript code in web pages," Indian Journal of Science and Technology, vol. 10, pp. 1-12, 2017.
[9]	S. Ndichu, S. Ozawa, T. Misu, and K. Okada, "A machine learning approach to malicious JavaScript detection using fixed length vector representation," in 2018 International Joint Conference on Neural Networks (IJCNN), 2018, pp. 1-8.
[10]	J. W. Stokes, R. Agrawal, G. McDonald, and M. Hausknecht, "ScriptNet: Neural static analysis for malicious JavaScript detection," in MILCOM 2019-2019 IEEE Military Communications Conference (MILCOM), 2019, pp. 1-8.
[11]	X. Song, C. Chen, B. Cui, and J. Fu, "Malicious JavaScript Detection Based on Bidirectional LSTM Model," Applied Sciences, vol. 10, p. 3440, 2020.
[12]	A. Fass, M. Backes, and B. Stock, "JStap: A static pre-filter for malicious JavaScript detection," in Proceedings of the 35th Annual Computer Security Applications Conference, 2019, pp. 257-269.
[13]	S. Ndichu, S. Kim, S. Ozawa, T. Misu, and K. Makishima, "A machine learning approach to detection of JavaScript-based attacks using AST features and paragraph vectors," Applied Soft Computing, vol. 84, p. 105721, 2019.
[14]	W. Xu, F. Zhang, and S. Zhu, "The power of obfuscation techniques in malicious JavaScript code: A measurement study," in 2012 7th International Conference on Malicious and Unwanted Software, 2012, pp. 9-16.
[15]	S. Ndichu, S. Kim, and S. Ozawa, "Deobfuscation, unpacking, and decoding of obfuscated malicious JavaScript for machine learning models detection performance improvement," CAAI Transactions on Intelligence Technology, vol. 5, pp. 184-192, 2020.
[16]	Y. Wang, W.-D. Cai, and P.-C. Wei, "A deep learning approach for detecting malicious JavaScript code," Security and Communication Networks, vol. 9, pp. 1520-1534, 2016.
[17]	js-beautifier. Accessed: Nov. 2021. [Online] Available: https://github.com/beautify-web/js-beautify
[18]	Esprima. Accessed: Nov. 2021. [Online] Available: https://esprima.org/
[19]	P. Li, T. J. Hastie, and K. W. Church, "Very sparse random projections," in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006, pp. 287-296.
[20]	JavaScript obfuscator. Accessed: Nov. 2021. [Online] Available: https://github.com/javascript-obfuscator/javascript-obfuscator
[21]	K. Kayabol, "Approximate sparse multinomial logistic regression for classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, pp. 490-493, 2019.
[22]	J. L. L. Herrera, H. V. R. Figueroa, and E. J. R. Ramírez, "Deep fraud. A fraud intention recognition framework in public transport context using a deep-learning approach," in 2018 International Conference on Electronics, Communications and Computers (CONIELECOMP), 2018, pp. 118-125.
[23]	V. Raychev, P. Bielik, M. Vechev, and A. Krause, "Learning programs from noisy data," ACM Sigplan Notices, vol. 51, pp. 761-774, 2016.

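Full-Text Access Rights
National Central Library
Royalty-free license granted to the National Central Library; the bibliographic record and electronic full text are publicly available on the Internet from 2022-02-24
On campus
Print thesis available on campus immediately
Electronic full text licensed for on-campus access
Electronic thesis available on campus immediately
Off campus
License granted
Off-campus release of the electronic thesis is deferred until 2024-02-18 (electronic full text delayed)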
