Electronic Theses and Dissertations Service

§ Browse Thesis Bibliographic Record

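The electronic full text of this thesis has been publicly available off campus since 2017-07-05.
The print copy of this thesis has been publicly available since 2017-07-05.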

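System ID	U0002-2606201716585000
DOI	10.6846/TKU.2017.00933
Title (Chinese)	適用於分類變數資料的二元不平衡資料自動分類系統
Title (English)	Automatic Binary Classification System for Imbalanced Data with Categorical Explanatory Variables
Title (third language)
Institution	Tamkang University (淡江大學)
Department (Chinese)	統計學系應用統計學碩士班 (Master's Program in Applied Statistics, Department of Statistics)
Department (English)	Department of Statistics
Foreign degree: university name
Foreign degree: college name
Foreign degree: graduate institute name
Academic year	105 (ROC calendar)
Semester	2
Year of publication	106 (ROC calendar; 2017)
Author (Chinese)	葉丞峻
Author (English)	Cheng-Chun Yeh
Student ID	604650084
Degree	Master's
Language	Traditional Chinese
Second language
Oral defense date	2017-06-17
Number of pages	43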
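Oral defense committee	Advisor: 陳景祥; Committee member: 陳景祥; Committee member: 李百靈; Committee member: 何宗武
Keywords (Chinese)	類別不平衡; 資料探勘; 分類技術; 資料複雜度
Keywords (English)	imbalanced data; data mining; classifier; data complexity
Keywords (third language)
Subject classification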
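Abstract (translated from Chinese)	With advances in technology, many industries have adopted automated workflows, making daily life more convenient and efficient. Bringing the concept of automation into the data analysis process would likewise reduce the analyst's workload. Drawing on the effects of data complexity measures on common classification techniques, this study targets class-imbalanced data in binary classification and applies three different resampling methods to balance the classes, with the aim of building an effective automatic binary classification system for class-imbalanced data. The results show that the proposed method can effectively select the best classification technique. Overall, logistic regression performs comparatively well on imbalanced binary classification problems.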
Abstract (English)	As technology advances, many industries have adopted automated operations, making human life easier and more efficient. Bringing the concept of automation into data analysis can likewise reduce the burden on the data analyst. In this study, the influence of data complexity indices on common classifiers is evaluated, and three different resampling methods are applied to binary imbalanced data. The results show that the proposed procedure can effectively select the best classifier. For binary classification of imbalanced data, logistic regression performs comparatively well overall.
Abstract (third language)
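Table of contents
	Table of Contents I
	List of Tables III
	List of Figures IV
	Chapter 1 Introduction 1
		Section 1 Research Background 1
		Section 2 Research Motivation and Objectives 2
		Section 3 Thesis Organization 3
	Chapter 2 Literature Review 4
		Section 1 Data Complexity Measures 4
			1. F1 (maximum Fisher's discriminant ratio) 4
			2. L1 (minimized objective value of a linear program) 5
			4. N2 (ratio of average intra-class to inter-class nearest-neighbor distance) 7
			5. C1 (degree of class balance) 8
			6. Recommendations for Choosing a Classification Technique 9
		Section 2 Classification Techniques 10
			1. CART decision tree 10
			2. Nearest-neighbor method 11
			3. Naive Bayes classifier 12
			4. Linear discriminant analysis 13
			5. Logistic regression 15
			6. Support vector machine 16
		Section 3 Correction Methods for Imbalanced Data 18
			1. Oversampling 19
			2. Undersampling 19
			3. ROSE sampling 19
		Section 4 CRIMCOORD Variable Transformation 20
	Chapter 3 Research Method 22
		Section 1 System Processing Flow 22
			1. Data processing 22
			2. Model building 23
		Section 2 Complexity Measures for Categorical Variables 23
			C2 measure 24
			C3 measure 24
			C4 measure 24
		Section 3 Evaluation of Classification Techniques 25
	Chapter 4 Case Study 27
		Section 1 Data Overview 27
		Section 2 Summary and Comparison of System Output 28
		Section 3 Effect of Imbalance Correction Methods on Classification Techniques 31
		Section 4 Effect of Categorical-Variable Complexity Measures on Classification Techniques 35
	Chapter 5 Conclusions and Recommendations 40
		Section 1 Conclusions 40
		Section 2 Recommendations for Future Research 41
	References
		Chinese-Language References 42
		English-Language References 43
	List of Tables
		Table 1 Recommendations for Choosing a Classification Technique 9
		Table 2 Data Set Information 27
		Table 3 Model Selection Summary 29
		Table 4 Data Complexity Measure Values 30
		Table 5 Performance after Data Correction 34
		Table 6 Complexity Measure Values for Categorical Data 36
	List of Figures
		Figure 1 Research Flowchart 3
		Figure 2 MST Illustration 6
		Figure 3 LDA Illustration 13
		Figure 4 SVM Illustration 16
		Figure 5 ROC Curve Illustration 26
		Figure 6 Effect of Sampling Methods on Classification Techniques 33
		Figure 7 Effect of the C2 Measure on Classification Techniques 37
		Figure 8 Effect of the C3 Measure on Classification Techniques 38
		Figure 9 Effect of the C4 Measure on Classification Techniques 39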
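Full-text access rights	On campus: print thesis available immediately; electronic full text authorized for on-campus access; electronic thesis available immediately. Off campus: electronic thesis authorized for off-campus access and available immediately.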
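
The abstract in this record mentions resampling to balance the two classes, and the table of contents names oversampling, undersampling and ROSE sampling. As a rough illustration only, the Python sketch below shows plain random oversampling and random undersampling on a small data set with categorical variables. It is not the thesis's implementation: the function names, the toy rows and the choice of Python are assumptions made here, and ROSE sampling is not covered.

```python
import random
from collections import Counter


def group_by_label(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller class until the class sizes match."""
    rng = random.Random(seed)
    groups = group_by_label(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        # Draw with replacement to fill the gap up to the majority size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of the larger class so the class sizes match."""
    rng = random.Random(seed)
    groups = group_by_label(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        # Draw without replacement down to the minority size.
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical toy data: two categorical variables, 8 majority rows, 2 minority rows.
rows = [("red", "small")] * 5 + [("blue", "large")] * 3 + [("red", "large")] * 2
labels = [0] * 8 + [1] * 2

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(over_labels))
print(Counter(under_labels))
```

Both functions return a data set whose two classes have equal counts. Oversampling keeps every original row and adds duplicates of minority rows, while undersampling discards majority rows, so the two make different trade-offs between data size and information loss.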

