| System ID | U0002-2501202402071400 |
|---|---|
| DOI | 10.6846/tku202400075 |
| Title (Chinese) | R語言區間象徵型資料矩陣視覺化探索式分析平台:iGAP套件 |
| Title (English) | iGAP:Matrix visualization for interval value based on GAP framework in R |
| Title (third language) | |
| University | Tamkang University (淡江大學) |
| Department (Chinese) | 統計學系應用統計學碩士班 |
| Department (English) | Department of Statistics |
| Foreign degree: university | |
| Foreign degree: college | |
| Foreign degree: graduate institute | |
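| Academic year | 112 (ROC calendar; 2023-24) |
| Semester | 1 |
| Year of publication | 113 (ROC calendar; 2024) |
| Graduate student (Chinese) | 陳紹安 |
| Graduate student (English) | Shao-An Chen |
| Student ID | 610650193 |
| Degree | Master's |
| Language | Traditional Chinese |
| Second language | |
| Oral defense date | 2024-01-10 |
| Number of pages | 49 |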
| Oral defense committee |
Advisor - 高君豪 (157294@mail.tku.edu.tw)
Committee member - 楊文
Committee member - 吳漢銘 |
| Keywords (Chinese) |
廣義相關圖; 探索性分析; 象徵型資料分析; 區間資料; 資料視覺化; 排序演算法; R套件 |
| Keywords (English) |
Generalized Association Plots; Exploratory Data Analysis; Symbolic Data Analysis; Interval Data; Data Visualization; Sorting Algorithm; R Package |
| Keywords (third language) | |
| Subject classification | |
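| Abstract (Chinese, translated) |
R is very popular in statistics, and many novel statistical methods are published as R packages. This study implements Generalized Association Plots (GAP) in R to handle interval data, so that the method can be supported by R's large collection of symbolic interval datasets and algorithm packages. GAP is an extension of Matrix Visualization (MV). The study places particular emphasis on the Exploratory Data Analysis (EDA) of symbolic interval data, which lets the data tell its own story before complex statistical methods are carried out. For this purpose, the study develops a package named iGAP that brings the GAP concept into R, so that interval data can be permuted and visualized and users can understand the relationships and patterns among the data quickly and clearly. We believe that GAP, as an advanced exploratory data analysis tool, can be applied in R not only to interval data but can also be generalized to other data types in the future. |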
| Abstract (English) |
R is highly popular in statistics, and many new statistical methods are released as R packages. This study implements Generalized Association Plots (GAP) for interval data in R, so that the method can draw on R's large collection of symbolic interval datasets and algorithm packages. GAP is an extension of Matrix Visualization (MV). The work focuses on the Exploratory Data Analysis (EDA) of symbolic interval data, letting the data tell its own story before complex statistical methods are applied. To this end, we developed a package named iGAP, which brings the GAP concept into R and permutes and visualizes interval data so that users can quickly and clearly see the relationships and patterns within the data. We believe that GAP, as an advanced EDA tool in R, can be applied to interval data and may be extended to other data types in the future. |
| Abstract (third language) | |
| Table of contents |
Contents
Chapter 1 Introduction
  1.1 Research Background and Motivation
  1.2 Research Objectives
  1.3 Thesis Organization
Chapter 2 Literature Review
  2.1 Symbolic Data Analysis
  2.2 Generalized Association Plots
  2.3 Matrix Visualization of Symbolic Interval Data
    2.3.1 Correlation Proximity Matrix for Interval-Data Variables
    2.3.2 Distance Proximity Matrix for Interval-Data Concepts
  2.4 Sorting Algorithms
Chapter 3 Package Design and Related Packages
  3.1 iGAP Design Workflow
  3.2 Data Preprocessing and Construction
  3.3 Symbolic Data Processing
  3.4 Visualization Techniques
  3.5 Summary of iGAP Functions
Chapter 4 Applications
  4.1 Cars Data
  4.2 Edible Oils Data
Chapter 5 Conclusions and Future Work
  5.1 Summary
  5.2 Future Research Directions
List of Figures
  Figure 2.1 Ordinary and symbolic interval data matrices, with the correlation matrix between their variables and the distance matrix between their concepts
  Figure 2.2 Gradient coloring for interval data matrix visualization
  Figure 3.1 iGAP package design flowchart
  Figure 3.2 Matrix visualization of the proximity matrices of the ex1_db2so interval data after quantification
  Figure 3.3 Standardized coloring of interval data
  Figure 3.4 R2E ordering applied to two proximity matrices
  Figure 3.5 Matrix visualization of the uscrime_int interval data
  Figure 3.6 RColorBrewer palettes
  Figure 3.7 Generalized association plot of the ex1_db2so interval data with default parameters
  Figure 4.1 Generalized association plot of the Cars.int interval data; distances between concepts use the Span Normalized Euclidean Hausdorff distance, correlations between variables use the Empirical Correlation Coefficient, and R2E is used for ordering
  Figure 4.2 Generalized association plot of the oils interval data; distances between concepts use the Span Normalized Euclidean Hausdorff distance, correlations between variables use the Empirical Correlation Coefficient, and the Single-linkage agglomerative algorithm is used for ordering
  Figure 5.1 Mock-up of a graphical interface for the iGAP package
List of Tables
  Table 3.1 Interval data sets built into the R package dataSDA
  Table 3.2 Interval data sets built into the R package RSDA
  Table 3.3 Interval data sets built into the R package ggESDA
  Table 3.4 Computing environment
  Table 3.5 Summary of iGAP functions |
| Full-text access rights |