§ Thesis Bibliographic Record
  
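System ID: U0002-2501202402071400
DOI: 10.6846/tku202400075
Thesis title (Chinese): R語言區間象徵型資料矩陣視覺化探索式分析平台:iGAP套件
Thesis title (English): iGAP: Matrix visualization for interval value based on GAP framework in R
Thesis title (third language):
Institution: Tamkang University (淡江大學)
Department (Chinese): Department of Statistics, Master's Program in Applied Statistics (統計學系應用統計學碩士班)
Department (English): Department of Statistics
Foreign degree university:
Foreign degree college:
Foreign degree graduate institute:
Academic year: 112 (ROC calendar; 2023-24)
Semester: 1
Year of publication: 113 (ROC calendar; 2024)
Student (Chinese): 陳紹安
Student (English): Shao-An Chen
Student ID: 610650193
Degree: Master's
Language: Traditional Chinese
Second language:
Oral defense date: 2024-01-10
Number of pages: 49
Committee: Advisor - 高君豪 (157294@mail.tku.edu.tw)
Committee member - 楊文
Committee member - 吳漢銘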
Keywords (Chinese): 廣義相關圖
探索性分析
象徵型資料分析
區間資料
資料視覺化
排序演算法
R套件
Keywords (English): Generalized Association Plots
Exploratory Data Analysis
Symbolic Data Analysis
Interval Data
Data Visualization
Sorting Algorithm
R Package
Keywords (third language):
Subject classification:
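Abstract (Chinese, translated into English)
R is very popular in statistics, and many novel statistical methods are released as R packages. This study implements Generalized Association Plots (GAP) for interval data in R, so that GAP can draw on R's large collection of symbolic interval data sets and algorithm packages. GAP is an extension of Matrix Visualization (MV). It places particular emphasis on Exploratory Data Analysis (EDA) of symbolic interval data, letting the data tell their own story before complex statistical methods are applied. To this end, the study developed a package named iGAP that brings the GAP concept into R, permuting and visualizing interval data so that users can quickly and clearly grasp the relationships and patterns in the data. We believe that GAP, as an advanced exploratory data analysis tool, can be applied in R not only to interval data but can also be extended to other data types in the future.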
Abstract (English)
R is widely used in statistics, and many new statistical methods are released as R packages. This study implements Generalized Association Plots (GAP) for interval data in R, where it can draw on the large body of symbolic interval data sets and algorithm packages already available. GAP extends Matrix Visualization (MV). The work focuses on Exploratory Data Analysis (EDA) of symbolic interval data, which lets the data tell their own story before more complex statistical methods are applied. To this end, we developed a package named iGAP that brings the GAP concept into R and arranges and visualizes interval data so that users can quickly and clearly see the relationships and patterns within them. We believe that GAP, as an advanced EDA tool in R, can be applied to interval data now and potentially extended to other data types in the future.
Abstract (third language):
Table of Contents
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Symbolic Data Analysis
2.2 Generalized Association Plots
2.3 Matrix Visualization of Symbolic Interval Data
2.3.1 Correlation Proximity Matrix for Interval-Valued Variables
2.3.2 Distance Proximity Matrix for Interval-Valued Concepts
2.4 Sorting Algorithms
Chapter 3 Package Design and Related Packages
3.1 iGAP Design Workflow
3.2 Data Preprocessing and Construction
3.3 Symbolic Data Processing
3.4 Visualization Techniques
3.5 Summary of iGAP Functions
Chapter 4 Applications
4.1 Cars Data
4.2 Edible Oils Data
Chapter 5 Conclusions and Future Work
5.1 Summary
5.2 Future Research Directions
List of Figures
Figure 2.1 Data matrices for classical data and symbolic interval data, with their between-variable correlation matrices and between-concept distance matrices
Figure 2.2 Gradient coloring for matrix visualization of interval data
Figure 3.1 Design flowchart of the iGAP package
Figure 3.2 Visualization of the proximity matrices of the ex1_db2so interval data after quantification
Figure 3.3 Standardized coloring of interval data
Figure 3.4 R2E sorting of the two proximity matrices
Figure 3.5 Matrix visualization of the uscrime_int interval data
Figure 3.6 RColorBrewer color palettes
Figure 3.7 Generalized association plot of the ex1_db2so interval data with default parameter settings
Figure 4.1 Generalized association plot of the Cars.int interval data; between-concept distances use the span-normalized Euclidean Hausdorff distance, variable correlations use the empirical correlation coefficient, and sorting uses R2E
Figure 4.2 Generalized association plot of the oils interval data; between-concept distances use the span-normalized Euclidean Hausdorff distance, variable correlations use the empirical correlation coefficient, and sorting uses the single-linkage agglomerative algorithm
Figure 5.1 Schematic of a graphical interface for the iGAP package
List of Tables
Table 3.1 Interval data sets built into the R package dataSDA
Table 3.2 Interval data sets built into the R package RSDA
Table 3.3 Interval data sets built into the R package ggESDA
Table 3.4 Computing environment
Table 3.5 Summary of iGAP functions
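Full-Text Access Permissions
National Central Library
Royalty-free license granted to the National Central Library; the bibliographic record and the full-text electronic file are made publicly available on the Internet immediately upon submission of the authorization form
On campus
Print thesis available on campus immediately
Full text of the electronic thesis licensed for worldwide public access
Electronic thesis available on campus immediately
Off campus
License granted to database vendors
Electronic thesis available off campus immediately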
