§ Thesis Bibliographic Record
  
System ID U0002-1706200815035600
DOI 10.6846/TKU.2008.00476
Title (Chinese) 具「由下而上」樹狀結構的重疊影像分群演算法
Title (English) A Crossover-Imaged Clustering Algorithm with Bottom-up Tree Architecture
Title (Third Language)
University Tamkang University
Department (Chinese) 資訊工程學系博士班
Department (English) Department of Computer Science and Information Engineering
Foreign University Name
Foreign College Name
Foreign Graduate Institute Name
Academic Year 96
Semester 2
Publication Year 97
Author (Chinese) 張忠義
Author (English) Chung-I Chang
Student ID 892190322
Degree Doctorate
Language English
Second Language
Defense Date 2008-06-04
Pages 80
Committee Advisor - 林丕靜 (nancylin@mail.tku.edu.tw)
Member - 陳穆臻
Member - 葛煥昭
Member - 王亦凡
Member - 蔣定安
Member - 林丕靜
Keywords (Chinese) 方格基底分群演算法 (grid-based clustering algorithm)
方格系統 (grid system)
重要方格 (significant cell)
半重要方格 (semi-significant cell)
重疊影像 (crossover image)
由下而上樹 (bottom-up tree)
Keywords (English) grid-based clustering algorithms
grid system
significant cell
semi-significant cell
crossover image
bottom-up tree
Keywords (Third Language)
Subject Categories
Abstract (Chinese)
The greatest advantage of grid-based clustering algorithms is their computational speed, but they are often limited by the parameter settings for cell size and cell density. In this dissertation we propose an algorithm that not only offers, compared with traditional grid-based clustering algorithms, a wider range of cell-size and cell-density parameters under which the correct clustering result can be found, but also inherits their computational speed.
The proposed crossover-imaged clustering algorithm uses two grid systems, generated by a coordinate shift, to avoid the mis-clustering that cell boundaries can cause and to increase clustering accuracy. It defines three levels of cell types (significant, semi-significant, non-significant) and, through the four node levels of a bottom-up tree (leaf level: significant cells; connection level: pre-clusters; combination level: semi-significant cells; root: the set of clusters), links and combines the significant cells of the grid systems in a bottom-up hierarchical computation to produce the best clustering result.
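As a rough illustration of the two shifted grid systems described above, the Python sketch below partitions a 2-D data set into a first grid G1 and a second grid G2 offset by half a cell, then labels each cell as significant or semi-significant by comparing its point count against two density thresholds (all other cells are implicitly non-significant). The function names, the half-cell shift, and all parameter values are illustrative assumptions, not values fixed by the dissertation.

```python
import numpy as np

def build_grid(points, cell_size, offset=0.0):
    """Map each 2-D point to a cell index of a grid shifted by `offset`,
    and count the points falling into each cell."""
    cells = {}
    for x, y in points:
        key = (int((x - offset) // cell_size), int((y - offset) // cell_size))
        cells[key] = cells.get(key, 0) + 1
    return cells

def classify_cells(cells, t_sig, t_semi):
    """Label cells: 'significant' if count >= t_sig, 'semi-significant'
    if t_semi <= count < t_sig; everything else is non-significant."""
    labels = {}
    for key, count in cells.items():
        if count >= t_sig:
            labels[key] = "significant"
        elif count >= t_semi:
            labels[key] = "semi-significant"
    return labels

# Hypothetical parameters: cell size 1.0, thresholds 10 and 4, and a
# half-cell shift for the second grid that yields the "crossover" image.
rng = np.random.default_rng(0)
points = rng.normal(loc=5.0, scale=1.0, size=(500, 2))
g1 = classify_cells(build_grid(points, 1.0, offset=0.0), t_sig=10, t_semi=4)
g2 = classify_cells(build_grid(points, 1.0, offset=0.5), t_sig=10, t_semi=4)
print(len(g1), "labeled cells in G1;", len(g2), "in G2")
```

Because the two grids disagree only near cell boundaries, a dense region split across a boundary in G1 tends to fall inside a single cell of G2, which is what makes the crossover of the two images useful.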
Abstract (English)
The grid-based clustering algorithm is efficient, but its results are strongly influenced by the size of the predefined cells and by the density threshold used to identify significant cells. The data space is partitioned into a finite number of cells to form a grid system, and all clustering operations are then performed on that grid system. A critical problem is that the distribution of the data in each dimension is not known in advance, so a method that reduces the influence of the predefined cell size and of the density threshold for significant cells is needed. To cluster efficiently while reducing the influence of the cell size and inheriting the low time complexity of grid-based methods, we propose "A Crossover-Imaged Clustering Algorithm with Bottom-up Tree Architecture" (CIC-BTA) in this dissertation.
The data space is first partitioned into cells that form the first grid system; a coordinate shift then transforms the data space into a second grid system. The chosen density threshold is applied to the predefined cells to identify the significant and semi-significant cells. With our algorithm, a cell size and density threshold that group the significant cells into the intended clusters are easy to select. The significant cells are first connected into pre-clusters; the semi-significant cells of the shifted grid system are then used to join significant cells and other semi-significant cells into clusters. We build a bottom-up tree with four levels, from the leaf nodes (significant cells) through the connection level (pre-clusters) and the combination level (semi-significant cells) to the root (the set of clusters), and use the semi-significant cells to merge pre-clusters into final clusters. Finally, we verify by experiment that the proposed algorithm reduces the influence of the predefined cell size and of the density threshold of significant cells, while still inheriting low time complexity: it requires at most a single scan through the data and one pass of cell clustering.
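The four tree levels can be read as a single bottom-up pass: significant cells (leaf level) are linked into pre-clusters (connection level), and semi-significant cells (combination level) bridge pre-clusters into the final set of clusters (root). A minimal sketch of that pass, assuming 4-connected cell adjacency and a simple union-find, is given below; the helper names are ours, not the dissertation's.

```python
def neighbors(cell):
    """4-connected neighbors of a grid cell (i, j) -- an assumed adjacency."""
    i, j = cell
    return [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]

class UnionFind:
    """Minimal union-find over hashable cell keys."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def bottom_up_clusters(significant, semi_significant):
    """Leaf level: significant cells. Connection level: adjacent significant
    cells are linked into pre-clusters. Combination level: a semi-significant
    cell adjacent to several pre-clusters merges them. Root: the cluster set."""
    uf = UnionFind()
    for cell in significant:                       # connection level
        for nb in neighbors(cell):
            if nb in significant:
                uf.union(cell, nb)
    for cell in semi_significant:                  # combination level
        adj = [nb for nb in neighbors(cell) if nb in significant]
        for nb in adj[1:]:
            uf.union(adj[0], nb)
    clusters = {}                                  # root: set of clusters
    for cell in significant:
        clusters.setdefault(uf.find(cell), []).append(cell)
    return list(clusters.values())

# Toy example: two runs of significant cells bridged by one semi-significant
# cell collapse into a single cluster of four cells.
sig = {(0, 0), (0, 1), (0, 3), (0, 4)}
semi = {(0, 2)}
print(bottom_up_clusters(sig, semi))
```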
Abstract (Third Language)
Table of Contents
Contents…………………………………………………………………………………………………...I
List of Figures ........................................................................................................................................ IV
List of Tables ......................................................................................................................................... VII

Chapter 1 Introduction………………………………………………………………...1
1.1 Research Motivation ……………………...................................................................................2
1.2 Research Objective ………………….........................................................................................4
1.3 Organization of this Dissertation................................................................................................6
Chapter 2 Background Knowledge on Data Mining…………………………………7
2.1 Data Mining……………………………………………………………………………………..7
2.1.1 Classification………………………………………………………………………………..8
2.1.2 Regression and Prediction…………………………………………………………………9
2.1.3 Time-series analysis……………………………………………………………………….11
2.1.4 Summarization……………………………………………………………………………12
2.1.5 Association rules…………………………………………………………………………..12
2.1.6 Sequence discovery………………………………………………………………………..14
2.1.7 Clustering………………………………………………………………………………….14
2.2 Related clustering algorithms…..………………………………………………………15
2.2.1 Hierarchical clustering algorithms………………………………………………………15
2.2.2 Partition-based clustering algorithms…………………………………………………...16
2.2.3 Density-based clustering algorithms……………………………………………………..16
2.2.4 Model-based clustering algorithms………………………………………………………16
2.2.5 Grid-based clustering algorithms………………………………………………………..17
2.3 Related grid based clustering algorithms……………………………………..17
2.3.1 Hill-Climbing algorithm………………………………………………………………….18
2.3.2 SGDC: Single-phase grid clustering algorithm…………………………………………19
2.3.3 STING……………………………………………………………………………………..20
2.3.4 CLIQUE…………………………………………………………………………………..22
2.3.5 GCHL……………………………………………………………………………………...23
2.3.6 NSGC………………………………………………………………………………………28
2.4 Tree architecture………………………………………………………………….31
2.5 Problem Statement………………………………………………………………..36

Chapter 3 The “Crossover-Imaged Clustering Algorithm with Bottom-up Tree Architecture”……………………..…………………………...38
3.1 Basic definition of cells…………………………………………………………...38
3.2 Adaptive Deflect and Conquer Clustering algorithm…………………………..43
3.2.1 ADCC Algorithm……………………………………………………………….…………43
3.2.2 Example……………………………………………………………………………………46
3.3 The Crossover-Imaged Clustering Algorithm with Bottom-up Tree Architecture…………...51
3.3.1 Bottom-up Tree Architecture in CIC-BTA algorithm……………………………………..51
3.3.2 CIC-BTA algorithm………………………………………………………………………….53
3.3.3 Example………………………………………………………………………………………57

Chapter 4 Experiments………………………………………………………………62
4.1 The distribution of data…………………………………………………………..62
4.2 The experimental results…………………………………………………………64
4.3 Discussion…………………………………………………………………………67

Chapter 5 Conclusions…...…………………………………………………………...71
References……………………………………………………………………………..72
 
List of Figures
Figure 2.1 a decision tree for the concept buys-house……………………………………………….9
Figure 2.2 time-series plots……………………………………………………………………………11
Figure 2.3 transactional data for market based analysis…………………………………………….13
Figure 2.4 the taxonomy of clustering approaches…………………………………………………15
Figure 2.5 the example of hill climbing……………………………………………………………….18
Figure 2.6 result of clustering…………………………………………………………………………20
Figure 2.7 a hierarchical structure for STING clustering……………………………………...........21
Figure 2.8 identification of clusters in the subspaces (projections) of data space…………………23
Figure 2.9 the flowchart of the GCHL algorithm……………………………..…………………….25
Figure 2.10 depiction of GCHL’s cutting planes/dimensions and a cubic block………………27
Figure 2.11 GCHL’s cutting plane and cubic block………………………………………………...27
Figure 2.12 procedure of the NSGC algorithm………………………………………………….........29
Figure 2.13 the sketch of the group assignment………………………………………………………30
Figure 2.14 the considered cell………………………………………………………………………..30
Figure 2.15 a binary tree………………………………………………………………………………31
Figure 2.16 an n-ary tree of degree 4………………………………………………………………….33
Figure 2.17 a general tree………………………..…………………………………………………….33
Figure 2.18 the dendrogram corresponding to the six points………………………………………35
Figure 3.1 first grid system G1……………………………….………………………………………39
Figure 3.2 second grid system G2…………………………………………………………………….39
Figure 3.3 nearby (neighboring) cells………………………………………………………………….40
Figure 3.4 Three types of significant cells………………………….…………………………………42
Figure 3.5 cater-corner cells of cell X…………………………………………………………………42
Figure 3.6 extent of one cell……………………………………………………………………………43
Figure 3.7 CM function………………………………………………………………………………45
Figure 3.8 two-dimensional example…………………………………………………………………46
Figure 3.9 the first clustering S1 (202 cells, 11 clusters)……………………………………………..47
Figure 3.10 the second clustering S2 (212 cells, 14 clusters)…………………………………………48
Figure 3.11 the final clustering result of ADCC…………………………………………….………51
Figure 3.12 Clustering trees (CT)…………………………………………………………………......52
Figure 3.13 one set of extended significant cells……………………………………………………...55
Figure 3.14 a cater-corner significant cell……………………………………………………………..56
Figure 3.15 figure of example…………………………………………………………………………57
Figure 3.16 the grid system G2………………………………………………………………………...58
Figure 3.17 the image of significant cells…………………………………………………………......59
Figure 3.18 the grid system G1…………………………………………………………………….......59
Figure 3.19 the image of significant cells……………………………………………………………..60
Figure 3.20 crossover images of significant cells……………………………………………………..61
Figure 3.21 the pre-clustering result…………………………………………………………………..61
Figure 4.1 Experiment 1……………………………………………………………………………….63
Figure 4.2 Experiment 2……………………………………………………………………………….63
Figure 4.3 Experiment 3……………………………………………………………………………….63
Figure 4.4 Experiment 4……………………………………………………………………………….63
Figure 4.5 Experiment 5……………………………………………………………………………….63
Figure 4.6 Experiment 6……………………………………………………………………………….63
Figure 4.7 Correct rates of CIC-BTA and CLIQUE…………………………………………...........64
Figure 4.8 Hill-Climbing………………………….……………………………………………...........68
Figure 4.9 CIC-BTA……………………………………………………………………………...........68
Figure 4.10 K-means…………………………..………………………………………………...........69
Figure 4.11 CIC-BTA……………………………..……………………………………………...........69

 
List of Tables
Table 2.1 salary data……………………………………………………………………………………9
Table 3.1 rules of R0……………………………………………………………………….49
Table 3.2 rules of R0………………………………………………………………………..49
Table 3.3 the set of final clusters…………………………………………………………………….50
Table 4.1 Experimental data information……………………………………………………………62
Table 4.2 feature of data distribution………………………………………………………………..62
Table 4.3 the correct rate comparison sheet of experiment 1………………………………………64
Table 4.4 the correct rate comparison sheet of experiment 2………………………………………65
Table 4.5 the correct rate comparison sheet of experiment 3………………………………………66
Table 4.6 the correct rate comparison sheet of experiment 4………………………………………66
Table 4.7 the correct rate comparison sheet of experiment 5………………………………………66
Table 4.8 the correct rate comparison sheet of experiment 6………………………………………66
Table 4.9 average executing time and number of clusters among four algorithms………………..67
Table 4.10 Comparison of CLIQUE and CIC-BTA (d: density threshold)............………………..69
Table 4.11 Comparison of CLIQUE and CIC-BTA (m: number of divided parts)….……………..70
Table 4.12 Comparison of CLIQUE and CIC-BTA (m, d)………………………………………....70
References
[1] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. “A density-based algorithm for discovering clusters in large spatial databases with noise”. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD’96), pages 226-231, 1996.
[2] A. Hinneburg and D. A. Keim. “An efficient approach to clustering in large multimedia databases with noise”. In Proc. Knowledge Discovery and Data Mining, pages 58-65, 1998.
[3] M. Ankerst et al. “OPTICS: Ordering points to identify the clustering structure”. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 49-60, 1999.
[4] W. Wang, J. Yang, and R. R. Muntz. “STING: A statistical information grid approach to spatial data mining”. In Proc. 23rd Int. Conf. on Very Large Data Bases (VLDB), pages 186-195, 1997.
[5] G. Sheikholeslami, S. Chatterjee, and A. Zhang. “WaveCluster: A wavelet-based clustering approach for spatial data in very large databases”. The VLDB Journal, pages 289-304, 2000.
[6] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. “Automatic subspace clustering of high dimensional data for data mining applications”. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 94-105, 1998.
[7] Z. Huang. “Extensions to the k-means algorithm for clustering large data sets”. Data Mining and Knowledge Discovery, vol. 2, pages 283-304, 1998.
[8] J. MacQueen. “Some methods for classification and analysis of multivariate observations”. In Proc. 5th Berkeley Symp. on Math. Statist. and Prob., 1:281-297, 1967.
[9] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons, 1990.
[10] N. Lin, C. Chang, and C. Pan. “An adaptive deflect and conquer clustering algorithm”. In Proc. 6th WSEAS Int. Conf. ACOS’07, pages 156-160, 2006.
[11] C. C. Aggarwal and P. S. Yu. “An effective and efficient algorithm for high-dimensional outlier detection”. The VLDB Journal, 14:211-221, 2005.
[12] A. H. Pilevar and M. Sukumar. “GCHL: A grid-clustering algorithm for high-dimensional very large spatial data bases”. Pattern Recognition Letters, 26:999-1010, 2005.
[13] Y. C. Zhao and J. Song. “GDILC: A grid-based density-isoline clustering algorithm”. In Proc. Int. Conf. on Info-net, vol. 3, pages 140-145, 2001.
[14] E. W. M. Ma and T. W. S. Chow. “A new shifting grid clustering algorithm”. Pattern Recognition, 37(3):503-514, 2004.
[15] P. Alevizos, B. Boutsinas, D. Tasoulis, and M. N. Vrahatis. “Improving the K-windows clustering algorithm”. In Proc. 14th IEEE Int. Conf. on Tools with Artificial Intelligence, pages 239-245, 2002.
[16] C. E. Shannon. “A mathematical theory of communication”. Bell System Technical Journal, 27:379-423 and 623-656, July and October 1948.
[17] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
[18] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. “From data mining to knowledge discovery: An overview”. AAAI/MIT Press, 1996.
[19] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons, 1990.
[20] S. L. Lauritzen. “The EM algorithm for graphical association models with missing data”. Computational Statistics and Data Analysis, 19:191-201, 1995.
[21] R. Ng and J. Han. “Efficient and effective clustering method for spatial data mining”. In Proc. 1994 Int. Conf. on Very Large Data Bases (VLDB’94), pages 144-155, Santiago, Chile, Sept. 1994.
[22] M. Ester, H.-P. Kriegel, and X. Xu. “Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification”. In Proc. 4th Int. Symp. on Large Spatial Databases (SSD’95), pages 67-82, Portland, ME, Aug. 1995.
[23] P. Bradley, U. Fayyad, and C. Reina. “Scaling clustering algorithms to large databases”. In Proc. 1998 Int. Conf. on Knowledge Discovery and Data Mining (KDD’98), pages 9-15, New York, Aug. 1998.
[24] T. Zhang, R. Ramakrishnan, and M. Livny. “BIRCH: An efficient data clustering method for very large databases”. In Proc. 1996 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’96), pages 103-114, Montreal, Canada, June 1996.
[25] S. Guha, R. Rastogi, and K. Shim. “CURE: An efficient clustering algorithm for large databases”. In Proc. 1998 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’98), pages 73-84, Seattle, WA, June 1998.
[26] G. Karypis, E.-H. Han, and V. Kumar. “CHAMELEON: A hierarchical clustering algorithm using dynamic modeling”. IEEE Computer, 32:68-75, 1999.
[27] S. Guha, R. Rastogi, and K. Shim. “ROCK: A robust clustering algorithm for categorical attributes”. In Proc. 1999 Int. Conf. on Data Engineering (ICDE’99), pages 512-521, Sydney, Australia, Mar. 1999.
[28] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. “A density-based algorithm for discovering clusters in large spatial databases”. In Proc. 1996 Int. Conf. on Knowledge Discovery and Data Mining (KDD’96), pages 226-231, Portland, OR, Aug. 1996.
[29] A. Hinneburg and D. A. Keim. “An efficient approach to clustering in large multimedia databases with noise”. In Proc. 1998 Int. Conf. on Knowledge Discovery and Data Mining (KDD’98), pages 58-65, New York, Aug. 1998.
[30] D. Fisher. “Improving inference through conceptual clustering”. In Proc. 1987 AAAI Conf., pages 461-465, Seattle, WA, July 1987.
[31] J. Gennari, P. Langley, and D. Fisher. “Models of incremental concept formation”. Artificial Intelligence, 40:11-61, 1989.
[32] T. Kohonen. “Self-organized formation of topologically correct feature maps”. Biological Cybernetics, 43:59-69, 1982.
[33] J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. “DMQL: A data mining query language for relational databases”. In Proc. Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD’96), pages 27-34, June 1996.
[34] A. E. Graham and R. J. Morse. “How U.S. News ranks colleges”. U.S. News & World Report, pages 84-87, August 1999.
[35] R. Agrawal, T. Imielinski, and A. N. Swami. “Mining association rules between sets of items in large databases”. In Proc. ACM Int. Conf. on Management of Data, pages 207-216, 1993.
[36] R. Agrawal and R. Srikant. “Fast algorithms for mining association rules in large databases”. In Proc. Int. Conf. on Very Large Data Bases, pages 487-499, 1994.
[37] R. Srikant. “Fast algorithms for mining association rules and sequential patterns”. PhD dissertation, University of Wisconsin, 1996.
[38] A. K. Jain, M. N. Murty, and P. J. Flynn. “Data clustering: A review”. ACM Computing Surveys, vol. 31, no. 3, Sep. 1999.
[39] M. H. Dunham. Data Mining: Introductory and Advanced Topics. Pearson Education, Upper Saddle River, NJ, 2003.
[40] A. Hinneburg and D. A. Keim. “Clustering methods for large databases: From the past to the future”. ACM SIGMOD Tutorial, 1999.
[41] J. Gehrke, R. Ramakrishnan, and V. Ganti. “RainForest: A framework for fast decision tree construction of large datasets”. In Proc. Int. Conf. on Very Large Data Bases, pages 416-427, 1998.
[42] M. M. Breunig, H.-P. Kriegel, P. Kroger, and J. Sander. “Data bubbles: Quality preserving performance boosting for hierarchical clustering”. In Proc. ACM Int. Conf. on Management of Data, pages 79-90, 2001.
[43] Y. Xiao and M. H. Dunham. “Efficient mining of traversal patterns”. Data and Knowledge Engineering, 39(2):191-214, November 2001.
[44] R. L. Kennedy, Y. Lee, B. Van Roy, C. D. Reed, and R. P. Lippman. Solving Data Mining Problems Through Pattern Recognition. Englewood Cliffs, NJ: Prentice Hall, 1998.
[45] T. Ozsu and P. Valduriez. Principles of Distributed Database Systems. Englewood Cliffs, NJ: Prentice Hall, 1999.
[46] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison Wesley, 1989.
[47] D. Fisher. “Improving inference through conceptual clustering”. In Proc. 1987 AAAI Conf., pages 461-465, Seattle, WA, July 1987.
[48] P. Cheeseman and J. Stutz. “Bayesian classification (AutoClass): Theory and results”. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 153-180. Cambridge, MA: AAAI/MIT Press, 1996.
[49] D. E. Rumelhart and D. Zipser. “Feature discovery by competitive learning”. Cognitive Science, 9:75-112, 1985.
[50] T. Kohonen. “Self-organized formation of topologically correct feature maps”. Biological Cybernetics, 43:59-69, 1982.
[51] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach (2nd ed.). Upper Saddle River, NJ: Prentice Hall, pages 111-114, 2003.
[52] W. H. E. Day. “Complexity theory: An introduction for practitioners of classification”. In P. Arabie and L. Hubert, editors, Clustering and Classification. World Scientific Publishing Co., River Edge, NJ, 1992.
Full-Text Use Authorization
On campus:
The print thesis will be made publicly available 2 years after submission of the authorization form.
The author agrees to make the full electronic text publicly available on campus.
The on-campus electronic thesis will be made publicly available 1 year after submission of the authorization form.
Off campus:
Authorization granted.
The off-campus electronic thesis will be made publicly available 1 year after submission of the authorization form.
