Tamkang University Chueh Sheng Memorial Library (TKU Library)
System ID: U0002-2706200510181800
Title (Chinese): 以基因演算法為基礎之非監督式分群技術
Title (English): Unsupervised Clustering Techniques Based on Genetic Algorithms
University: Tamkang University
Department (Chinese): 資訊工程學系博士班
Department (English): Department of Computer Science and Information Engineering
Academic Year: 93 (2004-2005)
Semester: 2
Year of Publication: 94 (2005)
Author (Chinese): 楊富文
Author (English): Fu-Wen Yang
Email: 112405@mail.tku.edu.tw
Student ID: 887190089
Degree: Doctorate
Language: English
Date of Oral Defense: 2005-06-24
Number of Pages: 70
Committee: 林慧珍 (advisor), 徐道義, 黃俊堯, 施國琛, 顏淑惠, 洪文斌
Keywords (Chinese): 非監督式分群, 基因演算法, 交配, 突變, 群聚效度, 馬可夫鏈
Keywords (English): unsupervised clustering, genetic algorithm, crossover, mutation, cluster validity, Markov chain
Subject Classification: Applied Sciences / Computer Science and Information Engineering
Chinese Abstract: Clustering automatically partitions input data into groups without supervision. A good clustering method generally has two properties: (1) homogeneity within clusters, that is, the differences among data in the same cluster should be as small as possible; and (2) heterogeneity between clusters, that is, the differences between distinct clusters should be as large as possible. Many algorithms have been proposed in the literature for the supervised clustering problem (where the number of clusters is known in advance), such as K-means, Fuzzy c-means, and EM. However, many real-life problems are unsupervised clustering problems, and a common way to solve them is to obtain a good solution through the evolutionary process of a genetic algorithm. In this dissertation we propose two evolutionary clustering algorithms, based on different strategies, for the unsupervised clustering problem. Both strategies take the data points themselves as candidates for the cluster centers and encode the chromosomes in a binary representation, so a table of the distances between all pairs of data points can be built in advance to reduce the time spent evaluating the fitness function. The first strategy uses a genetic algorithm with more efficient genetic operators (reproduction, crossover, and mutation); because of the binary encoding, these operators require no time-consuming floating-point computation, which saves a great deal of running time. The second strategy improves the method of Yong Gao, which addresses the premature convergence that arises as the chromosomes of a canonical genetic algorithm evolve, by converting the evolutionary process into a Markov-chain probability model over the genes of each chromosome. Since no conventional genetic operators are needed and evolution is simulated purely through the per-gene Markov-chain model, this method saves still more running time. For the fitness function, both strategies adopt the Davies-Bouldin index, which captures the clustering properties stated above. We also implemented the more efficient GA-based clustering methods reported in the literature for comparison; the experimental results show that both proposed genetic clustering algorithms are faster and more effective than the other methods, which confirms the contribution of this dissertation.
English Abstract: In most real-life situations the number of clusters of a data set is not known, and no clustering system can efficiently and automatically form natural groups of the input patterns. Such difficult problems are called unsupervised clustering or non-parametric clustering problems, for which evolutionary approaches are often employed to provide appropriate clustering results. The best-known evolutionary techniques are genetic algorithms (GAs). In this thesis, we propose two evolutionary strategies for unsupervised clustering. One is GA-based; the other is based on population Markov chain modeling, which improves Yong Gao's algorithm by transforming the evolutionary process of the population in the canonical genetic algorithm into a Markov-chain probability model for each gene. In both strategies, our algorithms select data points from the data set as candidates for the cluster centers and adopt a binary representation to encode them. To speed up the evaluation of the fitness function, a lookup table of the distances between all pairs of data points is built in advance. In the first strategy, more effective crossover and mutation operators are introduced. Because we use a binary representation, unlike a string (real-number) encoding, the GA operations (reproduction, crossover, and mutation) require no floating-point computation, which saves a great deal of time. In the second strategy, the algorithm produces a new chromosome according to the per-gene Markov-chain probability model, without any conventional GA operators, so the computational cost is reduced still further compared with other GA-based clustering algorithms. Finally, the Davies-Bouldin index is employed to measure the validity of the clusters.
The superiority of the proposed algorithms over others is demonstrated in the experimental results, which show that the proposed algorithms achieve better performance in less computation time than other genetic clustering algorithms.
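The fitness design described in the abstract (binary chromosomes whose set bits pick data points as cluster-center candidates, a precomputed pairwise-distance table, and the Davies-Bouldin index) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names, the Euclidean distance, and the reciprocal-of-DB fitness are assumptions of this sketch.

```python
import math

def pairwise_distances(points):
    """Precompute the distance lookup table mentioned in the abstract."""
    n = len(points)
    table = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])  # Euclidean distance (assumed)
            table[i][j] = table[j][i] = d
    return table

def davies_bouldin(points, centers, table):
    """Davies-Bouldin index for centers chosen among the data points.
    `centers` holds indices into `points`; a lower DB value means a better
    clustering (compact clusters, well separated)."""
    # Assign each point to its nearest selected center via the lookup table.
    members = {c: [] for c in centers}
    for i in range(len(points)):
        members[min(centers, key=lambda c: table[i][c])].append(i)
    # Intra-cluster scatter: mean distance of a cluster's members to its center.
    scatter = {c: (sum(table[i][c] for i in m) / len(m) if m else 0.0)
               for c, m in members.items()}
    # For each cluster, take the worst ratio against every other cluster.
    db = 0.0
    for c in centers:
        db += max((scatter[c] + scatter[o]) / table[c][o]
                  for o in centers if o != c)
    return db / len(centers)

def fitness(chromosome, points, table):
    """Binary chromosome: bit i == 1 selects data point i as a cluster center.
    Fitness here is the reciprocal of the DB index (higher is better); the
    exact transform used in the thesis may differ."""
    centers = [i for i, bit in enumerate(chromosome) if bit]
    if len(centers) < 2:
        return 0.0  # DB index needs at least two clusters
    return 1.0 / davies_bouldin(points, centers, table)
```

For example, on six 2-D points forming two well-separated groups, a chromosome that selects one center in each group scores higher than one that selects two centers inside the same group, which is exactly the behavior the evolutionary search exploits.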
CONTENTS
List of Figures
CHAPTER 1 INTRODUCTION
1.1 Unsupervised Clustering
1.2 Motivation and Related Work
1.3 Measures for Cluster Validity
1.4 Organization of the Dissertation
CHAPTER 2 GENETIC ALGORITHMS
2.1 Canonical Genetic Algorithm
2.2 Operators of Genetic Algorithms
2.3 How to Implement a Genetic Algorithm
CHAPTER 3 PRELIMINARY OF OUR PROPOSED GENETIC CLUSTERING METHODS
3.1 Binary Representation
3.2 Population Initialization
3.3 Fitness Function Evaluation
CHAPTER 4 THE FIRST PROPOSED METHOD: A GA-BASED CLUSTERING ALGORITHM
4.1 Reproduction
4.2 Crossover
4.3 Mutation
4.4 The Experimental Results
4.5 Remarks
CHAPTER 5 THE SECOND PROPOSED METHOD: GENETIC CLUSTERING BASED ON POPULATION MARKOV CHAIN MODELING
5.1 Motivation
5.2 Preliminary
5.3 Genetic Algorithm with No Genetic Operators
5.4 Population Markov Chain Clustering (PMCC) Algorithm
5.5 Winner-Population Markov Chain Clustering (WPMCC) Algorithm
5.6 The Experimental Results
5.7 Remarks
CHAPTER 6 CONCLUSIONS AND FUTURE WORK
BIBLIOGRAPHY
PUBLICATION LIST
THE CLUSTERING SYSTEM

List of Figures

Figure 1. Sizes of 100 artificial data sets.
Figure 2. Average maximum fitness values produced by the two methods with the first set of parameters, tested 10 times on 100 artificial data sets.
Figure 3. Average maximum fitness values produced by the two methods with the second set of parameters, tested 10 times on 100 artificial data sets.
Figure 4. Average processing time per data point required by each of the two methods using the first set of parameters, tested 10 times on 100 artificial data sets.
Figure 5. Average processing time per data point required by each of the two methods using the second set of parameters, tested 10 times on 100 artificial data sets.
Figure 6. Four selected data sets.
Figure 7. The clustering results of the proposed method for the data sets shown in Figure 6. The cluster centers are marked with ‘Ο’.
Figure 8. The clustering results of the GCUK method for the data sets shown in Figure 6. The cluster centers are marked with ‘Ο’.
Figure 9. The clustering results of the proposed method applied to separate shell clusters. The cluster centers are marked with ‘Ο’.
Figure 10. Population Markov chain modeling for the conventional GA.
Figure 11. Average maximum fitness values given by the three methods tested 10 times on 100 artificial data sets (using the first set of parameters).
Figure 12. Average maximum fitness values given by the three methods tested 10 times on 100 artificial data sets (using the second set of parameters).
Figure 13. Average processing time for one data point required by each of the three methods tested 10 times on 100 artificial data sets (using the first set of parameters).
Figure 14. Average processing time for one data point required by each of the three methods tested 10 times on 100 artificial data sets (using the second set of parameters).
Figure 15. Some clustering results of the WPMCC method. The cluster centers are marked with ‘Ο’.
Figure 16. Some clustering results of the PMCC method. The cluster centers are marked with ‘Ο’.
Figure 17. The maximum fitness values produced in every generation for each of the three methods tested on the data sets shown in Figures 15 and 16.
Figure 18. Some clustering results of the WPMCC method for some data sets. The cluster centers are marked with ‘Ο’.
Figure 19. Some clustering results of the PMCC method for some data sets. The cluster centers are marked with ‘Ο’.
Figure 20. Comparison of the absolute correct clustering rates using two sets of parameters tested on 100 artificial data sets.
Figure 21. Some clustering results of the WPMCC method: (a) and (c) without the winner-replacing step, (b) and (d) with the winner-replacing step.
Figure 22. The main menu window of the system showing an input data set.
Figure 23. The window for parameter setup.
Figure 24. The main menu window of the system showing the clustering result.
BIBLIOGRAPHY
[1] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, New York, 1999.
[2] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Edition, John Wiley, New York, 2001.
[3] P. Arabie, L. J. Hubert, and G. De Soete, Clustering and Classification, World Scientific, New York, 1996.
[4] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, New York, 1989.
[5] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, New York, 1992.
[6] M. Mitchell, An Introduction to Genetic Algorithms, The MIT Press, London, England, 1996.
[7] M. Gen and R. Cheng, Genetic Algorithms & Engineering Design, John Wiley & Sons, New York, 1997.
[8] V. V. Raghavan and K. Birchand, “A clustering strategy based on a formalism of the reproductive process in a natural system”, Proceedings of the 2nd International Conference on Information Storage and Retrieval, pp. 10-22, 1979.
[9] D. Jones and M. A. Beltramo, “Solving partitioning problems with genetic algorithms”, Proceedings of the 4th International Conference on Genetic Algorithms, pp. 442-449, 1991.
[10] G. P. Babu and M. N. Murty, “Clustering with evolution strategies”, Pattern Recognition, Vol. 27, pp. 321-329, 1994.
[11] U. Maulik and S. Bandyopadhyay, “Genetic algorithm-based clustering technique”, Pattern Recognition, Vol. 33, pp. 1455-1465, 2000.
[12] S. Bandyopadhyay and U. Maulik, “Genetic clustering for automatic evolution of clusters and application to image classification”, Pattern Recognition, Vol. 35, pp. 1197-1208, 2002.
[13] Y. Gao, Z. B. Xu, and G. Li, “Theoretical analyses, new algorithms, and some applications of genetic algorithms: A review of some recent work”, in Fuzzy Logic and Soft Computing, K. Y. Cai, G. Chen, and M. Ying (Eds.), Kluwer Academic, New York, pp. 121-134, 1999.
[14] D. L. Davies and D. W. Bouldin, “A cluster separation measure”, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 1, No. 4, pp. 224-227, 1979.
[15] J. C. Bezdek and N. R. Pal, “Some new indexes of cluster validity”, IEEE Trans. Systems, Man, and Cybernetics—Part B: Cybernetics, Vol. 28, No. 3, pp. 301-315, 1998.
[16] S. Bandyopadhyay and U. Maulik, “Nonparametric genetic clustering: comparison of validity indices”, IEEE Trans. Systems, Man, and Cybernetics—Part C: Applications and Reviews, Vol. 31, No. 1, pp. 120-125, 2001.
[17] J. H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, 1975.
[18] T. Bäck, Evolutionary Algorithms in Theory and Practice, Oxford University Press, New York, 1996.
[19] J. D. Schaffer, R. A. Caruana, L. J. Eshelman, and R. Das, “A study of control parameters affecting online performance of genetic algorithms for function optimization”, in Proc. 3rd Intl. Conf. Genetic Algorithms, J. D. Schaffer (Ed.), San Mateo, CA: Morgan Kaufmann, pp. 859-866, 1989.
[20] F. Höppner, F. Klawonn, R. Kruse, and T. Runkler, Fuzzy Cluster Analysis, John Wiley, New York, 1999.
[21] R. N. Dave and K. Bhaswan, “Adaptive fuzzy c-shells clustering and detection of ellipses”, IEEE Trans. Neural Networks, Vol. 3, No. 5, pp. 643-662, 1992.
[22] H. J. Lin, F. W. Yang, and Y. T. Kao, “Efficient clustering based on population Markov chain”, Proceedings of the IASTED International Conference on Modeling, Simulation and Optimization (ICMSO2004), pp. 117-123, 2004.
[23] A. E. Eiben, E. H. L. Aarts, and K. M. Van Hee, “Global convergence of genetic algorithms: A Markov chain analysis”, in Parallel Problem Solving from Nature, H. P. Schwefel and R. Männer (Eds.), Springer, Berlin and Heidelberg, pp. 4-12, 1991.
[24] Y. Leung, Y. Gao, and Z. B. Xu, “Degree of population diversity: A perspective on premature convergence in genetic algorithms and its Markov chain analysis”, IEEE Trans. Neural Networks, Vol. 8, No. 5, pp. 1165-1176, 1997.
[25] M. C. Su and C. H. Chou, “A modified version of the K-means algorithm with a distance based on cluster symmetry”, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 23, No. 6, pp. 674-680, 2001.
Thesis Use Authorization
  • The author agrees to grant in-library readers a royalty-free license to reproduce the print copy for academic purposes, effective 2005-06-30.
  • The author agrees to license the browse/print service for the electronic full text, effective 2005-06-30.

