電子學位論文服務

§ 瀏覽學位論文書目資料

本論文電子全文於2014-03-05起於校外公開使用
本論文紙本於2014-03-05起公開使用

系統識別號	U0002-2502201316053200
DOI	10.6846/TKU.2013.01020
論文名稱(中文)	雲端運算系統中資料容錯與恢復之探討
論文名稱(英文)	Investigating Data Fault Tolerance and Recovery in Cloud Computing Systems
第三語言論文名稱
校院名稱	淡江大學
系所名稱(中文)	電機工程學系碩士班
系所名稱(英文)	Department of Electrical and Computer Engineering
外國學位學校名稱
外國學位學院名稱
外國學位研究所名稱
學年度	101
學期	1
出版年	102
研究生(中文)	王竣星
研究生(英文)	Chung-Hsing Wang
學號	699450226
學位類別	碩士
語言別	繁體中文
第二語言別	繁體中文
口試日期	2013-01-07
論文頁數	73頁
口試委員	指導教授 - 莊博任委員 - 李維聰委員 - 陳省隆
關鍵字(中)	容錯率非硬碟檢察 XOR運算備份
關鍵字(英)	Fault-tolerance Diskless Checkpoint Exclusive-OR Recovery Backup
第三語言關鍵字
學科別分類
中文摘要	在資訊科技的迅速進步下，資料傳輸的量也以倍數成長，因此資料的保存就更顯重要。當伺服器發生錯誤的時候，就可以馬上利用別台伺服器來修復遺失的資料，來確保資料的正確性。檢查點(checkpoint)是容錯系統中做為資料的重要備份技術，根據不同的備份方式，備份資料以及恢復時間也就有所不同，MA [11]改進了傳統neighbor-based scheme的方式，利用互助的方式提升容錯率，且容錯率將可達到 k/2k+1。DDC [12]以矩陣方式來做備份的運算，利用矩陣相乘產生備份資料，產生出一組固定矩陣M， B為原本的資料，C為產生出來的encoded checkpoint，MB=C，當B發生錯誤，我們想要恢復B的資料時，恢復方程式如下: M -1MB= M -1C，就可算出原來的矩陣B了。G.J. code [13] 的策略是為任兩台伺服器之間，都會有間隔距離，也就是伺服器的編號差，利用這些編號差，產生出partial sum restricted sequence (PSR sequence)數列，即任意兩個數字不能重複，任兩個以上的數字總和也不能等於後面任意其中一個，在這PSR的策略下，當資料發生錯誤時，就可以立即恢復錯誤伺服器的資料。我們所提出的方法是把六台伺服器設為一組group，並把每組group內的伺服器都給予編號，每台伺服器都會把自己的data傳給自己編號加2和加3的伺服器buffer裡，每組group都可以容忍隨意三個以下的伺服器發生錯誤，容錯率達到3/6。因為在平行處理的情況下，每組group都是同時進行的，並不會互相影響，所以不管有多少組group，運算及復原時間都跟只有一組group是一樣的，且buffer內的XOR運算只需一次。我們用C語言建立出模擬架構，模擬的結果證明我們所提出來的方法比其他方式的容錯成功率優異許多，時間也花得較少。同時我們利用分組的概念來改進其他論文buffer內需要大量運算的缺點，因此節省運算的時間。
英文摘要	By the rapid development of information technology, amount of data transfer also grow in multiples. Therefore, the preservation of information is even more important. It can use others processor to recovery data to ensure the correctness of the data when the processor fail. Checkpoint is the important technology to backup the data in fault-tolerant. According to the different backup mode, backup data and recovery time won’t be the same. MA [11] improves tradition neighbor-based scheme. Using mutual assistance to enhance fault-tolerance reach k/2k+1(k is the number of fail). DDC [12] uses matrix multiplication to backup the data. First make a square matrix M, B is the initial data. Use M multiplied by B is C (MB=C). C is the encoded checkpoint. We can use formula M -1MB= M -1C to recover B when it is fail. About G.J. code [13] method, there will be interval distance between any two processors. It’s mean the processor ID distance. Use processor ID distance to produce partial sum restricted sequence (PSR sequence). Any two numbers can’t be repeated and two or more consecutive sum can’t equal to anyone. Under the PSR strategy, we can immediately recovery the data when the processor is failed. We propose to put six processors to be a group, and give any processor a ID number. Each processor Pi will send the data to Pi+2, Pi+2. The number of maximum fault tolerance of each three in the group, so the fault-tolerance can reach 50%. Duo to the parallel computing, all the groups are running at the same time and they won’t influence each other. However the number of groups, the time for backup and recovery are the same with one group, and it just spend one time to do XOR. We use language C to establish a simulation environment. The simulation results prove that our proposed method is much better than other methods of fault-tolerance and recovery time. At the same time, we take advantage of the group-based method to improve the disadvantage of other methods requires a lot of computing.
第三語言摘要
論文目次	目錄第一章緒論 1 1.1 論文簡介 1 1.2 論文架構 5 第二章相關研究背景 6 2.1 Diskless Checkpoint 簡介 6 2.2 Mutual-Aid 7 2.2.1 Mutual-Aid based 7 2.2.2 Mutual-Aid protocol 10 2.2.3 資料恢復方式 12 2.3 Distributed Diskless checkpoint 15 2.3.1 DDC based 15 2.3.2 DDC protocol 16 2.3.3 系統資料的恢復 19 2.4 A New Diskless Checkpointing Approach for Multiple Processor Failures 20 2.4.1 G. J. code based 20 2.4.2 G. J. code protocol 22 2.4.3 PSR規則 24 2.4.4 資料的恢復方式 25 第三章新方法 27 3.1分組容錯策略之架構 27 3.1.1 Group-based recovery method 27 3.1.2 Group-based recovery method 分組 27 3.1.3 策略修復錯誤方式 29 3.1.4 備份與修復的時間計算 31 3.1.5 Group-based 流程圖 31 3.1.6 策略分組的決定 32 3.1.7 錯誤數決定組內伺服器數 35 第四章模擬與比較 38 4.1 各策略存活次數 38 4.2 可靠度與容錯率關係 40 4.3 伺服器的MTBF與容錯成功率 44 4.4 伺服器各別錯誤的情形 46 4.5 間隔時間下的成功累積率 54 4.6 恢復時間比較 56 4.7 增加成本評比效益 58 4.8伺服器總量的增長 62 4.9 各策略的方式比較 67 第五章結論與未來工作 68 第六章參考文獻 71 圖目錄圖2. 1 Checkpoint 做法的改變[14] 7 圖2. 2早期 Diskless Checkpoint的做法-1[12] 8 圖2. 3早期 Diskless Checkpoint的做法-2[12] 8 圖2. 4早期 Diskless Checkpoint的做法-3 [12] 9 圖2. 5 Mutual-Aid的方法圖[12] 10 圖2. 6 Mutual-Aid修復動作[12] 11 圖2. 7 Mutual-Aid多伺服器的修復-1[12] 12 圖2. 8 Mutual-Aid多伺服器的修復-2[12] 13 圖2. 9 Mutual-Aid無法修復情形[12] 14 圖2. 10 DDC的伺服器內容-1 16 圖2. 11 DDC的陣列內容[11] 17 圖2. 12 DDC的伺服器內容-2 18 圖2. 13 DDC策略的pseudo code[11] 18 圖2. 14環形Checkpoint的方法[12] 21 圖2. 15 G.J. code策略分配圖[13] 22 圖2. 16錯誤分配情形[13] 23 圖2. 17正確分配情形[13] 24 圖2. 18 PSR產生的結果[13] 26 圖3. 1 Group-based策略圖 28 圖3. 2 Group-based流程圖 32 圖3. 3當每組為9台伺服器圖形 34 圖3. 4組內容錯數為4 35 圖3. 5組內連續容錯數為4 36 圖4. 1 各策略存活次數圖 39 圖4. 2可靠度與容錯成功關係圖 43 圖4. 3伺服器可靠度為變數關係圖 45 圖4. 4 各別伺服器錯誤情形 48 圖4. 5θ與mean趨勢 52 圖4. 6 伺服器錯誤的trace 53 圖4. 7各策略成功累積率比較 55 圖4. 8恢復時間比較圖 57 圖4. 9 增加buffer-1 59 圖4. 10 增加buffer效益圖 60 圖4. 11增加buffer-2 61 圖4. 12 總數及可靠度固定時關係-1 62 圖4. 13總數及可靠度固定時關係-2 64 圖4. 14合併前後容錯成功率差異 66 表目錄表2. 1MA在不同伺服器總數及錯誤數的恢復率[12] 14 表2. 2MA策略與其他評比[12] 15 表4. 1存活次數平均值 39 表4. 2時間與可靠度 42 表4. 3可靠度與錯誤數 42 表4. 4 mean值產生的亂數 47 表4. 5 錯誤發生的時間點下的θ值 47 表4. 6表值不同間隔累積錯誤數 49 表4. 7 θ值不同間隔累積錯誤數 50 表4. 8可靠度相同mean與θ關係 51 表4. 9mean值與θ相同時的趨勢 52 表4. 10各策略與錯誤數關係 55 表4. 11錯誤數值 57 表4. 12合併前後容錯率比較 65 表4. 13各策略比較表 67
參考文獻	[1] Z. Chen, G. E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, and J. Dongarra, "Fault Tolerant High Performance Computing by a Coding Approach, " Proc. ACM Symp. Principles and Practice of Parallel Programming (PPoPP’05), June 2005, pp. 213-223. [2] T.-C. Chiueh and P. Deng, "Evaluation of Checkpoint Mechanisms for Massively Parallel Machines," Proc. IEEE Symp. Fault Tolerant Computing (FTCS’96), June 1996, pp. 370-379. [3] L. M. Silva and J. G. Silva, "An Experimental Study about Diskless Checkpointing," Proc. Euromicro Conference (EUROMICRO’98), Aug. 1998, pp. 395-402. [4] J.-F. Chiu and G.-M. Chiu, "Hardware-Supported Asynchronous Checkpointing Scheme," IEE Proceedings - Computers and Digital Techniques, vol. 145, no. 2, Mar. 1998, pp. 109-115. [5] J. S. Plank, Y. Kim, and J. Dongarra. "Algorithm-based Diskless Checkpointing for Fault Tolerant Matrix Operations," Proc. IEEE Symp. Fault-Tolerant Computing (FTCS’95), June.1995, pp. 351-360. [6] J. S. Plank and K. Li. “Faster Checkpointing with N + 1 Parity,” Proc. IEEE Symp. Fault-Tolerant Computing (FTCS’94), June 1994, pp. 288-297. [7] J. S. Plank, K. Li, and M. A. Puening. “Diskless Checkpointing,” IEEE Trans. Parallel Distributed Systems, vol. 9, no. 10, Oct. 1998, pp. 972-986. [8] Z. Chen and J. Dongarra, "A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing," Proc. IEEE Symp. High Assurance Systems Engineering Symposium (HASE 2008), Dec.2008, pp. 71-79. [9] Z. Chen and J. Dongarra, "Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing," IEEE Trans. Computers, vol. 58, no. 11, Nov. 2009, pp. 1512-1524. [10] J. S. Plank. "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems," Software – Practice & Experience, vol. 27, no. 9, Sep. 1997, pp. 995-1012. [11] L. Bautista Gomez, N. Maruyama, F. Cappello, S. Matsuoka, “Distributed Diskless Checkpoint for large scale systems”, IEEE/ACM International Symposium on Cluster, Cloud and Grid computing (CCGrid2010),Melbourne, Australia, May 2010. [12] J.-F. Chiu and W.-H. Hao, "Mutual-Aid: Diskless Checkpointing Scheme for Tolerating Double Faults," Proc. IEEE Symp. High Performance Computing and Communications ( HPCC'08), Sep. 2008, pp.540-547. [13] G.M. Chiu and J.F. Chiu, A New Diskless Checkpointing Approach for Multiple Processor Failures. IEEE Transactions on Dependable and Secure Computing 8(4), 481–493 (2011). [14] J. Plank, K. Li, M. A. Puening, Diskless Checkpointing, IEEE Transactions on Parallel and Distributed Systems, v.9 n.10, October 1998, p.972-986. [15]Zizhong Chen, and et 12. , “Testing and fault tolerance: Fault tolerant high performance computing by a coding approach”, Proceedings of the tenth ACM SIGPLAN symposium on PPoPP '05, pp. 93 – 103. [16]Tzi-Cker Chiueh and Peitao Deng; “Evaluation of checkpoint mechanisms for massively parallel machines”, Proceedings of Annual Symposium on Fault Tolerant Computing, 1996., 25-27 June 1996, pp. 370 – 379. [17]Jane-Ferng Chiu and Ge-Ming Chiu, “Hardwaresupported asynchronous checkpointing scheme”, Computers and Digital Techniques, IEE Proceedings, Volume 145, Issue 2, March 1998, pp. 109–115. [18]J.S. Plank. "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems,"Software – Practice & Experience, vol. 27, no. 9, Sep. 1997, pp. 995-1012. [19]IDS, Raytheon and Keyport, WA, ‘‘How to estimate and use MTTF/MTBF would the real MTBF please stand up?” Annual reliability and maintainability symposium, 2009, pp.353-359.
論文全文使用權限	校內：紙本論文於授權書繳交後1年公開同意電子論文全文授權校園內公開校內電子論文於授權書繳交後1年公開校外：同意授權校外電子論文於授權書繳交後1年公開

返回頁首

如有問題，歡迎洽詢！
圖書館數位資訊組　(02)2621-5656 轉 2487 或來信