淡江大學覺生紀念圖書館 (TKU Library)
進階搜尋


下載電子全文限經由淡江IP使用) 
系統識別號 U0002-2502201316053200
中文論文名稱 雲端運算系統中資料容錯與恢復之探討
英文論文名稱 Investigating Data Fault Tolerance and Recovery in Cloud Computing Systems
校院名稱 淡江大學
系所名稱(中) 電機工程學系碩士班
系所名稱(英) Department of Electrical Engineering
學年度 101
學期 1
出版年 102
研究生中文姓名 王竣星
研究生英文姓名 Chung-Hsing Wang
電子信箱 wchsing225@gmail.com
學號 699450226
學位類別 碩士
語文別 中文
第二語文別 中文
口試日期 2013-01-07
論文頁數 73頁
口試委員 指導教授-莊博任
委員-李維聰
委員-陳省隆
中文關鍵字 容錯率  非硬碟檢察  XOR運算  備份 
英文關鍵字 Fault-tolerance  Diskless Checkpoint  Exclusive-OR  Recovery  Backup 
學科別分類 學科別應用科學電機及電子
中文摘要 在資訊科技的迅速進步下,資料傳輸的量也以倍數成長,因此資料的保存就更顯重要。當伺服器發生錯誤的時候,就可以馬上利用別台伺服器來修復遺失的資料,來確保資料的正確性。
檢查點(checkpoint)是容錯系統中做為資料的重要備份技術,根據不同的備份方式,備份資料以及恢復時間也就有所不同,MA [11]改進了傳統neighbor-based scheme的方式,利用互助的方式提升容錯率,且容錯率將可達到 k/2k+1。DDC [12]以矩陣方式來做備份的運算,利用矩陣相乘產生備份資料,產生出一組固定矩陣M, B為原本的資料,C為產生出來的encoded checkpoint,M*B=C,當B發生錯誤,我們想要恢復B的資料時,恢復方程式如下: M -1*M*B= M -1*C,就可算出原來的矩陣B了。G.J. code [13] 的策略是為任兩台伺服器之間,都會有間隔距離,也就是伺服器的編號差,利用這些編號差,產生出partial sum restricted sequence (PSR sequence)數列,即任意兩個數字不能重複,任兩個以上的數字總和也不能等於後面任意其中一個,在這PSR的策略下,當資料發生錯誤時,就可以立即恢復錯誤伺服器的資料。
我們所提出的方法是把六台伺服器設為一組group,並把每組group內的伺服器都給予編號,每台伺服器都會把自己的data傳給自己編號加2和加3的伺服器buffer裡,每組group都可以容忍隨意三個以下的伺服器發生錯誤,容錯率達到3/6。因為在平行處理的情況下,每組group都是同時進行的,並不會互相影響,所以不管有多少組group,運算及復原時間都跟只有一組group是一樣的,且buffer內的XOR運算只需一次。
我們用C語言建立出模擬架構,模擬的結果證明我們所提出來的方法比其他方式的容錯成功率優異許多,時間也花得較少。同時我們利用分組的概念來改進其他論文buffer內需要大量運算的缺點,因此節省運算的時間。
英文摘要 By the rapid development of information technology, amount of data transfer also grow in multiples. Therefore, the preservation of information is even more important. It can use others processor to recovery data to ensure the correctness of the data when the processor fail.
Checkpoint is the important technology to backup the data in fault-tolerant. According to the different backup mode, backup data and recovery time won’t be the same. MA [11] improves tradition neighbor-based scheme. Using mutual assistance to enhance fault-tolerance reach k/2k+1(k is the number of fail). DDC [12] uses matrix multiplication to backup the data. First make a square matrix M, B is the initial data. Use M multiplied by B is C (M*B=C). C is the encoded checkpoint. We can use formula M -1*M*B= M -1*C to recover B when it is fail. About G.J. code [13] method, there will be interval distance between any two processors. It’s mean the processor ID distance. Use processor ID distance to produce partial sum restricted sequence (PSR sequence). Any two numbers can’t be repeated and two or more consecutive sum can’t equal to anyone. Under the PSR strategy, we can immediately recovery the data when the processor is failed.
We propose to put six processors to be a group, and give any processor a ID number. Each processor Pi will send the data to Pi+2, Pi+2. The number of maximum fault tolerance of each three in the group, so the fault-tolerance can reach 50%. Duo to the parallel computing, all the groups are running at the same time and they won’t influence each other. However the number of groups, the time for backup and recovery are the same with one group, and it just spend one time to do XOR.
We use language C to establish a simulation environment. The simulation results prove that our proposed method is much better than other methods of fault-tolerance and recovery time. At the same time, we take advantage of the group-based method to improve the disadvantage of other methods requires a lot of computing.
論文目次 目 錄
第一章 緒論 1
1.1 論文簡介 1
1.2 論文架構 5
第二章 相關研究背景 6
2.1 Diskless Checkpoint 簡介 6
2.2 Mutual-Aid 7
2.2.1 Mutual-Aid based 7
2.2.2 Mutual-Aid protocol 10
2.2.3 資料恢復方式 12
2.3 Distributed Diskless checkpoint 15
2.3.1 DDC based 15
2.3.2 DDC protocol 16
2.3.3 系統資料的恢復 19
2.4 A New Diskless Checkpointing Approach for Multiple Processor Failures 20
2.4.1 G. J. code based 20
2.4.2 G. J. code protocol 22
2.4.3 PSR規則 24
2.4.4 資料的恢復方式 25
第三章 新方法 27
3.1分組容錯策略之架構 27
3.1.1 Group-based recovery method 27
3.1.2 Group-based recovery method 分組 27
3.1.3 策略修復錯誤方式 29
3.1.4 備份與修復的時間計算 31
3.1.5 Group-based 流程圖 31
3.1.6 策略分組的決定 32
3.1.7 錯誤數決定組內伺服器數 35
第四章 模擬與比較 38
4.1 各策略存活次數 38
4.2 可靠度與容錯率關係 40
4.3 伺服器的MTBF與容錯成功率 44
4.4 伺服器各別錯誤的情形 46
4.5 間隔時間下的成功累積率 54
4.6 恢復時間比較 56
4.7 增加成本評比效益 58
4.8伺服器總量的增長 62
4.9 各策略的方式比較 67
第五章 結論與未來工作 68
第六章 參考文獻 71

圖目錄
圖2. 1 Checkpoint 做法的改變[14] 7
圖2. 2早期 Diskless Checkpoint的做法-1[12] 8
圖2. 3早期 Diskless Checkpoint的做法-2[12] 8
圖2. 4早期 Diskless Checkpoint的做法-3 [12] 9
圖2. 5 Mutual-Aid的方法圖[12] 10
圖2. 6 Mutual-Aid修復動作[12] 11
圖2. 7 Mutual-Aid多伺服器的修復-1[12] 12
圖2. 8 Mutual-Aid多伺服器的修復-2[12] 13
圖2. 9 Mutual-Aid無法修復情形[12] 14
圖2. 10 DDC的伺服器內容-1 16
圖2. 11 DDC的陣列內容[11] 17
圖2. 12 DDC的伺服器內容-2 18
圖2. 13 DDC策略的pseudo code[11] 18
圖2. 14環形Checkpoint的方法[12] 21
圖2. 15 G.J. code策略分配圖[13] 22
圖2. 16錯誤分配情形[13] 23
圖2. 17正確分配情形[13] 24
圖2. 18 PSR產生的結果[13] 26
圖3. 1 Group-based策略圖 28
圖3. 2 Group-based流程圖 32
圖3. 3當每組為9台伺服器圖形 34
圖3. 4組內容錯數為4 35
圖3. 5組內連續容錯數為4 36
圖4. 1 各策略存活次數圖 39
圖4. 2可靠度與容錯成功關係圖 43
圖4. 3伺服器可靠度為變數關係圖 45
圖4. 4 各別伺服器錯誤情形 48
圖4. 5θ與mean趨勢 52
圖4. 6 伺服器錯誤的trace 53
圖4. 7各策略成功累積率比較 55
圖4. 8恢復時間比較圖 57
圖4. 9 增加buffer-1 59
圖4. 10 增加buffer效益圖 60
圖4. 11增加buffer-2 61
圖4. 12 總數及可靠度固定時關係-1 62
圖4. 13總數及可靠度固定時關係-2 64
圖4. 14合併前後容錯成功率差異 66

表目錄
表2. 1MA在不同伺服器總數及錯誤數的恢復率[12] 14
表2. 2MA策略與其他評比[12] 15
表4. 1存活次數平均值 39
表4. 2時間與可靠度 42
表4. 3可靠度與錯誤數 42
表4. 4 mean值產生的亂數 47
表4. 5 錯誤發生的時間點下的θ值 47
表4. 6表值不同間隔累積錯誤數 49
表4. 7 θ值不同間隔累積錯誤數 50
表4. 8可靠度相同mean與θ關係 51
表4. 9mean值與θ相同時的趨勢 52
表4. 10各策略與錯誤數關係 55
表4. 11錯誤數值 57
表4. 12合併前後容錯率比較 65
表4. 13各策略比較表 67

參考文獻 [1] Z. Chen, G. E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, and J. Dongarra, "Fault Tolerant High Performance Computing by a Coding Approach, " Proc. ACM Symp. Principles and Practice of Parallel Programming (PPoPP’05), June 2005, pp. 213-223.
[2] T.-C. Chiueh and P. Deng, "Evaluation of Checkpoint Mechanisms for Massively Parallel Machines," Proc. IEEE Symp. Fault Tolerant Computing (FTCS’96), June 1996, pp. 370-379.
[3] L. M. Silva and J. G. Silva, "An Experimental Study about Diskless Checkpointing," Proc. Euromicro Conference (EUROMICRO’98), Aug. 1998, pp. 395-402.
[4] J.-F. Chiu and G.-M. Chiu, "Hardware-Supported Asynchronous Checkpointing Scheme," IEE Proceedings - Computers and Digital Techniques, vol. 145, no. 2, Mar. 1998, pp. 109-115.
[5] J. S. Plank, Y. Kim, and J. Dongarra. "Algorithm-based Diskless Checkpointing for Fault Tolerant Matrix Operations," Proc. IEEE Symp. Fault-Tolerant Computing (FTCS’95), June.1995, pp. 351-360.
[6] J. S. Plank and K. Li. “Faster Checkpointing with N + 1 Parity,” Proc. IEEE Symp. Fault-Tolerant Computing (FTCS’94), June 1994, pp. 288-297.
[7] J. S. Plank, K. Li, and M. A. Puening. “Diskless Checkpointing,” IEEE Trans. Parallel Distributed Systems, vol. 9, no. 10, Oct. 1998, pp. 972-986.
[8] Z. Chen and J. Dongarra, "A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing," Proc. IEEE Symp. High Assurance Systems Engineering Symposium (HASE 2008), Dec.2008, pp. 71-79.
[9] Z. Chen and J. Dongarra, "Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing," IEEE Trans. Computers, vol. 58, no. 11, Nov. 2009, pp. 1512-1524.
[10] J. S. Plank. "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems," Software – Practice & Experience, vol. 27, no. 9, Sep. 1997, pp. 995-1012.
[11] L. Bautista Gomez, N. Maruyama, F. Cappello, S. Matsuoka, “Distributed Diskless Checkpoint for large scale systems”, IEEE/ACM International Symposium on Cluster, Cloud and Grid computing (CCGrid2010),Melbourne, Australia, May 2010.
[12] J.-F. Chiu and W.-H. Hao, "Mutual-Aid: Diskless Checkpointing Scheme for Tolerating Double Faults," Proc. IEEE Symp. High Performance Computing and Communications ( HPCC'08), Sep. 2008, pp.540-547.
[13] G.M. Chiu and J.F. Chiu, A New Diskless Checkpointing Approach for
Multiple Processor Failures. IEEE Transactions on Dependable and
Secure Computing 8(4), 481–493 (2011).
[14] J. Plank, K. Li, M. A. Puening, Diskless Checkpointing, IEEE
Transactions on Parallel and Distributed Systems, v.9 n.10, October 1998, p.972-986.
[15]Zizhong Chen, and et 12. , “Testing and fault tolerance: Fault tolerant high performance computing by a coding approach”, Proceedings of the tenth ACM SIGPLAN symposium on PPoPP '05, pp. 93 – 103.
[16]Tzi-Cker Chiueh and Peitao Deng; “Evaluation of checkpoint mechanisms for massively parallel machines”, Proceedings of Annual Symposium on Fault Tolerant Computing, 1996., 25-27 June 1996, pp. 370 – 379.
[17]Jane-Ferng Chiu and Ge-Ming Chiu, “Hardwaresupported asynchronous checkpointing scheme”, Computers and Digital Techniques, IEE Proceedings, Volume 145, Issue 2, March 1998, pp. 109–115.
[18]J.S. Plank. "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems,"Software – Practice & Experience, vol. 27, no. 9, Sep. 1997, pp. 995-1012.
[19]IDS, Raytheon and Keyport, WA, ‘‘How to estimate and use MTTF/MTBF would the real MTBF please stand up?” Annual reliability and maintainability symposium, 2009, pp.353-359.
論文使用權限
  • 同意紙本無償授權給館內讀者為學術之目的重製使用,於2014-03-05公開。
  • 同意授權瀏覽/列印電子全文服務,於2014-03-05起公開。


  • 若您有任何疑問,請與我們聯絡!
    圖書館: 請來電 (02)2621-5656 轉 2281 或 來信