§ 瀏覽學位論文書目資料
  
系統識別號 U0002-2502201316053200
DOI 10.6846/TKU.2013.01020
論文名稱(中文) 雲端運算系統中資料容錯與恢復之探討
論文名稱(英文) Investigating Data Fault Tolerance and Recovery in Cloud Computing Systems
第三語言論文名稱
校院名稱 淡江大學
系所名稱(中文) 電機工程學系碩士班
系所名稱(英文) Department of Electrical and Computer Engineering
外國學位學校名稱
外國學位學院名稱
外國學位研究所名稱
學年度 101
學期 1
出版年 102
研究生(中文) 王竣星
研究生(英文) Chung-Hsing Wang
學號 699450226
學位類別 碩士
語言別 繁體中文
第二語言別 繁體中文
口試日期 2013-01-07
論文頁數 73頁
口試委員 指導教授 - 莊博任
委員 - 李維聰
委員 - 陳省隆
關鍵字(中) 容錯率
非硬碟檢察
XOR運算
備份
關鍵字(英) Fault-tolerance
Diskless Checkpoint
Exclusive-OR
Recovery
Backup
第三語言關鍵字
學科別分類
中文摘要
在資訊科技的迅速進步下,資料傳輸的量也以倍數成長,因此資料的保存就更顯重要。當伺服器發生錯誤的時候,就可以馬上利用別台伺服器來修復遺失的資料,來確保資料的正確性。
   檢查點(checkpoint)是容錯系統中做為資料的重要備份技術,根據不同的備份方式,備份資料以及恢復時間也就有所不同,MA [11]改進了傳統neighbor-based scheme的方式,利用互助的方式提升容錯率,且容錯率將可達到 k/2k+1。DDC [12]以矩陣方式來做備份的運算,利用矩陣相乘產生備份資料,產生出一組固定矩陣M, B為原本的資料,C為產生出來的encoded checkpoint,M*B=C,當B發生錯誤,我們想要恢復B的資料時,恢復方程式如下: M -1*M*B= M -1*C,就可算出原來的矩陣B了。G.J. code [13] 的策略是為任兩台伺服器之間,都會有間隔距離,也就是伺服器的編號差,利用這些編號差,產生出partial sum restricted sequence (PSR sequence)數列,即任意兩個數字不能重複,任兩個以上的數字總和也不能等於後面任意其中一個,在這PSR的策略下,當資料發生錯誤時,就可以立即恢復錯誤伺服器的資料。
    我們所提出的方法是把六台伺服器設為一組group,並把每組group內的伺服器都給予編號,每台伺服器都會把自己的data傳給自己編號加2和加3的伺服器buffer裡,每組group都可以容忍隨意三個以下的伺服器發生錯誤,容錯率達到3/6。因為在平行處理的情況下,每組group都是同時進行的,並不會互相影響,所以不管有多少組group,運算及復原時間都跟只有一組group是一樣的,且buffer內的XOR運算只需一次。
    我們用C語言建立出模擬架構,模擬的結果證明我們所提出來的方法比其他方式的容錯成功率優異許多,時間也花得較少。同時我們利用分組的概念來改進其他論文buffer內需要大量運算的缺點,因此節省運算的時間。
英文摘要
By the rapid development of information technology, amount of data transfer also grow in multiples. Therefore, the preservation of information is even more important. It can use others processor to recovery data to ensure the correctness of the data when the processor fail. 
Checkpoint is the important technology to backup the data in fault-tolerant. According to the different backup mode, backup data and recovery time won’t be the same. MA [11] improves tradition neighbor-based scheme. Using mutual assistance to enhance fault-tolerance reach k/2k+1(k is the number of fail). DDC [12] uses matrix multiplication to backup the data. First make a square matrix M, B is the initial data. Use M multiplied by B is C (M*B=C). C is the encoded checkpoint. We can use formula M -1*M*B= M -1*C to recover B when it is fail. About G.J. code [13] method, there will be interval distance between any two processors. It’s mean the processor ID distance. Use processor ID distance to produce partial sum restricted sequence (PSR sequence). Any two numbers can’t be repeated and two or more consecutive sum can’t equal to anyone. Under the PSR strategy, we can immediately recovery the data when the processor is failed. 
We propose to put six processors to be a group, and give any processor a ID number. Each processor Pi will send the data to Pi+2, Pi+2. The number of maximum fault tolerance of each three in the group, so the fault-tolerance can reach 50%. Duo to the parallel computing, all the groups are running at the same time and they won’t influence each other. However the number of groups, the time for backup and recovery are the same with one group, and it just spend one time to do XOR. 
We use language C to establish a simulation environment. The simulation results prove that our proposed method is much better than other methods of fault-tolerance and recovery time. At the same time, we take advantage of the group-based method to improve the disadvantage of other methods requires a lot of computing.
第三語言摘要
論文目次
目 錄
第一章	緒論	1
1.1 論文簡介	1
1.2 論文架構	5
第二章 相關研究背景	6
2.1  Diskless Checkpoint 簡介	6
2.2 Mutual-Aid	7
2.2.1 Mutual-Aid based	7
2.2.2 Mutual-Aid protocol	10
2.2.3  資料恢復方式	12
2.3  Distributed Diskless checkpoint	15
2.3.1 DDC based	15
2.3.2 DDC protocol	16
2.3.3 系統資料的恢復	19
2.4 A New Diskless Checkpointing Approach for Multiple Processor Failures	20
2.4.1 G. J. code based	20
2.4.2 G. J. code protocol	22
2.4.3  PSR規則	24
2.4.4  資料的恢復方式	25
第三章 新方法	27
3.1分組容錯策略之架構	27
3.1.1 Group-based recovery method	27
3.1.2 Group-based recovery method 分組	27
3.1.3 策略修復錯誤方式	29
3.1.4  備份與修復的時間計算	31
3.1.5  Group-based 流程圖	31
3.1.6  策略分組的決定	32
3.1.7 錯誤數決定組內伺服器數	35
第四章 模擬與比較	38
4.1 各策略存活次數	38
4.2 可靠度與容錯率關係	40
4.3 伺服器的MTBF與容錯成功率	44
4.4 伺服器各別錯誤的情形	46
4.5  間隔時間下的成功累積率	54
4.6  恢復時間比較	56
4.7 增加成本評比效益	58
4.8伺服器總量的增長	62
4.9  各策略的方式比較	67
第五章 結論與未來工作	68
第六章 參考文獻	71

圖目錄
圖2. 1 Checkpoint 做法的改變[14]	7
圖2. 2早期 Diskless Checkpoint的做法-1[12]	8
圖2. 3早期 Diskless Checkpoint的做法-2[12]	8
圖2. 4早期 Diskless Checkpoint的做法-3 [12]	9
圖2. 5 Mutual-Aid的方法圖[12]	10
圖2. 6 Mutual-Aid修復動作[12]	11
圖2. 7 Mutual-Aid多伺服器的修復-1[12]	12
圖2. 8 Mutual-Aid多伺服器的修復-2[12]	13
圖2. 9 Mutual-Aid無法修復情形[12]	14
圖2. 10 DDC的伺服器內容-1	16
圖2. 11 DDC的陣列內容[11]	17
圖2. 12 DDC的伺服器內容-2	18
圖2. 13 DDC策略的pseudo code[11]	18
圖2. 14環形Checkpoint的方法[12]	21
圖2. 15 G.J. code策略分配圖[13]	22
圖2. 16錯誤分配情形[13]	23
圖2. 17正確分配情形[13]	24
圖2. 18 PSR產生的結果[13]	26
圖3. 1 Group-based策略圖	28
圖3. 2 Group-based流程圖	32
圖3. 3當每組為9台伺服器圖形	34
圖3. 4組內容錯數為4	35
圖3. 5組內連續容錯數為4	36
圖4. 1 各策略存活次數圖	39
圖4. 2可靠度與容錯成功關係圖	43
圖4. 3伺服器可靠度為變數關係圖	45
圖4. 4 各別伺服器錯誤情形	48
圖4. 5θ與mean趨勢	52
圖4. 6  伺服器錯誤的trace	53
圖4. 7各策略成功累積率比較	55
圖4. 8恢復時間比較圖	57
圖4. 9 增加buffer-1	59
圖4. 10 增加buffer效益圖	60
圖4. 11增加buffer-2	61
圖4. 12 總數及可靠度固定時關係-1	62
圖4. 13總數及可靠度固定時關係-2	64
圖4. 14合併前後容錯成功率差異	66

表目錄
表2. 1MA在不同伺服器總數及錯誤數的恢復率[12]	14
表2. 2MA策略與其他評比[12]	15
表4. 1存活次數平均值	39
表4. 2時間與可靠度	42
表4. 3可靠度與錯誤數	42
表4. 4 mean值產生的亂數	47
表4. 5 錯誤發生的時間點下的θ值	47
表4. 6表值不同間隔累積錯誤數	49
表4. 7 θ值不同間隔累積錯誤數	50
表4. 8可靠度相同mean與θ關係	51
表4. 9mean值與θ相同時的趨勢	52
表4. 10各策略與錯誤數關係	55
表4. 11錯誤數值	57
表4. 12合併前後容錯率比較	65
表4. 13各策略比較表	67
參考文獻
[1] Z. Chen, G. E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca,            and J. Dongarra, "Fault Tolerant High Performance Computing by a  Coding Approach, " Proc. ACM Symp. Principles and Practice of Parallel Programming (PPoPP’05), June 2005, pp. 213-223.
[2] T.-C. Chiueh and P. Deng, "Evaluation of Checkpoint Mechanisms for Massively Parallel Machines," Proc. IEEE Symp. Fault Tolerant Computing (FTCS’96), June 1996, pp. 370-379.
[3] L. M. Silva and J. G. Silva, "An Experimental Study about Diskless Checkpointing," Proc. Euromicro Conference (EUROMICRO’98), Aug. 1998, pp. 395-402.
[4] J.-F. Chiu and G.-M. Chiu, "Hardware-Supported Asynchronous Checkpointing Scheme," IEE Proceedings - Computers and Digital Techniques, vol. 145, no. 2, Mar. 1998, pp. 109-115.
[5] J. S. Plank, Y. Kim, and J. Dongarra. "Algorithm-based Diskless Checkpointing for Fault Tolerant Matrix Operations," Proc. IEEE Symp. Fault-Tolerant Computing (FTCS’95), June.1995, pp. 351-360.
[6] J. S. Plank and K. Li. “Faster Checkpointing with N + 1 Parity,” Proc. IEEE Symp. Fault-Tolerant Computing (FTCS’94), June 1994, pp. 288-297.
[7] J. S. Plank, K. Li, and M. A. Puening. “Diskless Checkpointing,” IEEE Trans. Parallel Distributed Systems, vol. 9, no. 10, Oct. 1998, pp. 972-986.
[8] Z. Chen and J. Dongarra, "A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing," Proc. IEEE Symp. High Assurance Systems Engineering Symposium (HASE 2008), Dec.2008, pp. 71-79.
[9] Z. Chen and J. Dongarra, "Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing," IEEE Trans. Computers, vol. 58, no. 11, Nov. 2009, pp. 1512-1524.
[10] J. S. Plank. "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems," Software – Practice & Experience, vol. 27, no. 9, Sep. 1997, pp. 995-1012.
[11] L. Bautista Gomez, N. Maruyama, F. Cappello, S. Matsuoka, “Distributed Diskless Checkpoint for large scale systems”, IEEE/ACM International Symposium on Cluster, Cloud and Grid computing (CCGrid2010),Melbourne, Australia, May 2010.
[12] J.-F. Chiu and W.-H. Hao, "Mutual-Aid: Diskless Checkpointing Scheme for Tolerating Double Faults," Proc. IEEE Symp. High Performance Computing and Communications ( HPCC'08), Sep. 2008, pp.540-547.
[13] G.M. Chiu and J.F. Chiu, A New Diskless Checkpointing Approach for
     Multiple Processor Failures. IEEE Transactions on Dependable and
     Secure Computing 8(4), 481–493 (2011).
[14] J. Plank, K. Li, M. A. Puening, Diskless Checkpointing, IEEE       
     Transactions on Parallel and Distributed Systems, v.9 n.10, October 1998, p.972-986.
[15]Zizhong Chen, and et 12. , “Testing and fault tolerance: Fault tolerant high performance computing by a coding approach”, Proceedings of the tenth ACM SIGPLAN symposium on PPoPP '05, pp. 93 – 103.
[16]Tzi-Cker Chiueh and Peitao Deng; “Evaluation of checkpoint mechanisms for massively parallel machines”, Proceedings of Annual Symposium on Fault Tolerant Computing, 1996., 25-27 June 1996, pp. 370 – 379.
[17]Jane-Ferng Chiu and Ge-Ming Chiu, “Hardwaresupported asynchronous checkpointing scheme”, Computers and Digital Techniques, IEE Proceedings, Volume 145, Issue 2, March 1998, pp. 109–115.
[18]J.S. Plank. "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems,"Software – Practice & Experience, vol. 27, no. 9, Sep. 1997, pp. 995-1012.   
[19]IDS, Raytheon and Keyport, WA, ‘‘How to estimate and use MTTF/MTBF would the real MTBF please stand up?” Annual reliability and maintainability symposium, 2009, pp.353-359.
論文全文使用權限
校內
紙本論文於授權書繳交後1年公開
同意電子論文全文授權校園內公開
校內電子論文於授權書繳交後1年公開
校外
同意授權
校外電子論文於授權書繳交後1年公開

如有問題,歡迎洽詢!
圖書館數位資訊組 (02)2621-5656 轉 2487 或 來信