System ID | U0002-0109202323101600 |
---|---|
DOI | 10.6846/tku202300627 |
Title (Chinese) | 以多模型深度學習網路協同進行手語教學與評分 |
Title (English) | Sign Language Teaching and Scoring System Based on the Collaboration of Multi-model Deep Learning Networks |
Title (third language) | |
University | Tamkang University (淡江大學) |
Department (Chinese) | 資訊工程學系碩士班 |
Department (English) | Department of Computer Science and Information Engineering |
Foreign degree school name | |
Foreign degree college name | |
Foreign degree institute name | |
Academic year | 111 (ROC calendar) |
Semester | 2 |
Year of publication | 112 (2023) |
Student (Chinese) | 林思瀚 |
Student (English) | Sze-Hon Lam |
Student ID | 610416017 |
Degree | Master's |
Language | Traditional Chinese |
Second language | |
Defense date | 2023-06-09 |
Number of pages | 78 |
Committee | Advisor - 郭經華 (chkuo@mail.tku.edu.tw); Committee member - 廖文華; Committee member - 張志勇; Committee member - 游國忠 |
Keywords (Chinese) | 長短期記憶, 孿生神經網路, ConvLSTM, 動作比對, 動作辨識, 臺灣手語, 適性教學, MediaPipe, 人體關鍵點辨識 |
Keywords (English) | LSTM, Siamese Network, ConvLSTM, Human Action Comparison, Human Action Recognition, Taiwanese Sign Language, Adaptive Teaching, MediaPipe, Keypoint Detection |
Keywords (third language) | |
Subject classification | |
Abstract (Chinese) |
The Convention on the Rights of Persons with Disabilities (CRPD) emphasizes the purpose "to promote, protect and ensure the full and equal enjoyment of all human rights and fundamental freedoms by all persons with disabilities, and to promote respect for their inherent dignity." Deaf and hard-of-hearing people whose primary language is sign language cannot communicate directly with hearing people (people with typical hearing and speech); the resulting language barrier obstructs everyday communication, makes mutual understanding and respect difficult, and limits the realization of an inclusive society. The difficulty of learning sign language and its limited reach stand as a high wall in front of this goal.

At the same time, the draft revision of Taiwan's 12-year basic education curriculum adds Taiwanese Sign Language, together with other local Taiwanese languages, to the compulsory language courses for seventh and eighth graders (each student must take one of three tracks: local languages, Taiwanese Sign Language, or the languages of new immigrants), while ninth graders and senior high school students may take Taiwanese Sign Language as an elective. This policy creates favorable conditions for popularizing Taiwanese Sign Language and also underscores the importance of raising the quality of its instruction.

As AI applications in image processing and natural language processing mature, this thesis uses human keypoint detection, action comparison, and action recognition to build a Taiwanese Sign Language vocabulary teaching system that is accessible anytime and anywhere, unconstrained by device capability or the availability of instructors, in order to promote Taiwanese Sign Language to a wider audience, lay the groundwork for smooth communication and mutual understanding between hearing and deaf or hard-of-hearing people, break down barriers, and realize an inclusive society.

The system designed in this thesis uses human keypoint detection to obtain the time series of keypoints in a user's Taiwanese Sign Language sign and scores the sign with a self-trained scoring model. It then applies a self-trained two-stage sign classification model and a purpose-built fine-grained error-reminder algorithm to locate the mistakes the user makes while learning the sign, producing personalized learning suggestions and a personal review list. The teaching system itself is web-based, so users can access teaching resources for Taiwanese Sign Language signs at any time and review their own learning progress and results.

Experimental results show that the methods used by this system can effectively compare the similarity of Taiwanese Sign Language signs and classify videos of Taiwanese Sign Language sign movements. |
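The abstract describes obtaining a time series of the user's keypoints through human keypoint detection before any scoring takes place. The thesis builds on MediaPipe (Section 3-1-1), but the extraction code itself is not given here; the following is a minimal sketch assuming MediaPipe's Python Holistic solution and OpenCV, with zero-filling for missed detections where the thesis instead applies its own missing-value imputation algorithm (Fig. 10).

```python
# Minimal sketch: turning a sign video into a keypoint time series.
# Assumes the "mediapipe" and "opencv-python" packages; the Holistic
# solution yields 33 pose and 21-per-hand landmarks for each frame.
import cv2
import mediapipe as mp
import numpy as np

def extract_keypoint_sequence(video_path):
    """Return an array of shape (num_frames, 225): 75 landmarks x (x, y, z)."""
    holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        row = []
        for landmarks, n in ((results.pose_landmarks, 33),
                             (results.left_hand_landmarks, 21),
                             (results.right_hand_landmarks, 21)):
            if landmarks is not None:
                row.extend(c for p in landmarks.landmark for c in (p.x, p.y, p.z))
            else:
                # Zero-fill missed detections here; the thesis instead applies
                # a dedicated missing-value imputation algorithm (Fig. 10).
                row.extend([0.0] * (n * 3))
        frames.append(row)
    cap.release()
    holistic.close()
    return np.asarray(frames, dtype=np.float32)
```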
Abstract (English) |
The Convention on the Rights of Persons with Disabilities states that its purpose is "to promote, protect and ensure the full and equal enjoyment of all human rights and fundamental freedoms by all persons with disabilities, and to promote respect for their inherent dignity." However, deaf individuals who primarily communicate through sign language encounter communication barriers when interacting with hearing individuals. These barriers hinder mutual understanding, respect, and the realization of an inclusive society. Moreover, the complexity and limited availability of sign language education pose significant obstacles to achieving this goal.

The proposed 12-year national education revision in Taiwan includes mandatory language courses for first- and second-year junior high school students. These courses encompass Taiwan Sign Language along with other indigenous languages (students choose among indigenous languages, Taiwan Sign Language, and the languages of new immigrants). Additionally, third-year junior high and senior high school students are given the option to study Taiwan Sign Language. This policy aims to popularize Taiwan Sign Language while emphasizing the importance of enhancing its instructional quality.

Applications of AI technology in image processing and natural language processing have made significant advances. This paper develops a teaching and scoring system for Taiwan Sign Language that can be accessed anytime and anywhere without being constrained by device capabilities or the availability of qualified instructors. The primary objective is to promote and popularize Taiwan Sign Language among a broader audience, fostering smooth communication and mutual understanding between individuals with hearing impairments and those without. By breaking down barriers, this system strives toward an inclusive society.

The system designed in this paper employs several techniques to capture and evaluate users' Taiwan Sign Language gestures. It uses human keypoint detection to capture temporal sequences of hand movements, a self-trained sign language scoring model to assess gesture performance, and a self-trained two-stage gesture classification model together with a fine-grained error correction algorithm to identify users' errors. The system also provides personalized learning recommendations and review lists through a web-based platform that lets users access teaching resources anytime and track their progress and achievements.

Experimental results indicate that the methods employed in this paper's system effectively perform similarity matching of Taiwan Sign Language signs and classification of the gestures associated with them. |
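The scoring model is described as a Siamese network with LSTM branches (Sections 3-2-1 and 3-2-2) that compares a learner's keypoint sequence against a reference performance. The two-layer depth, hidden size, Euclidean distance, and fixed threshold below are illustrative assumptions, not the thesis' tuned values (its depth and decision threshold are chosen experimentally, Figs. 31-33); a sketch in PyTorch:

```python
# Illustrative sketch of a Siamese scoring model over keypoint sequences.
# Layer count, hidden size, distance metric, and threshold are assumptions.
import torch
import torch.nn as nn

class SiameseLSTMScorer(nn.Module):
    def __init__(self, feature_dim=225, hidden_dim=128, num_layers=2):
        super().__init__()
        # One shared LSTM encoder: both branches use identical weights.
        self.encoder = nn.LSTM(feature_dim, hidden_dim,
                               num_layers=num_layers, batch_first=True)

    def embed(self, seq):
        # seq: (batch, time, feature_dim); keep the last layer's final state.
        _, (h_n, _) = self.encoder(seq)
        return h_n[-1]

    def forward(self, learner_seq, reference_seq):
        # Embedding distance acts as the (dis)similarity score.
        return torch.norm(self.embed(learner_seq) - self.embed(reference_seq),
                          dim=1)

# Usage: a distance below the tuned threshold counts the sign as correct.
scorer = SiameseLSTMScorer()
distance = scorer(torch.randn(1, 60, 225), torch.randn(1, 60, 225))
is_correct = distance < 1.0  # placeholder threshold, tuned in practice
```

The 225-dimensional input assumes the 33 pose plus two sets of 21 hand landmarks from the keypoint sketch above, three coordinates each; training such a network typically uses a contrastive objective over pairs of same-sign and different-sign sequences.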
Abstract (third language) | |
Table of contents |
Contents
Contents VI
List of Figures VIII
List of Tables XI
Chapter 1 Introduction 1
Chapter 2 Related Work 6
2-1 Input Preprocessing 6
2-2 Action Similarity Comparison 6
2-3 Action Recognition 7
2-4 Sign Language Recognition 8
Chapter 3 Background 11
3-1 Human Keypoint Detection 11
3-1-1 MediaPipe 11
3-2 Scoring Taiwanese Sign Language Signs 14
3-2-1 Siamese Neural Networks 14
3-2-2 Long Short-Term Memory 15
3-3 Confusion Reminders for Taiwanese Sign Language Signs 16
3-3-1 Classification Models 16
3-3-2 ConvLSTM 16
3-3-3 Dynamic Time Warping 17
Chapter 4 System Architecture 19
4-1 Scenario and Problem Description 19
4-2 Goals 19
4-3 System Architecture 20
Chapter 5 Technical Details 33
Chapter 6 Experiments and Analysis 54
6-1 Experimental Environment 54
6-1-1 Training and Experimental Environment for the Deep Learning Models 54
6-1-2 Development Environment for the Teaching Platform of This System 54
6-2 Dataset 55
6-3 Experimental Design and Results 55
Chapter 7 Conclusion 76
References 77

List of Figures
Fig. 1 MediaPipe face detection 12
Fig. 2 Face mesh keypoints 13
Fig. 3 Body keypoints (from the official MediaPipe website) 13
Fig. 4 Siamese neural network architecture 14
Fig. 5 LSTM cell architecture 15
Fig. 6 ConvLSTM architecture (from [18]) 17
Fig. 7 The AI-based methods of this system 20
Fig. 8 System architecture flowchart 21
Fig. 9 Human keypoints (from the official MediaPipe website) 24
Fig. 10 Missing-value imputation algorithm 24
Fig. 11 Temporal normalization steps 25
Fig. 12 Abnormal viewing angle 26
Fig. 13 Sliding-window sampling algorithm 27
Fig. 14 Two-stage ConvLSTM sign classification method 29
Fig. 15 Fine-grained errors 29
Fig. 16 Screenshot of the learning-history page 32
Fig. 17 Data augmentation algorithm 35
Fig. 18 Labeling scheme for the sign scoring model dataset 36
Fig. 19 Labeling scheme for incomplete signs 36
Fig. 20 Preprocessing stage flowchart 37
Fig. 21 Method for confirming the frame leaves enough room to perform a sign 38
Fig. 22 Alignment algorithm 42
Fig. 23 Sign scoring model - training phase 44
Fig. 24 Sign scoring model - inference phase 46
Fig. 25 Coarse sign classification model - training phase 47
Fig. 26 Fine sign classification model - training phase 48
Fig. 27 Training set of the fine sign classification model 49
Fig. 28 Two-stage sign classification models - inference phase 50
Fig. 29 Splitting of S_sampling 51
Fig. 30 Effect of the sampling interval (in time steps) on system performance 58
Fig. 31 Scoring model depth and threshold experiments - precision 59
Fig. 32 Scoring model depth and threshold experiments - recall 60
Fig. 33 Scoring model depth and threshold experiments - F1 score 60
Fig. 34 ROC curve analysis of the sign scoring task 62
Fig. 35 Effect of training set size and class count on the scoring task - precision 63
Fig. 36 Effect of training set size and class count on the scoring task - recall 64
Fig. 37 Effect of training set size and class count on the scoring task - F1 score 65
Fig. 38 Confusion matrix feature maps of the coarse classification model 70
Fig. 39 Effect of training set size and class count on the classification task - precision 73
Fig. 40 Effect of training set size and class count on the classification task - recall 74
Fig. 41 Effect of training set size and class count on the classification task - F1 score 75

List of Tables
Table 1 Comparison of related work 10
Table 2 Comparison of human keypoint detection formats 10
Table 3 Experimental environment and hardware 54
Table 4 Dataset distribution 55
Table 5 Definition of the basic confusion matrix 56
Table 6 Definition of the multi-class confusion matrix (class 1 as example) 66
Table 7 Definition of the multi-class confusion matrix (class 2 as example) 67
Table 8 Depth experiment results for the sign classification model 69
Table 9 Coarse classification model - per-class metrics 71
Table 10 Coarse classification model - overall metrics 71
Table 11 Performance gains of the fine classification model on designated classes 72 |
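Among the background topics above, Section 3-3-3 covers dynamic time warping (DTW), used to align two signing sequences of different lengths before frame-wise comparison (cf. the alignment algorithm of Fig. 22). As a reference point only, here is a plain NumPy sketch of textbook DTW [19] with path backtracking; the thesis' alignment procedure may use a different frame cost or add constraints:

```python
# Sketch of classic dynamic time warping over two keypoint sequences.
import numpy as np

def dtw_align(a, b):
    """a: (n, d) and b: (m, d) sequences; returns total cost and warp path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-wise distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    # Backtrack from (n, m) to recover the optimal alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]
```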
References |
[1] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291-7299.
[2] T. Simon, H. Joo, I. Matthews, and Y. Sheikh, "Hand keypoint detection in single images using multiview bootstrapping," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1145-1153.
[3] C. Lugaresi et al., "MediaPipe: A framework for building perception pipelines," arXiv preprint arXiv:1906.08172, 2019.
[4] F. Zhang et al., "MediaPipe Hands: On-device real-time hand tracking," arXiv preprint arXiv:2006.10214, 2020.
[5] A. Ullah, K. Muhammad, K. Haydarov, I. U. Haq, M. Lee, and S. W. Baik, "One-shot learning for surveillance anomaly recognition using Siamese 3D CNN," in 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1-8.
[6] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[7] C. Dai, X. Liu, and J. Lai, "Human action recognition using two-stream attention based LSTM networks," Applied Soft Computing, vol. 86, p. 105820, 2020.
[8] A. Mittal, P. Kumar, P. P. Roy, R. Balasubramanian, and B. B. Chaudhuri, "A modified LSTM model for continuous sign language recognition using leap motion," IEEE Sensors Journal, vol. 19, no. 16, pp. 7056-7063, 2019.
[9] N. Heidari and A. Iosifidis, "Temporal attention-augmented graph convolutional network for efficient skeleton-based human action recognition," in 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, 2021, pp. 7907-7914.
[10] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[11] Y.-F. Song, Z. Zhang, C. Shan, and L. Wang, "Constructing stronger and faster baselines for skeleton-based action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 2, pp. 1474-1488, 2022.
[12] A. Sanchez-Caballero, D. Fuentes-Jimenez, and C. Losada-Gutiérrez, "Exploiting the ConvLSTM: Human action recognition using raw depth video-based recurrent neural networks," arXiv preprint arXiv:2006.07744, 2020.
[13] P. Zhang, C. Lan, W. Zeng, J. Xing, J. Xue, and N. Zheng, "Semantics-guided neural networks for efficient skeleton-based human action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1112-1121.
[14] M. M. Varghese, S. Ramesh, S. Kadham, V. Dhruthi, and P. Kanwal, "Real-time fitness activity recognition and correction using deep neural networks," in 2023 57th Annual Conference on Information Sciences and Systems (CISS), IEEE, 2023, pp. 1-6.
[15] C.-B. Lin, Z. Dong, W.-K. Kuan, and Y.-F. Huang, "A framework for fall detection based on OpenPose skeleton and LSTM/GRU models," Applied Sciences, vol. 11, no. 1, p. 329, 2020.
[16] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian, "Symbiotic graph neural networks for 3D skeleton-based human action recognition and motion prediction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 6, pp. 3316-3333, 2021.
[17] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng, "View adaptive neural networks for high performance skeleton-based human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1963-1978, 2019.
[18] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," Advances in Neural Information Processing Systems, vol. 28, 2015.
[19] M. Müller, "Dynamic time warping," in Information Retrieval for Music and Motion, Springer, 2007, pp. 69-84. |
Full-text access rights | |