System ID | U0002-0107202410294000 |
---|---|
DOI | 10.6846/tku202400408 |
Title (Chinese) | 基於深度學習的手語識別和評分系統 |
Title (English) | Deep Learning Based Sign Language Recognition and Scoring Systems |
Title (third language) | |
University | 淡江大學 (Tamkang University) |
Department (Chinese) | 資訊工程學系博士班 |
Department (English) | Department of Computer Science and Information Engineering |
Foreign degree school | |
Foreign degree college | |
Foreign degree institute | |
Academic year | 112 |
Semester | 2 |
Year of publication | 113 (2024) |
Student (Chinese name) | 莊進智 |
Student (English name) | Christopher Chuang |
ORCID | 0009-0007-0836-2742 |
Student ID | 810410018 |
Degree | Doctoral |
Language | English |
Second language | |
Defense date | 2024-06-06 |
Pages | 55 |
Committee |
Advisor - 張志勇 (cychang@mail.tku.edu.tw)
Co-advisor - 郭經華 (chkuo@mail.tku.edu.tw)
Committee members - 廖文華, 武士戎, 石貴平, 蒯思齊 |
Keywords (Chinese) |
卷積型長短期記憶網絡; 深度學習; 孿生長短期記憶網絡; 手語識別 |
Keywords (English) |
ConvLSTM; Deep Learning; Siamese LSTM; Sign language recognition |
Keywords (third language) | |
Discipline | |
Abstract (Chinese) |
According to the World Health Organization (WHO) [1], over 5% of the global population requires rehabilitation and assistance for hearing loss. Sign language is the primary means of communication for the deaf and hard-of-hearing community, but existing sign language recognition and teaching products have limited effectiveness, and current models struggle to recognize the complex semantic context of sign language and its subtle finger movements. This dissertation proposes a Sign Language Teaching and Scoring System (STSS) that combines a Siamese Long Short-Term Memory (LSTM) network for coarse-grained classification with a Convolutional LSTM (ConvLSTM) for fine-grained classification. The Siamese LSTM analyzes key point data preprocessed with temporal and spatial normalization, and quickly computes the similarity between a sample video and a standardized video dataset. The ConvLSTM then further analyzes data points whose similarity exceeds a certain threshold. Compared with other mechanisms, the proposed STSS performs strongly in precision, recall, and F1-Score. |
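The preprocessing the abstract describes (resampling each clip to a fixed length in time, then normalizing key point positions in space) can be sketched briefly. The snippet below is a minimal illustration under assumed conventions, not code from the dissertation: `TARGET_FRAMES`, the helper names, and the (frames, keypoints, 2) array layout are all hypothetical.

```python
# Minimal NumPy sketch of the temporal/spatial normalization described above.
# TARGET_FRAMES, the helper names, and the (T, K, 2) layout are assumptions.
import numpy as np

TARGET_FRAMES = 30  # hypothetical fixed length fed to the Siamese LSTM

def temporal_normalize(frames: np.ndarray) -> np.ndarray:
    """Resample a (T, K, 2) key point sequence to TARGET_FRAMES frames
    by linear interpolation, so clips of different lengths align in time."""
    t_old = np.linspace(0.0, 1.0, num=frames.shape[0])
    t_new = np.linspace(0.0, 1.0, num=TARGET_FRAMES)
    out = np.empty((TARGET_FRAMES,) + frames.shape[1:], dtype=float)
    for k in range(frames.shape[1]):        # each key point
        for d in range(frames.shape[2]):    # each coordinate (x, y)
            out[:, k, d] = np.interp(t_new, t_old, frames[:, k, d])
    return out

def spatial_normalize(frames: np.ndarray) -> np.ndarray:
    """Center each frame on its key point centroid and scale to unit spread,
    so signers at different positions and camera distances compare fairly."""
    centered = frames - frames.mean(axis=1, keepdims=True)
    scale = np.linalg.norm(centered, axis=(1, 2), keepdims=True) + 1e-8
    return centered / scale

# Toy usage: a fake 45-frame clip with 21 hand key points in (x, y)
clip = np.random.rand(45, 21, 2)
normalized = spatial_normalize(temporal_normalize(clip))
print(normalized.shape)  # (30, 21, 2)
```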
Abstract (English) |
According to the World Health Organization (WHO) [1], over 5% of the global population requires assistance for hearing loss. Sign language is the primary communication method of the deaf community, but recognition technologies are limited in their effectiveness: existing models struggle with the complex contextual relationships among sign language gestures and with recognizing subtle finger movements. This dissertation proposes a Sign Language Teaching and Scoring System (STSS) that combines coarse-grained classification using a Siamese Long Short-Term Memory (LSTM) network with fine-grained classification using a Convolutional LSTM (ConvLSTM) model. The Siamese LSTM analyzes spatially and temporally normalized key point data and quickly computes the similarity between sample and standard sign language videos. It uses an adaptive contrastive loss function that dynamically adjusts according to the similarity measure, helping the model focus on more challenging gestures that are very similar yet distinct. The ConvLSTM then conducts further analysis on data points whose similarity rises above a certain threshold. The proposed STSS is compared with other mechanisms and outperforms them in precision, recall, and F1-Score. |
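The two-stage idea the abstract outlines (a shared-weight Siamese LSTM scoring similarity under a contrastive loss, then a threshold gate deciding which pairs the ConvLSTM examines further) can be sketched as follows. This is a hedged PyTorch sketch, not the authors' implementation: the layer sizes, the margin, the 0.5 threshold, and the distance-to-similarity mapping are illustrative assumptions, and the standard contrastive loss stands in for the dissertation's adaptive variant.

```python
# Minimal PyTorch sketch of a Siamese LSTM for gesture similarity.
# All sizes and names are illustrative assumptions, not the dissertation's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseLSTM(nn.Module):
    def __init__(self, feat_dim=42, hidden=64):  # e.g. 21 keypoints x (x, y)
        super().__init__()
        # One LSTM encoder shared by both branches (the "Siamese" weight tying)
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)

    def embed(self, seq):                # seq: (batch, frames, feat_dim)
        _, (h, _) = self.encoder(seq)    # final hidden state as the embedding
        return h[-1]                     # (batch, hidden)

    def forward(self, sample, standard):
        # Euclidean distance between the two sequence embeddings
        return F.pairwise_distance(self.embed(sample), self.embed(standard))

def contrastive_loss(dist, same, margin=1.0):
    # Standard contrastive loss: pull same-gesture pairs together, push
    # different-gesture pairs at least `margin` apart. The dissertation's
    # adaptive variant additionally adjusts this by the similarity measure.
    return (same * dist.pow(2) + (1.0 - same) * F.relu(margin - dist).pow(2)).mean()

# Toy usage: 8 pairs of 30-frame normalized key point sequences
model = SiameseLSTM()
sample = torch.randn(8, 30, 42)
standard = torch.randn(8, 30, 42)
dist = model(sample, standard)
loss = contrastive_loss(dist, same=torch.randint(0, 2, (8,)).float())

# Coarse-to-fine handoff: map distance to a similarity score and send
# above-threshold pairs on for fine-grained ConvLSTM classification.
similarity = 1.0 / (1.0 + dist)
needs_fine_grained = similarity > 0.5  # hypothetical threshold
```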
Abstract (third language) | |
Table of Contents |
Outline
List of Figures
Chapter 1. Introduction
  1.1 Research Goals
  1.2 Organization of the Dissertation
Chapter 2. Related Work
  2.1 Machine Learning
    2.1.1 Hidden Markov Model (HMM)
    2.1.2 K-Nearest Neighbor (KNN)
    2.1.3 Support Vector Machine (SVM)
  2.2 Deep Learning
    2.2.1 Convolutional Neural Network (CNN)
    2.2.2 Graph Convolutional Network (GCN)
    2.2.3 Long Short-Term Memory (LSTM)
    2.2.4 Hybrid Networks
    2.2.5 Principal Component Analysis Network (PCANet)
Chapter 3. Preliminary
  3.1 MediaPipe for Key Point Recognition
  3.2 Siamese Neural Network Architecture
  3.3 Long Short-Term Memory (LSTM) Network
  3.4 Convolutional LSTM (ConvLSTM) Network
Chapter 4. Notations, Assumptions, Problem Description
  4.1 Notations and Assumptions
  4.2 Problem Description
  4.3 Objective
Chapter 5. The Proposed Sign Language Teaching and Scoring System (STSS)
  5.1 Data Preprocessing
    5.1.1 Input Video Segmentation
    5.1.2 Key Point Extraction
    5.1.3 Temporal Normalization
    5.1.4 Spatial Normalization
  5.2 Coarse-Grained Classification using Siamese LSTM Model
  5.3 Fine-Grained Classification using ConvLSTM Model
  5.4 Summary
Chapter 6. Performance Evaluation
  6.1 Dataset
  6.2 Simulation Results
  6.3 Summary
Chapter 7. Conclusion and Future Work
References

List of Figures
Fig. 3.1. Key point coordinates extracted for each hand with MediaPipe
Fig. 3.2. Key point coordinates extracted for the face and upper body
Fig. 3.3. LSTM cell structure
Fig. 3.4. Convolutional kernel operations over an image
Fig. 5.1. The architecture of the proposed STSS mechanism
Fig. 5.2. Input video segmentation process
Fig. 5.3. Architecture of the Siamese LSTM
Fig. 5.4. Architecture of the ConvLSTM
Fig. 6.1. Training set distribution
Fig. 6.2. Testing set distribution
Fig. 6.3. Impact of sampling frame interval on accuracy and average classification time
Fig. 6.4. Confusion matrix of each sign language category
Fig. 6.5. Varying threshold and layer counts in relation to recall, precision, and F1-Score
Fig. 6.6. ROC curves for the proposed STSS and TAM models
Fig. 6.7. Comparison of the proposed STSS, ML-CNN, and TAM in terms of precision, recall, and F1-Score |
References |
[1] World Health Organization, “Deafness and hearing loss,” https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss, 2024.
[2] K. Kudrinko et al., “Wearable sensor-based sign language recognition: A comprehensive review,” IEEE Reviews in Biomedical Engineering, vol. 14, pp. 82–97, 2020.
[3] L. E. Baum and T. Petrie, “Statistical inference for probabilistic functions of finite state Markov chains,” Ann. Math. Statistics, vol. 37, no. 6, pp. 1554–1563, 1966.
[4] T. Starner, J. Weaver, and A. Pentland, “Real-time American sign language recognition using desk and wearable computer based video,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 12, pp. 1371–1375, 1998.
[5] H.-L. Lou, “Implementing the Viterbi algorithm,” IEEE Signal Process. Mag., pp. 42–52, 1995.
[6] X. Liu et al., “3D skeletal gesture recognition via hidden states exploration,” IEEE Trans. Image Process., vol. 29, pp. 4583–4597, 2020.
[7] G. Fang, W. Gao, X. Chen, C. Wang, and J. Ma, “Signer-independent continuous sign language recognition based on SRN/HMM,” Proc. Int. Gesture Workshop, pp. 76–85, 2001.
[8] R.-H. Liang and M. Ouhyoung, “A real-time continuous gesture recognition system for sign language,” Proc. Third IEEE Int. Conf. Automatic Face and Gesture Recognition, Nara, Japan, pp. 558–567, 1998.
[9] N. Tubaiz, T. Shanableh, and K. Assaleh, “Glove-based continuous Arabic sign language recognition in user-dependent mode,” IEEE Trans. Human-Mach. Syst., vol. 45, no. 4, pp. 526–533, 2015.
[10] E. Alpaydin, Introduction to Machine Learning, MIT Press, 2010.
[11] J. Wu, L. Sun, and R. Jafari, “A wearable system for recognizing American sign language in real-time using IMU and surface EMG sensors,” IEEE J. Biomed. Health Informat., vol. 20, no. 5, pp. 1281–1290, 2016.
[12] W. Aly, S. Aly, and S. Almotairi, “User-independent American sign language alphabet recognition based on depth image and PCANet features,” IEEE Access, vol. 7, pp. 123138–123150, 2019.
[13] H. Luqman, “An efficient two-stream network for isolated sign language recognition using accumulative video motion,” IEEE Access, vol. 10, pp. 93785–93798, 2022.
[14] O. Koller, N. C. Camgoz, H. Ney, and R. Bowden, “Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 9, pp. 2306–2320, 2020.
[15] J. Forster, C. Schmidt, T. Hoyoux, O. Koller, U. Zelle, J. Piater, and H. Ney, “RWTH-PHOENIX-Weather: A large vocabulary sign language recognition and translation corpus,” Proc. Int. Conf. Language Resources Eval., pp. 3785–3789, 2012.
[16] K. Lin, X. Wang, L. Zhu, B. Zhang, and Y. Yang, “SKIM: Skeleton-based isolated sign language recognition with part mixing,” IEEE Trans. Multimedia, vol. 26, pp. 4271–4280, 2024.
[17] J. Huang, W. Zhou, H. Li, and W. Li, “Attention-based 3D-CNNs for large-vocabulary sign language recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 9, pp. 2822–2832, 2019.
[18] M. Al-Hammadi, G. Muhammad, W. Abdul, M. Alsulaiman, M. A. Bencherif, and M. A. Mekhtiche, “Hand gesture recognition for sign language using 3DCNN,” IEEE Access, vol. 8, pp. 79491–79509, 2020.
[19] Z. Wang et al., “Hear sign language: A real-time end-to-end sign language recognition system,” IEEE Trans. Mobile Comput., vol. 21, no. 7, pp. 2398–2410, 2022.
[20] M. A. Bencherif et al., “Arabic sign language recognition system using 2D hands and body skeleton data,” IEEE Access, vol. 9, pp. 59612–59627, 2021.
[21] H. Zhou, W. Zhou, Y. Zhou, and H. Li, “Spatial-temporal multi-cue network for sign language recognition and translation,” IEEE Trans. Multimedia, vol. 24, pp. 768–779, 2021.
[22] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Netw., vol. 18, no. 5–6, pp. 602–610, 2005.
[23] K. Cho et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” Proc. Conf. Empirical Methods Nat. Lang. Process., pp. 1724–1734, 2014.
[24] B. Fang, J. Co, and M. Zhang, “DeepASL: Enabling ubiquitous and non-intrusive word and sentence-level sign language translation,” Proc. 15th ACM Conf. Embedded Netw. Sensor Syst., pp. 1–13, 2017.
[25] E. Rakun, A. M. Arymurthy, L. Y. Stefanus, A. F. Wicaksono, and I. W. W. Wisesa, “Recognition of sign language system for Indonesian language using long short-term memory neural networks,” Adv. Sci. Lett., vol. 24, no. 2, pp. 999–1004, 2018.
[26] N. Heidari and A. Iosifidis, “Temporal attention-augmented graph convolutional network for efficient skeleton-based human action recognition,” Proc. 25th Int. Conf. Pattern Recognition, Milan, Italy, pp. 7907–7914, 2021.
[27] G. A. Prasath and K. Annapurani, “Prediction of sign language recognition based on multi layered CNN,” Multimedia Tools and Applications, vol. 82, no. 19, pp. 29649–29669, 2023.
[28] Y. Yang and D. Ramanan, “Articulated pose estimation with flexible mixtures-of-parts,” Proc. CVPR, pp. 1385–1392, 2011.
[29] M. Basavarajaiah, “6 basic things to know about Convolution,” Medium, https://medium.com/@bdhuma/6-basic-things-to-know-about-convolution-daef5e1bc411, 2019.
[30] I. Pisa et al., “Denoising autoencoders and LSTM-based artificial neural networks data processing for its application to internal model control in industrial environments—the wastewater treatment plant control case,” Sensors, vol. 20, no. 13, p. 3743, 2020.
[31] Google AI, “MediaPipe Solutions Guide: Hand Landmarker,” Google AI for Developers, https://ai.google.dev/edge/mediapipe/solutions/vision/hand_landmarker, 2024.
[32] Google AI, “MediaPipe Solutions Guide: Pose Landmarker,” Google AI for Developers, https://ai.google.dev/edge/mediapipe/solutions/vision/pose_landmarker, 2024. |
Full-text availability | |