System ID | U0002-0107202410294000 |
---|---|
DOI | 10.6846/tku202400408 |
Title (Chinese) | 基於深度學習的手語識別和評分系統 |
Title (English) | Deep Learning Based Sign Language Recognition and Scoring Systems |
Title (third language) | |
University | 淡江大學 (Tamkang University) |
Department (Chinese) | 資訊工程學系博士班 |
Department (English) | Department of Computer Science and Information Engineering |
Foreign degree school | |
Foreign degree college | |
Foreign degree institute | |
Academic year | 112 |
Semester | 2 |
Year of publication | 113 (2024) |
Student (Chinese name) | 莊進智 |
Student (English name) | Christopher Chuang |
ORCID | 0009-0007-0836-2742 |
Student ID | 810410018 |
Degree | Doctoral |
Language | English |
Second language | |
Defense date | 2024-06-06 |
Pages | 55 |
Committee |
Advisor - 張志勇 (cychang@mail.tku.edu.tw)
Co-advisor - 郭經華 (chkuo@mail.tku.edu.tw)
Committee members - 廖文華, 武士戎, 石貴平, 蒯思齊 |
Keywords (Chinese) |
卷積型長短期記憶網絡; 深度學習; 孿生長短期記憶網絡; 手語識別 |
Keywords (English) |
ConvLSTM; Deep Learning; Siamese LSTM; Sign language recognition |
Keywords (third language) | |
Discipline | |
Abstract (Chinese) |
According to the World Health Organization (WHO) [1], over 5% of the global population requires rehabilitation and assistance for hearing loss. Sign language is the primary means of communication for the deaf and hard-of-hearing community, but existing sign language recognition and teaching products have limited effectiveness, and current models struggle to recognize the complex semantic context of sign language and its subtle finger movements. This dissertation proposes a Sign Language Teaching and Scoring System (STSS) that combines a Siamese Long Short-Term Memory (LSTM) network for coarse-grained classification with a Convolutional LSTM (ConvLSTM) for fine-grained classification. The Siamese LSTM analyzes key point data preprocessed with temporal and spatial normalization, and quickly computes the similarity between a sample video and a standardized video dataset. The ConvLSTM then further analyzes data points whose similarity exceeds a certain threshold. Compared with other mechanisms, the proposed STSS performs strongly in precision, recall, and F1-Score. |
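The preprocessing the abstract describes (resampling each clip to a fixed length in time, then normalizing key point positions in space) can be sketched briefly. The snippet below is a minimal illustration under assumed conventions, not code from the dissertation: `TARGET_FRAMES`, the helper names, and the (frames, keypoints, 2) array layout are all hypothetical.

```python
# Minimal NumPy sketch of the temporal/spatial normalization described above.
# TARGET_FRAMES, the helper names, and the (T, K, 2) layout are assumptions.
import numpy as np

TARGET_FRAMES = 30  # hypothetical fixed length fed to the Siamese LSTM

def temporal_normalize(frames: np.ndarray) -> np.ndarray:
    """Resample a (T, K, 2) key point sequence to TARGET_FRAMES frames
    by linear interpolation, so clips of different lengths align in time."""
    t_old = np.linspace(0.0, 1.0, num=frames.shape[0])
    t_new = np.linspace(0.0, 1.0, num=TARGET_FRAMES)
    out = np.empty((TARGET_FRAMES,) + frames.shape[1:], dtype=float)
    for k in range(frames.shape[1]):        # each key point
        for d in range(frames.shape[2]):    # each coordinate (x, y)
            out[:, k, d] = np.interp(t_new, t_old, frames[:, k, d])
    return out

def spatial_normalize(frames: np.ndarray) -> np.ndarray:
    """Center each frame on its key point centroid and scale to unit spread,
    so signers at different positions and camera distances compare fairly."""
    centered = frames - frames.mean(axis=1, keepdims=True)
    scale = np.linalg.norm(centered, axis=(1, 2), keepdims=True) + 1e-8
    return centered / scale

# Toy usage: a fake 45-frame clip with 21 hand key points in (x, y)
clip = np.random.rand(45, 21, 2)
normalized = spatial_normalize(temporal_normalize(clip))
print(normalized.shape)  # (30, 21, 2)
```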
Abstract (English) |
According to the World Health Organization (WHO) [1], over 5% of the global population requires assistance for hearing loss. Sign language is the primary communication method of the deaf community, but recognition technologies are limited in their effectiveness: existing models struggle with the complex contextual relationships among sign language gestures and with recognizing subtle finger movements. This dissertation proposes a Sign Language Teaching and Scoring System (STSS) that combines coarse-grained classification using a Siamese Long Short-Term Memory (LSTM) network with fine-grained classification using a Convolutional LSTM (ConvLSTM) model. The Siamese LSTM analyzes spatially and temporally normalized key point data and quickly computes the similarity between sample and standard sign language videos. It uses an adaptive contrastive loss function that dynamically adjusts according to the similarity measure, helping the model focus on more challenging gestures that are very similar yet distinct. The ConvLSTM then conducts further analysis on data points whose similarity rises above a certain threshold. The proposed STSS is compared with other mechanisms and outperforms them in precision, recall, and F1-Score. |
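The two-stage idea the abstract outlines (a shared-weight Siamese LSTM scoring similarity under a contrastive loss, then a threshold gate deciding which pairs the ConvLSTM examines further) can be sketched as follows. This is a hedged PyTorch sketch, not the authors' implementation: the layer sizes, the margin, the 0.5 threshold, and the distance-to-similarity mapping are illustrative assumptions, and the standard contrastive loss stands in for the dissertation's adaptive variant.

```python
# Minimal PyTorch sketch of a Siamese LSTM for gesture similarity.
# All sizes and names are illustrative assumptions, not the dissertation's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseLSTM(nn.Module):
    def __init__(self, feat_dim=42, hidden=64):  # e.g. 21 keypoints x (x, y)
        super().__init__()
        # One LSTM encoder shared by both branches (the "Siamese" weight tying)
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)

    def embed(self, seq):                # seq: (batch, frames, feat_dim)
        _, (h, _) = self.encoder(seq)    # final hidden state as the embedding
        return h[-1]                     # (batch, hidden)

    def forward(self, sample, standard):
        # Euclidean distance between the two sequence embeddings
        return F.pairwise_distance(self.embed(sample), self.embed(standard))

def contrastive_loss(dist, same, margin=1.0):
    # Standard contrastive loss: pull same-gesture pairs together, push
    # different-gesture pairs at least `margin` apart. The dissertation's
    # adaptive variant additionally adjusts this by the similarity measure.
    return (same * dist.pow(2) + (1.0 - same) * F.relu(margin - dist).pow(2)).mean()

# Toy usage: 8 pairs of 30-frame normalized key point sequences
model = SiameseLSTM()
sample = torch.randn(8, 30, 42)
standard = torch.randn(8, 30, 42)
dist = model(sample, standard)
loss = contrastive_loss(dist, same=torch.randint(0, 2, (8,)).float())

# Coarse-to-fine handoff: map distance to a similarity score and send
# above-threshold pairs on for fine-grained ConvLSTM classification.
similarity = 1.0 / (1.0 + dist)
needs_fine_grained = similarity > 0.5  # hypothetical threshold
```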
Abstract (third language) | |
Table of Contents |
Outline
List of Figures
Chapter 1. Introduction
  1.1 Research Goals
  1.2 Organization of the Dissertation
Chapter 2. Related Work
  2.1 Machine Learning
    2.1.1 Hidden Markov Model (HMM)
    2.1.2 K-Nearest Neighbor (KNN)
    2.1.3 Support Vector Machine (SVM)
  2.2 Deep Learning
    2.2.1 Convolutional Neural Network (CNN)
    2.2.2 Graph Convolutional Network (GCN)
    2.2.3 Long Short-Term Memory (LSTM)
    2.2.4 Hybrid Networks
    2.2.5 Principal Component Analysis Network (PCANet)
Chapter 3. Preliminary
  3.1 MediaPipe for Key Point Recognition
  3.2 Siamese Neural Network Architecture
  3.3 Long Short-Term Memory (LSTM) Network
  3.4 Convolutional LSTM (ConvLSTM) Network
Chapter 4. Notations, Assumptions, Problem Description
  4.1 Notations and Assumptions
  4.2 Problem Description
  4.3 Objective
Chapter 5. The Proposed Sign Language Teaching and Scoring System (STSS)
  5.1 Data Preprocessing
    5.1.1 Input Video Segmentation
    5.1.2 Key Point Extraction
    5.1.3 Temporal Normalization
    5.1.4 Spatial Normalization
  5.2 Coarse-Grained Classification using Siamese LSTM Model
  5.3 Fine-Grained Classification using ConvLSTM Model
  5.4 Summary
Chapter 6. Performance Evaluation
  6.1 Dataset
  6.2 Simulation Results
  6.3 Summary
Chapter 7. Conclusion and Future Work
References

List of Figures
Fig. 3.1. Key point coordinates extracted for each hand with MediaPipe
Fig. 3.2. Key point coordinates extracted for the face and upper body
Fig. 3.3. LSTM cell structure
Fig. 3.4. Convolutional kernel operations over an image
Fig. 5.1. The architecture of the proposed STSS mechanism
Fig. 5.2. Input video segmentation process
Fig. 5.3. Architecture of the Siamese LSTM
Fig. 5.4. Architecture of the ConvLSTM
Fig. 6.1. Training set distribution
Fig. 6.2. Testing set distribution
Fig. 6.3. Impact of sampling frame interval on accuracy and average classification time
Fig. 6.4. Confusion matrix of each sign language category
Fig. 6.5. Varying threshold and layer counts in relation to recall, precision, and F1-Score
Fig. 6.6. ROC curves for the proposed STSS and TAM models
Fig. 6.7. Comparison of the proposed STSS, ML-CNN, and TAM in terms of precision, recall, and F1-Score |
References |
[1] World Health Organization, “Deafness and hearing loss,” https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss, 2024.
[2] K. Kudrinko et al., “Wearable sensor-based sign language recognition: A comprehensive review,” IEEE Reviews in Biomedical Engineering, vol. 14, pp. 82–97, 2020.
[3] L. E. Baum and T. Petrie, “Statistical inference for probabilistic functions of finite state Markov chains,” Ann. Math. Statistics, vol. 37, no. 6, pp. 1554–1563, 1966.
[4] T. Starner, J. Weaver, and A. Pentland, “Real-time American sign language recognition using desk and wearable computer based video,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 12, pp. 1371–1375, 1998.
[5] H.-L. Lou, “Implementing the Viterbi algorithm,” IEEE Signal Process. Mag., pp. 42–52, 1995.
[6] X. Liu et al., “3D skeletal gesture recognition via hidden states exploration,” IEEE Trans. Image Process., vol. 29, pp. 4583–4597, 2020.
[7] G. Fang, W. Gao, X. Chen, C. Wang, and J. Ma, “Signer-independent continuous sign language recognition based on SRN/HMM,” Proc. Int. Gesture Workshop, pp. 76–85, 2001.
[8] R.-H. Liang and M. Ouhyoung, “A real-time continuous gesture recognition system for sign language,” Proc. Third IEEE Int. Conf. Automatic Face and Gesture Recognition, Nara, Japan, pp. 558–567, 1998.
[9] N. Tubaiz, T. Shanableh, and K. Assaleh, “Glove-based continuous Arabic sign language recognition in user-dependent mode,” IEEE Trans. Human-Mach. Syst., vol. 45, no. 4, pp. 526–533, 2015.
[10] E. Alpaydin, Introduction to Machine Learning, MIT Press, 2010.
[11] J. Wu, L. Sun, and R. Jafari, “A wearable system for recognizing American sign language in real-time using IMU and surface EMG sensors,” IEEE J. Biomed. Health Informat., vol. 20, no. 5, pp. 1281–1290, 2016.
[12] W. Aly, S. Aly, and S. Almotairi, “User-independent American sign language alphabet recognition based on depth image and PCANet features,” IEEE Access, vol. 7, pp. 123138–123150, 2019.
[13] H. Luqman, “An efficient two-stream network for isolated sign language recognition using accumulative video motion,” IEEE Access, vol. 10, pp. 93785–93798, 2022.
[14] O. Koller, N. C. Camgoz, H. Ney, and R. Bowden, “Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 9, pp. 2306–2320, 2020.
[15] J. Forster, C. Schmidt, T. Hoyoux, O. Koller, U. Zelle, J. Piater, and H. Ney, “RWTH-PHOENIX-Weather: A large vocabulary sign language recognition and translation corpus,” Proc. Int. Conf. Language Resources Eval., pp. 3785–3789, 2012.
[16] K. Lin, X. Wang, L. Zhu, B. Zhang, and Y. Yang, “SKIM: Skeleton-based isolated sign language recognition with part mixing,” IEEE Trans. Multimedia, vol. 26, pp. 4271–4280, 2024.
[17] J. Huang, W. Zhou, H. Li, and W. Li, “Attention-based 3D-CNNs for large-vocabulary sign language recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 9, pp. 2822–2832, 2019.
[18] M. Al-Hammadi, G. Muhammad, W. Abdul, M. Alsulaiman, M. A. Bencherif, and M. A. Mekhtiche, “Hand gesture recognition for sign language using 3DCNN,” IEEE Access, vol. 8, pp. 79491–79509, 2020.
[19] Z. Wang et al., “Hear sign language: A real-time end-to-end sign language recognition system,” IEEE Trans. Mobile Comput., vol. 21, no. 7, pp. 2398–2410, 2022.
[20] M. A. Bencherif et al., “Arabic sign language recognition system using 2D hands and body skeleton data,” IEEE Access, vol. 9, pp. 59612–59627, 2021.
[21] H. Zhou, W. Zhou, Y. Zhou, and H. Li, “Spatial-temporal multi-cue network for sign language recognition and translation,” IEEE Trans. Multimedia, vol. 24, pp. 768–779, 2021.
[22] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Netw., vol. 18, no. 5–6, pp. 602–610, 2005.
[23] K. Cho et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” Proc. Conf. Empirical Methods Nat. Lang. Process., pp. 1724–1734, 2014.
[24] B. Fang, J. Co, and M. Zhang, “DeepASL: Enabling ubiquitous and non-intrusive word and sentence-level sign language translation,” Proc. 15th ACM Conf. Embedded Netw. Sensor Syst., pp. 1–13, 2017.
[25] E. Rakun, A. M. Arymurthy, L. Y. Stefanus, A. F. Wicaksono, and I. W. W. Wisesa, “Recognition of sign language system for Indonesian language using long short-term memory neural networks,” Adv. Sci. Lett., vol. 24, no. 2, pp. 999–1004, 2018.
[26] N. Heidari and A. Iosifidis, “Temporal attention-augmented graph convolutional network for efficient skeleton-based human action recognition,” Proc. 25th Int. Conf. Pattern Recognition, Milan, Italy, pp. 7907–7914, 2021.
[27] G. A. Prasath and K. Annapurani, “Prediction of sign language recognition based on multi layered CNN,” Multimedia Tools and Applications, vol. 82, no. 19, pp. 29649–29669, 2023.
[28] Y. Yang and D. Ramanan, “Articulated pose estimation with flexible mixtures-of-parts,” Proc. CVPR, pp. 1385–1392, 2011.
[29] M. Basavarajaiah, “6 basic things to know about Convolution,” Medium, https://medium.com/@bdhuma/6-basic-things-to-know-about-convolution-daef5e1bc411, 2019.
[30] I. Pisa et al., “Denoising autoencoders and LSTM-based artificial neural networks data processing for its application to internal model control in industrial environments—the wastewater treatment plant control case,” Sensors, vol. 20, no. 13, p. 3743, 2020.
[31] Google AI, “MediaPipe Solutions Guide: Hand Landmarker,” Google AI for Developers, https://ai.google.dev/edge/mediapipe/solutions/vision/hand_landmarker, 2024.
[32] Google AI, “MediaPipe Solutions Guide: Pose Landmarker,” Google AI for Developers, https://ai.google.dev/edge/mediapipe/solutions/vision/pose_landmarker, 2024. |
Full-text availability | |