§ Browse Thesis Bibliographic Record
System ID U0002-0107202410294000
DOI 10.6846/tku202400408
Title (Chinese) 基於深度學習的手語識別和評分系統
Title (English) Deep Learning Based Sign Language Recognition and Scoring Systems
University Tamkang University
Department (Chinese) 資訊工程學系博士班
Department (English) Department of Computer Science and Information Engineering
Academic Year 112 (2023–2024)
Semester 2
Year of Publication 113 (2024)
Graduate Student (Chinese) 莊進智
Graduate Student (English) Christopher Chuang
ORCID 0009-0007-0836-2742
Student ID 810410018
Degree Doctoral
Language English
Date of Oral Defense 2024-06-06
Number of Pages 55
Defense Committee Advisor - 張志勇 (cychang@mail.tku.edu.tw)
Committee Member - 廖文華
Committee Member - 武士戎
Committee Member - 石貴平
Committee Member - 蒯思齊
Co-advisor - 郭經華 (chkuo@mail.tku.edu.tw)
Keywords (Chinese) Convolutional LSTM network (卷積型長短期記憶網絡)
Deep learning (深度學習)
Siamese LSTM network (孿生長短期記憶網絡)
Sign language recognition (手語識別)
Keywords (English) ConvLSTM
Deep Learning
Siamese LSTM
Sign language recognition
Abstract (Chinese)
According to the World Health Organization (WHO) [1], more than 5% of the world's population requires rehabilitation or assistance for hearing loss. Sign language is the primary means of communication for the deaf and hard-of-hearing community, yet existing sign language recognition and teaching products are of limited effectiveness: current models struggle with the complex semantic context of sign language and with recognizing subtle finger movements. This dissertation proposes a Sign Language Teaching and Scoring System (STSS) that combines a Siamese Long Short-Term Memory (LSTM) network for coarse-grained classification with a Convolutional LSTM (ConvLSTM) for fine-grained classification. The Siamese LSTM analyzes key point data preprocessed with temporal and spatial normalization and rapidly computes the similarity between a sample video and a standard video dataset. The ConvLSTM then performs further analysis on data points whose similarity exceeds a given threshold. Compared with other mechanisms, the proposed STSS performs strongly in terms of precision, recall, and F1-score.
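The preprocessing pipeline named in the abstract (MediaPipe key point extraction followed by temporal and spatial normalization) can be illustrated with a minimal sketch. The fixed sequence length, the single-hand setting, and the wrist-centered unit scaling below are illustrative assumptions, not the dissertation's actual parameters.

```python
# A minimal sketch, assuming MediaPipe Hands for key point extraction,
# linear-interpolation resampling for temporal normalization, and
# wrist-centered unit scaling for spatial normalization.
import cv2
import numpy as np
import mediapipe as mp

def extract_keypoints(video_path):
    """Extract a (T, 21, 3) array of hand key points with MediaPipe Hands."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    with mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=1) as hands:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.multi_hand_landmarks:
                lm = result.multi_hand_landmarks[0].landmark
                frames.append([[p.x, p.y, p.z] for p in lm])
    cap.release()
    return np.asarray(frames, dtype=np.float32)

def temporal_normalize(seq, target_len=60):
    """Resample a (T, 21, 3) sequence to a fixed length by linear interpolation."""
    t_old = np.linspace(0.0, 1.0, num=len(seq))
    t_new = np.linspace(0.0, 1.0, num=target_len)
    flat = seq.reshape(len(seq), -1)                      # (T, 63)
    cols = [np.interp(t_new, t_old, flat[:, i]) for i in range(flat.shape[1])]
    return np.stack(cols, axis=1).reshape(target_len, 21, 3)

def spatial_normalize(seq):
    """Center each frame on the wrist (landmark 0) and scale to unit span."""
    centered = seq - seq[:, :1, :]                        # wrist at the origin
    scale = max(float(np.abs(centered).max()), 1e-6)      # avoid division by zero
    return centered / scale
```

Normalized sample and standard sequences of the same length and scale would then be paired as input to the Siamese LSTM.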
Abstract (English)
According to the World Health Organization (WHO) [1], over 5% of the global population requires assistance for hearing loss. Sign language is the primary communication method for the deaf community, but existing recognition technologies are limited in their effectiveness. Existing models are challenged by the complex contextual relationships of sign language gestures and by the recognition of subtle finger movements. This dissertation proposes a Sign Language Teaching and Scoring System (STSS) that combines coarse-grained classification using a Siamese Long Short-Term Memory (LSTM) network with fine-grained classification using a Convolutional LSTM (ConvLSTM) model. The Siamese LSTM analyzes spatially and temporally normalized key point data and quickly calculates the similarity between sample and standard sign language videos. It uses an adaptive contrastive loss function that adjusts dynamically according to the similarity measure, helping the model focus on more challenging gestures that are very similar yet distinct. The ConvLSTM then performs further analysis on data points whose similarity rises above a certain threshold. The proposed STSS is compared with other mechanisms and outperforms them in precision, recall, and F1-score.
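The coarse-to-fine flow described above (Siamese LSTM similarity scoring with an adaptive contrastive loss, then ConvLSTM analysis of sufficiently similar pairs) can be sketched in PyTorch. The layer sizes, the exponential weighting term, and the threshold value are illustrative assumptions, and conv_lstm stands in for the fine-grained classifier, which is not defined here.

```python
# A hedged sketch of the coarse-to-fine pipeline: a Siamese LSTM embeds
# sample and standard key point sequences, an adaptive contrastive loss
# emphasizes near-but-distinct pairs, and only pairs whose similarity
# clears a threshold are routed to the ConvLSTM stage.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseLSTM(nn.Module):
    def __init__(self, input_dim=21 * 3, hidden_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def embed(self, seq):                       # seq: (batch, frames, input_dim)
        _, (h, _) = self.encoder(seq)
        return h[-1]                            # final hidden state as the embedding

    def forward(self, sample, standard):
        # Euclidean distance between embeddings; small distance = high similarity.
        return F.pairwise_distance(self.embed(sample), self.embed(standard))

def adaptive_contrastive_loss(distance, label, margin=1.0):
    """Contrastive loss with a similarity-dependent weight (label 1 = same sign).
    The exponential weight, which grows as dissimilar pairs get closer, is an
    assumed stand-in for the dissertation's adaptive adjustment."""
    weight = 1.0 + torch.exp(-distance)
    pos = label * distance.pow(2)
    neg = (1.0 - label) * weight * F.relu(margin - distance).pow(2)
    return (pos + neg).mean()

def coarse_to_fine(siamese, conv_lstm, sample, standard, threshold=0.5):
    """Route only pairs above the (illustrative) similarity threshold to the
    fine-grained ConvLSTM classifier supplied by the caller."""
    distance = siamese(sample, standard)
    similar = distance < threshold              # low distance = high similarity
    return conv_lstm(sample[similar]) if similar.any() else None
```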
Table of Contents
Outline
Outline
List of Figures
Chapter 1. Introduction
1.1 Research Goals
1.2 Organization of the Dissertation
Chapter 2. Related Work
2.1 Machine Learning
2.1.1 Hidden Markov Model (HMM)
2.1.2 K-Nearest Neighbor (KNN)
2.1.3 Support Vector Machine (SVM)
2.2 Deep Learning
2.2.1 Convolutional Neural Network (CNN)
2.2.2 Graph Convolutional Network (GCN)
2.2.3 Long Short-Term Memory (LSTM)
2.2.4 Hybrid Networks
2.2.5 Principal Component Analysis Network (PCANet)
Chapter 3. Preliminary
3.1 MediaPipe for Key Point Recognition
3.2 Siamese Neural Network Architecture
3.3 Long Short-Term Memory (LSTM) Network
3.4 Convolutional LSTM (ConvLSTM) Network
Chapter 4. Notations, Assumptions, Problem Description
4.1 Notations and Assumptions
4.2 Problem Description
4.3 Objective
Chapter 5. The Proposed Sign Language Teaching and Scoring System (STSS)
5.1 Data Preprocessing
5.1.1 Input Video Segmentation
5.1.2 Key Point Extraction
5.1.3 Temporal Normalization
5.1.4 Spatial Normalization
5.2 Coarse-Grained Classification Using the Siamese LSTM Model
5.3 Fine-Grained Classification Using the ConvLSTM Model
5.4 Summary
Chapter 6. Performance Evaluation
6.1 Dataset
6.2 Simulation Results
6.3 Summary
Chapter 7. Conclusion and Future Work
References

List of Figures
Fig. 3.1. Key point coordinates extracted for each hand with MediaPipe
Fig. 3.2. Key point coordinates extracted for the face and upper body
Fig. 3.3. LSTM cell structure
Fig. 3.4. Convolutional kernel operations over an image
Fig. 5.1. The architecture of the proposed STSS mechanism
Fig. 5.2. Input video segmentation process
Fig. 5.3. Architecture of the Siamese LSTM
Fig. 5.4. Architecture of the ConvLSTM
Fig. 6.1. Training set distribution
Fig. 6.2. Testing set distribution
Fig. 6.3. Impact of sampling frame interval on accuracy and average classification time
Fig. 6.4. Confusion matrix for each sign language category
Fig. 6.5. Varying threshold and layer counts in relation to recall, precision, and F1-score
Fig. 6.6. ROC curves for the proposed STSS and TAM models
Fig. 6.7. Comparison of the proposed STSS, ML-CNN, and TAM in terms of precision, recall, and F1-score
References
[1]	World Health Organization, "Deafness and Hearing Loss," https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss, 2024.
[2]	K. Kudrinko et al., "Wearable sensor-based sign language recognition: A comprehensive review," IEEE Rev. Biomed. Eng., vol. 14, pp. 82–97, 2020.
[3]	L. E. Baum and T. Petrie, "Statistical inference for probabilistic functions of finite state Markov chains," Ann. Math. Statist., vol. 37, no. 6, pp. 1554–1563, 1966.
[4]	T. Starner, J. Weaver, and A. Pentland, "Real-time American sign language recognition using desk and wearable computer based video," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 12, pp. 1371–1375, 1998.
[5]	H.-L. Lou, "Implementing the Viterbi algorithm," IEEE Signal Process. Mag., pp. 42–52, 1995.
[6]	X. Liu et al., "3D skeletal gesture recognition via hidden states exploration," IEEE Trans. Image Process., vol. 29, pp. 4583–4597, 2020.
[7]	G. Fang, W. Gao, X. Chen, C. Wang, and J. Ma, "Signer-independent continuous sign language recognition based on SRN/HMM," Proc. Int. Gesture Workshop, pp. 76–85, 2001.
[8]	R.-H. Liang and M. Ouhyoung, "A real-time continuous gesture recognition system for sign language," Proc. Third IEEE Int. Conf. Automatic Face and Gesture Recognition, Nara, Japan, pp. 558–567, 1998.
[9]	N. Tubaiz, T. Shanableh, and K. Assaleh, "Glove-based continuous Arabic sign language recognition in user-dependent mode," IEEE Trans. Human-Mach. Syst., vol. 45, no. 4, pp. 526–533, 2015.
[10]	E. Alpaydin, Introduction to Machine Learning, MIT Press, 2010.
[11]	J. Wu, L. Sun, and R. Jafari, "A wearable system for recognizing American sign language in real-time using IMU and surface EMG sensors," IEEE J. Biomed. Health Inform., vol. 20, no. 5, pp. 1281–1290, 2016.
[12]	W. Aly, S. Aly, and S. Almotairi, "User-independent American sign language alphabet recognition based on depth image and PCANet features," IEEE Access, vol. 7, pp. 123138–123150, 2019.
[13]	H. Luqman, "An efficient two-stream network for isolated sign language recognition using accumulative video motion," IEEE Access, vol. 10, pp. 93785–93798, 2022.
[14]	O. Koller, N. C. Camgoz, H. Ney, and R. Bowden, "Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 9, pp. 2306–2320, 2020.
[15]	J. Forster, C. Schmidt, T. Hoyoux, O. Koller, U. Zelle, J. Piater, and H. Ney, "RWTH-PHOENIX-Weather: A large vocabulary sign language recognition and translation corpus," Proc. Int. Conf. Language Resources Eval. (LREC), pp. 3785–3789, 2012.
[16]	K. Lin, X. Wang, L. Zhu, B. Zhang, and Y. Yang, "SKIM: Skeleton-based isolated sign language recognition with part mixing," IEEE Trans. Multimedia, vol. 26, pp. 4271–4280, 2024.
[17]	J. Huang, W. Zhou, H. Li, and W. Li, "Attention-based 3D-CNNs for large-vocabulary sign language recognition," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 9, pp. 2822–2832, 2019.
[18]	M. Al-Hammadi, G. Muhammad, W. Abdul, M. Alsulaiman, M. A. Bencherif, and M. A. Mekhtiche, "Hand gesture recognition for sign language using 3DCNN," IEEE Access, vol. 8, pp. 79491–79509, 2020.
[19]	Z. Wang et al., "Hear sign language: A real-time end-to-end sign language recognition system," IEEE Trans. Mobile Comput., vol. 21, no. 7, pp. 2398–2410, 2022.
[20]	M. A. Bencherif et al., "Arabic sign language recognition system using 2D hands and body skeleton data," IEEE Access, vol. 9, pp. 59612–59627, 2021.
[21]	H. Zhou, W. Zhou, Y. Zhou, and H. Li, "Spatial-temporal multi-cue network for sign language recognition and translation," IEEE Trans. Multimedia, vol. 24, pp. 768–779, 2021.
[22]	A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Netw., vol. 18, no. 5–6, pp. 602–610, 2005.
[23]	K. Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," Proc. Conf. Empirical Methods Nat. Lang. Process. (EMNLP), pp. 1724–1734, 2014.
[24]	B. Fang, J. Co, and M. Zhang, "DeepASL: Enabling ubiquitous and non-intrusive word and sentence-level sign language translation," Proc. 15th ACM Conf. Embedded Netw. Sensor Syst. (SenSys), pp. 1–13, 2017.
[25]	E. Rakun, A. M. Arymurthy, L. Y. Stefanus, A. F. Wicaksono, and I. W. W. Wisesa, "Recognition of sign language system for Indonesian language using long short-term memory neural networks," Adv. Sci. Lett., vol. 24, no. 2, pp. 999–1004, 2018.
[26]	N. Heidari and A. Iosifidis, "Temporal attention-augmented graph convolutional network for efficient skeleton-based human action recognition," Proc. 25th Int. Conf. Pattern Recognition (ICPR), Milan, Italy, pp. 7907–7914, 2021.
[27]	G. A. Prasath and K. Annapurani, "Prediction of sign language recognition based on multi layered CNN," Multimedia Tools Appl., vol. 82, no. 19, pp. 29649–29669, 2023.
[28]	Y. Yang and D. Ramanan, "Articulated pose estimation with flexible mixtures-of-parts," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1385–1392, 2011.
[29]	M. Basavarajaiah, "6 basic things to know about convolution," Medium, https://medium.com/@bdhuma/6-basic-things-to-know-about-convolution-daef5e1bc411, 2019.
[30]	I. Pisa et al., "Denoising autoencoders and LSTM-based artificial neural networks data processing for its application to internal model control in industrial environments: The wastewater treatment plant control case," Sensors, vol. 20, no. 13, p. 3743, 2020.
[31]	Google AI, "MediaPipe Solutions Guide," Google AI for Developers, https://ai.google.dev/edge/mediapipe/solutions/vision/hand_landmarker, 2024.
[32]	Google AI, "MediaPipe Solutions Guide," Google AI for Developers, https://ai.google.dev/edge/mediapipe/solutions/vision/pose_landmarker, 2024.
Full-Text Use Authorization
National Central Library
Gratis authorization to the National Central Library: not granted
On Campus
Print copy: publicly available on campus immediately
Electronic full text: authorization not granted
Bibliographic record: publicly available on campus immediately
Off Campus
Authorization: not granted
Bibliographic record: publicly available off campus immediately
