§ 瀏覽學位論文書目資料
系統識別號 U0002-2108202412024200
DOI 10.6846/tku202400695
論文名稱(中文) 結合視覺、聽覺與語義分析於暴力行為識別的多模態研究
論文名稱(英文) A Multimodal Study on the Identification of Violent Behavior through Visual, Auditory, and Semantic Analysis
第三語言論文名稱
校院名稱 淡江大學
系所名稱(中文) 資訊工程學系碩士班
系所名稱(英文) Department of Computer Science and Information Engineering
外國學位學校名稱
外國學位學院名稱
外國學位研究所名稱
學年度 112
學期 2
出版年 113
研究生(中文) 林昆達
研究生(英文) KUN-DA LIN
學號 612410018
學位類別 碩士
語言別 繁體中文
第二語言別
口試日期 2024-07-02
論文頁數 72頁
口試委員 指導教授 - 張志勇(cychang@mail.tku.edu.tw)
口試委員 - 陳裕賢
口試委員 - 陳宗禧
關鍵字(中) 自然語言
電腦視覺
行為辨識
影片斷點切割
Transformer Encoder
多模態模型
CLIP對比學習
關鍵字(英) Natural Language
Computer Vision
Behavior Recognition
Video Breakpoint Segmentation
Transformer Encoder
Multimodal Model
CLIP Contrastive Learning
第三語言關鍵字
學科別分類
中文摘要
對於街頭異常行為的偵測而言,傳統巡邏與監控手段效率不足,增加了執法難度。隨著AI技術的進步,許多基於多模態的監控系統也隨之推出,但這些系統通常依賴滑動視窗法重複檢測當前片段,導致事件偵測的反應時間延遲。此外,街頭環境聲音雜亂,收音音質模糊,影響語音轉文字技術的效果,進而影響多模態系統的整體表現。另外,在特徵融合過程中,各模態間的時空特徵常被忽略,導致重要訊息的遺失。
針對這些問題,本論文提出了一個創新的多模態辨識系統,結合視覺、聲音和語意分析,實現「Coarse-Grained多模態孿生異常事件起點偵測」和「Fine-Grained多模態時序事件種類辨識」。在Coarse-Grained異常事件起點偵測階段,透過ResNet50和VGGish作為基礎特徵提取器,從影片幀和音頻片段中抓取特徵,並利用一維卷積神經網路整合和提煉這些特徵,進行二階段訓練。第一階段使用分類任務訓練,透過二元交叉熵損失修正模型分類結果,讓模型具備辨識正常或異常的基礎能力;第二階段將先前的多模態特徵提取器組成孿生網路架構,基於餘弦相似度損失修正模型對輸入兩組片段的相似度判斷,強化模型判斷片段相似度的能力。此階段系統比對輸入片段與異常事件特徵向量庫的相似度,對輸入影片進行有效的前處理,輸出剪輯過的精確異常事件片段,有效解決了反應時間延遲的問題。
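孿生網路第二階段所用的餘弦相似度損失,概念上可用下列簡化的 Python 程式示意(僅為概念性草稿,函式名稱與正負樣本的判斷方式為本文假設,並非論文原始實作):

```python
import math

def cosine_similarity(a, b):
    # 餘弦相似度:兩特徵向量夾角的餘弦值,範圍 [-1, 1]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_similarity_loss(a, b, is_same_class):
    """孿生網路相似度損失示意:
    同類片段對(皆異常或皆正常)的相似度應趨近 1;
    異類片段對的相似度應被壓低(此處以 0 為界)。"""
    sim = cosine_similarity(a, b)
    if is_same_class:
        return 1.0 - sim      # 同類:相似度越接近 1,損失越小
    return max(sim, 0.0)      # 異類:相似度高於 0 即產生損失
```

實務上,此類損失會作用在兩個共享權重分支輸出的高維多模態特徵向量上,並以梯度下降同時更新兩分支。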
在確定事件起始點後,系統進行Fine-Grained多模態時序事件種類辨識,透過SSIM指數識別關鍵影像幀,並利用ResNet50和Transformer增強時空特徵處理。聲音模態每秒提取MFCC頻譜圖,經VGGish和Transformer處理以增強時序特徵。針對聲音模糊的問題,採用CLIP進行圖文Zero-Shot Classification,有效整合文字模態。最終,透過Transformer融合聲音、文字及影像三種模態特徵,準確分類預定事件類型,克服了特徵融合中時空訊息遺失的挑戰。
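以 SSIM 指數挑選關鍵影像幀的流程,概念上可簡化如下(以整張灰階影像為單一視窗的全域 SSIM 近似;門檻值與函式名稱為本文假設,並非論文原始實作):

```python
def ssim_global(x, y, c1=6.5025, c2=58.5225):
    """簡化版全域 SSIM(單一視窗):以整張灰階影像的
    均值、變異數與共變異數計算;c1、c2 為 255 灰階下的標準常數。"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((p - mx) ** 2 for p in x) / n
    vy = sum((q - my) ** 2 for q in y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def select_keyframes(frames, threshold=0.9):
    """當某幀與前一個關鍵幀的 SSIM 低於門檻(畫面變化夠大)時,取為新關鍵幀。"""
    keyframes = [0]
    for i in range(1, len(frames)):
        if ssim_global(frames[keyframes[-1]], frames[i]) < threshold:
            keyframes.append(i)
    return keyframes
```

正式實作通常會採用滑動視窗版的 SSIM(如 scikit-image 的 `structural_similarity`),此處僅示意「變化大才取幀」的篩選邏輯。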
相較於其他模型,本研究採用的多模態分析方法在監控異常行為方面展現了顯著優勢,特別是在XD-Violence和UCF-Crime這兩個資料集的異常行為辨識任務上,模型在不進行額外修改的情況下,效能約提升了1.2%。
英文摘要
For the detection of abnormal behavior on the streets, traditional patrolling and monitoring methods are inefficient, increasing the difficulty of law enforcement. With advancements in AI technology, various multimodal monitoring systems have been introduced. However, these systems often rely on sliding window methods to repeatedly detect current segments, leading to delayed response times in event detection. Additionally, the noisy street environments and poor audio quality impact the effectiveness of speech-to-text technologies, subsequently affecting the overall performance of multimodal systems. Moreover, the spatiotemporal features between different modalities are often overlooked during the feature fusion process, resulting in the loss of critical information.
To address these issues, this thesis presents an innovative multimodal recognition system that integrates visual, auditory, and semantic analysis to achieve "Coarse-Grained Multimodal Twin Anomaly Event Onset Detection" and "Fine-Grained Multimodal Temporal Event Type Recognition". In the Coarse-Grained anomaly event onset detection phase, ResNet50 and VGGish serve as base feature extractors that capture features from video frames and audio segments; a one-dimensional convolutional neural network then integrates and refines these features through two-stage training. The first stage trains on a classification task, correcting the model's predictions with a binary cross-entropy loss so that the model gains a basic ability to distinguish normal from abnormal events. In the second stage, the previously trained multimodal feature extractors are combined into a Siamese network architecture, and a cosine similarity loss strengthens the model's ability to judge the similarity of two input segments. At this stage, the system compares input segments against an anomaly event feature vector library, effectively pre-processing the input videos, outputting precisely trimmed anomaly segments, and resolving the problem of delayed response times.
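As a hedged sketch of the onset-detection step above, the comparison of an input segment against the anomaly event feature vector library might look as follows (the function names, threshold value, and bank layout are illustrative assumptions, not the thesis implementation):

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def detect_onset(segment_vec, anomaly_bank, threshold=0.8):
    """Flag a segment as a candidate anomaly onset when its embedding is
    close enough (by cosine similarity) to any stored anomaly prototype
    vector in the library."""
    best = max(cosine(segment_vec, proto) for proto in anomaly_bank)
    return best >= threshold
```

In the described system the segment embeddings would come from the trained multimodal (ResNet50 + VGGish) extractor; here they are plain lists for illustration.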
After determining the event onset, the system performs Fine-Grained multimodal temporal event type recognition: key video frames are identified with the SSIM index, and spatiotemporal feature processing is enhanced with ResNet50 and a Transformer. The auditory modality extracts an MFCC spectrogram every second, which VGGish and a Transformer process to enhance temporal features. To cope with blurred or noisy audio, CLIP is employed for text-image Zero-Shot Classification, effectively integrating the text modality. Finally, a Transformer fuses the auditory, textual, and visual features and accurately classifies the predetermined event types, overcoming the loss of spatiotemporal information during feature fusion.
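The CLIP-based zero-shot step can be sketched in the same spirit: given one image embedding and one text embedding per candidate event description, pick the label whose text embedding is most similar. The vectors below are stand-ins; in the actual system they would come from CLIP's image and text encoders:

```python
import math

def _normalize(v):
    # Scale a vector to unit length so the dot product equals cosine similarity.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def zero_shot_classify(image_embedding, text_embeddings, labels):
    """CLIP-style zero-shot classification: choose the label whose text
    embedding has the highest cosine similarity with the image embedding."""
    img = _normalize(image_embedding)
    sims = [sum(a * b for a, b in zip(img, _normalize(t)))
            for t in text_embeddings]
    best = max(range(len(sims)), key=sims.__getitem__)
    return labels[best]
```

Because the label set is just a list of text prompts, new event categories can be added without retraining, which is what makes the zero-shot formulation attractive when audio transcription is unreliable.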
Compared to other models, the multimodal analysis method used in this study shows significant advantages in monitoring abnormal behavior, especially in the abnormal behavior recognition tasks on the XD-Violence and UCF-Crime datasets, where the model's performance improved by approximately 1.2% without additional modifications.
第三語言摘要
論文目次
目錄

誌謝	I
目錄	VI
圖目錄	IX
表目錄	XII
第一章 簡介	1
第二章 相關研究	7
2.1 影片異常事件的起點偵測	7
2.1.1 基於機器學習分類	7
2.1.2 基於深度學習分類	9
2.2 行為辨識	11
2.2.1 基於單模態偵測	11
2.2.2 基於多模態偵測	12
2.3 總覽	14
第三章 背景知識	16
3.1 Transformer	16
3.2 ViViT	17
3.3 Resnet50	19
3.4 VGGish	20
3.5 CLIP	22
3.6 RoBERTa	24
第四章 系統設計	26
4.1 整體架構	26
4.2 資料蒐集與前處理	27
4.2.1 資料蒐集	27
4.2.2 資料前處理	29
4.3 多模態孿生異常事件起點偵測	30
4.3.1 Multimodal Encoder	31
4.3.2 Frame-based Siamese Event Detection Network	36
4.4 多模態時序事件種類辨識	42
4.4.1 影像局部時序特徵提取	43
4.4.2 場景描述局部時序特徵提取	46
4.4.3 聲音局部時序特徵提取	51
4.4.4 多模態全局特徵融合分類	55
第五章 實驗分析	56
5.1 資料集	56
5.2 環境與系統參數設定	56
5.3 實驗結果	58
5.3.1 多模態孿生異常事件起點偵測之效能	59
5.3.2 多模態時序事件種類辨識之效能	60
第六章 結論	66
參考文獻	67

圖目錄
圖 1、系統研究架構	3
圖 2、Transformer架構	17
圖 3、Video Vision Transformer架構	19
圖 4、ResNet50網路架構	20
圖 5、VGG網路架構	22
圖 6、CLIP訓練、使用流程	23
圖 7、系統設計整體架構	27
圖 8、畫面距離擴增效果	29
圖 9、隨機空間擦除效果	30
圖 10、多模態孿生異常事件起點偵測整體目標	31
圖 11、多模態特徵提取器架構	32
圖 12、訓練資料取樣方法	34
圖 13、多模態模型訓練方法	35
圖 14、多模態特徵提取器使用期架構	36
圖 15、多模態孿生網路架構	38
圖 16、正負樣本配對方法	40
圖 17、幀級別孿生事件檢測網路架構	40
圖 18、異常事件向量資料庫建立	41
圖 19、多模態事件檢測模型使用架構	42
圖 20、本研究影片取樣方法	42
圖 21、細粒度影片級別時序事件種類辨識整體目標	43
圖 22、每秒依據SSIM取幀結果	44
圖 23、影像特徵提取過程	44
圖 24、Visual Temporal Encoder訓練期架構圖	45
圖 25、Visual Temporal Encoder使用期架構圖	46
圖 26、語意模態轉換目標	46
圖 27、Contrastive Language-Image Pre-Training架構	47
圖 28、事件集合圖組成方法	49
圖 29、場景描述文字轉換過程	49
圖 30、圖文對比學習架構	50
圖 31、圖文對比學習模型使用方法	51
圖 32、聲音取樣過程	52
圖 33、聲音特徵提取過程	53
圖 34、Audio Temporal Encoder訓練期架構圖	54
圖 35、Audio Temporal Encoder使用期架構圖	54
圖 36、全局特徵融合分類過程	55
圖 37、本研究中不同模型訓練回合之收斂結果	57
圖 38、不同Similarity Threshold在XD-Violence和UCF-Crime中效能變化	59
圖 39、XD-Violence中測試的影片序列	62
圖 40、在XD-Violence中基於CAM方法模型特徵可視化結果	63
圖 41、UCF-Crime中測試的影片序列	63
圖 42、在UCF-Crime中基於CAM方法模型特徵可視化結果	64

表目錄

表 1、相關研究比較表	15
表 2、Cross Entropy Loss之符號定義	33
表 3、Cosine Similarity Loss之符號定義	38
表 4、Contrastive Loss之符號定義	48
表 5、本研究系統實驗環境	57
表 6、本研究中模型參數設置	57
表 7、混淆矩陣表格	58
表 8、多模態孿生異常事件起點偵測模型和同性質研究做法的AP值比較	60
表 9、消融實驗結果	61
表 10、使用Cross Validation對多模態時序事件種類辨識模型AUC/AP值比較	65

參考文獻
[1]	Viet-Tuan Le, Yong-Guk Kim, "Attention-based residual autoencoder for video anomaly detection," Appl Intell 53, 3240–3254, May. 2022.
[2]	Sareer Ul Amin, Mohib Ullah, Muhammad Sajjad, Faouzi Alaya Cheikh, Mohammad Hijji, Abdulrahman Hijji, and Khan Muhammad, "EADN: An Efficient Deep Learning Model for Anomaly Detection in Videos," Mathematics 10, no. 9, April. 2022.
[3]	Sardar Waqar Khan, Qasim Hafeez, Muhammad Irfan Khalid, Roobaea Alroobaea, Saddam Hussain, Jawaid Iqbal, Jasem Almotiri, and Syed Sajid Ullah, "Anomaly Detection in Traffic Surveillance Videos Using Deep Learning," Sensors 22, no. 17, August. 2022.
[4]	Slim HAMDI, Samir BOUINDOUR, Kais LOUKIL, Hichem SNOUSSI and Mohamed ABID, "Hybrid deep learning and HOF for Anomaly Detection," International Conference on Control, Decision and Information Technologies, Paris, France, April. 2019.
[5]	Jingtao Hu, En Zhu, Siqi Wang, Xinwang Liu, Xifeng Guo, and Jianping Yin, "An Efficient and Robust Unsupervised Anomaly Detection Method Using Ensemble Random Projection in Surveillance Videos," Sensors 19, no. 19, September. 2019.
[6]	J. He, Y. Ren, L. Zhai and W. Liu, "FCC-MF: Detecting Violence in Audio-Visual Context with Frame-Wise Cluster Contrast and Modality-Stage Flooding," ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8346-8350, Seoul, Korea, 2024.
[7]	Zhixin Shu, Kiwon Yun, Dimitris Samaras, "Action Detection with Improved Dense Trajectories and Sliding Window," Computer Vision - ECCV Workshops. Computer Science(LNIP), vol. 8925, pp. 541-551, Springer, March, 2015.
[8]	L. Sun, X. Yang and C. Hu, "DSWHAR: A Dynamic Sliding Window Based Human Activity Recognition Method," 2022 IEEE Smartworld, Ubiquitous Intelligence & Computing, Scalable Computing & Communications, Digital Twin, Privacy Computing, Metaverse, Autonomous & Trusted Vehicles (SmartWorld/UIC/ScalCom/DigitalTwin/PriComp/Meta), pp. 1421-1426, Haikou, China, 2022.
[9]	V. -D. Le, T. -L. Nghiem and T. -L. Le, "Accurate continuous action and gesture recognition method based on skeleton and sliding windows techniques," 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 284-290, Taipei, Taiwan, 2023.
[10]	M. B. Shaikh, D. Chai, S. M. S. Islam and N. Akhtar, "MAiVAR: Multimodal Audio-Image and Video Action Recognizer," 2022 IEEE International Conference on Visual Communications and Image Processing (VCIP), pp. 1-5, Suzhou, China, 2022.
[11]	Ortega, Juan D. S., Mohammed Senoussaoui, Eric Granger, Marco Pedersoli, Patrick Cardinal and Alessandro Lameiras Koerich, "Multimodal Fusion with Deep Neural Networks for Audio-Video Emotion Recognition," ArXiv abs/1907.03196, 2019.
[12]	B. Yang, Q. Zhang and Z. Liu, "ICANet: A Method of Short Video Emotion Recognition Driven by Multimodal Data," 2022 2nd International Conference on Networking Systems of AI (INSAI), pp. 22-25, Shanghai, China, 2022.
[13]	Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I, "Attention is All you Need," Neural Information Processing Systems, 2017.
[14]	Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," ArXiv, abs/2010.11929, 2020.
[15]	Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lucic, M., & Schmid, C., "ViViT: A Video Vision Transformer," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 6816-6826, 2021.
[16]	Devlin, J., Chang, M., Lee, K., & Toutanova, K., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," North American Chapter of the Association for Computational Linguistics, 2019.
[17]	Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V., "RoBERTa: A Robustly Optimized BERT Pretraining Approach," ArXiv, abs/1907.11692, 2019.
[18]	He, K., Zhang, X., Ren, S., & Sun, J., "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-778, 2016.
[19]	Hershey, S., Chaudhuri, S., Ellis, D.P., Gemmeke, J.F., Jansen, A., Moore, R.C., Plakal, M., Platt, D., Saurous, R.A., Seybold, B., Slaney, M., Weiss, R.J., & Wilson, K.W, "CNN architectures for large-scale audio classification," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 131-135, 2017.
[20]	J. F. Gemmeke et al., "Audio Set: An ontology and human-labeled dataset for audio events," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776-780, New Orleans, LA, USA, 2017.
[21]	Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I, "Learning Transferable Visual Models From Natural Language Supervision," International Conference on Machine Learning, 2021.
[22]	M. M. Soliman, M. H. Kamal, M. A. El-Massih Nashed, Y. M. Mostafa, B. S. Chawky and D. Khattab, "Violence Recognition from Videos using Deep Learning Techniques," 2019 Ninth International Conference on Intelligent Computing and Information Systems (ICICIS), Cairo, Egypt, pp. 80-85, 2019.
[23]	M. Perez, A. C. Kot and A. Rocha, "Detection of Real-world Fights in Surveillance Videos," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, pp. 2662-2666, 2019.
[24]	Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang, "Not only Look, But Also Listen: Learning Multimodal Violence Detection Under Weak Supervision," In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12375, Springer, 2020.
[25]	Asifullah Khan, Zunaira Rauf, Anabia Sohail, Abdul Rehman Khan, Hifsa Asif, Aqsa Asif, and Umair Farooq, "A survey of the vision transformers and their CNN-transformer based variants," Artificial Intelligence Review, 56(Suppl 3), 2917-2970, 2023.
[26]	Zhou Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," in IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, April, 2004.
[27]	OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman and Diogo Almeida, "Gpt-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
[28]	S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, August, 1980.
[29]	M. DALLEL, V. HAVARD, D. BAUDRY and X. SAVATIER, "InHARD - Industrial Human Action Recognition Dataset in the Context of Industrial Collaborative Robotics," 2020 IEEE International Conference on Human-Machine Systems (ICHMS), Rome, Italy, 2020.
[30]	Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans and G. Carneiro, "Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 2021, pp. 4955-4966.
[31]	Rendón-Segador, F.J., Álvarez-García, J.A., Salazar-González, J.L., & Tommasi, T, "CrimeNet: Neural Structured Learning using Vision Transformer for violence detection," Neural networks : the official journal of the International Neural Network Society, 161, 318-329 , 2023.
[32]	J. Deng, W. Dong, R. Socher, L. -J. Li, Kai Li and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 248-255.
[33]	Bernhard Schölkopf, Robert Williamson, Alex Smola, John Shawe-Taylor, and John Platt, "Support vector method for novelty detection," in Proceedings of the 12th International Conference on Neural Information Processing Systems (NIPS'99), MIT Press, Cambridge, MA, USA, 1999.
[34]	M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury and L. S. Davis, "Learning Temporal Regularity in Video Sequences," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 733-742.
[35]	Waqas Sultani, Chen Chen, and Mubarak Shah, "Real-world anomaly detection in surveillance videos," in CVPR, 2018, pp. 6479–6488.
[36]	Wenfeng Pang, Wei Xie, Qianhua He, Yanxiong Li, and Jichen Yang, "Audiovisual dependency attention for violence detection in videos," IEEE TMM, 2022.
[37]	Dong-Lai Wei, Chen-Geng Liu, Yang Liu, Jing Liu, XiaoGuang Zhu, and Xin-Hua Zeng, "Look, listen and pay more attention: Fusing multi-modal information for video violence detection," in ICASSP, 2022, pp. 1980–1984.
[38]	Degardin, Bruno Manuel, "Weakly and partially supervised learning frameworks for anomaly detection," MS thesis, Universidade da Beira Interior (Portugal), 2020.
[39]	H. Lv, C. Zhou, Z. Cui, C. Xu, Y. Li and J. Yang, "Localizing Anomalies From Weakly-Labeled Videos," in IEEE Transactions on Image Processing, vol. 30, pp. 4505-4515, 2021.
[40]	Zhou, Bolei, et al, "Learning deep features for discriminative localization," Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
論文全文使用權限
國家圖書館
不同意無償授權國家圖書館
校內
校內紙本論文立即公開
電子論文全文不同意授權
校內書目立即公開
校外
不同意授權予資料庫廠商
校外書目立即公開
