§ Thesis Bibliographic Record
System ID: U0002-0508202523465400
DOI: 10.6846/tku202500671
Title (Chinese): 基於語音之MOOCs自動生成系統
Title (English): Automatic MOOCs Generation System Based on Speech
University: 淡江大學 (Tamkang University)
Department (Chinese): 資訊工程學系全英語碩士班
Department (English): Master's Program, Department of Computer Science and Information Engineering (English-taught program)
Academic year: 113 (ROC calendar; 2024–2025)
Semester: 2
Year of publication: 114 (ROC calendar; 2025)
Author (Chinese): 郭雅馨
Author (English): Ya-Sin Guo
Student ID: 613780112
Degree: Master's
Language: English
Date of oral defense: 2025-06-08
Number of pages: 130
Committee: Advisor - 張志勇 (cychang@mail.tku.edu.tw)
Committee member - 張榮貴
Committee member - 張義雄
Co-advisor - 武士戎 (wushihjung@mail.tku.edu.tw)
Keywords (Chinese): 知識點語意對齊
多模態講師生成
嘴型同步
教學影片自動生成
AI虛擬講師
智慧教育科技
Keywords (English): Knowledge Point Semantic Alignment
Multimodal Instructor Generation
Lip-sync
Automated Educational Video Generation
AI Virtual Instructor
Intelligent Educational Technology
Table of Contents
Table of Contents	IX
List of Figures	XIII
List of Tables	XV
Chapter 1 Introduction	1
Chapter 2 Related Work	8
2.1 Speech Recognition and Semantic Knowledge Alignment	8
2.1.1 Self-Supervised ASR and Transcription	8
2.1.2 Knowledge Point Extraction and Semantic Alignment	9
2.2 Multimodal Lip-Sync and Personalized Gesture Generation	10
2.2.1 Lip Synchronization and Audio-Visual Coordination	10
2.2.2 Gesture Generation and Personalized Instructor Modeling	10
2.3 Multimodal Structured Content Generation and Visual Coordination	11
2.3.1 Automatic Slide Generation and Structured Layout Design	12
2.3.2 Multimodal Slide Pagination and Topic-Oriented Structuring	12
2.3.3 Automated Illustration and Visual Content Generation	13
2.4 Summary	13
Chapter 3 Background Knowledge	16
3-1 Speech Preprocessing and Noise Suppression Techniques	16
3-1-1 RNNoise [59]	17
3-1-2 WebRTC Voice Activity Detection (VAD)	19
3-2 Automatic Speech Recognition (ASR) Model: OpenAI Whisper	21
3-3 Audio Feature Extraction and Speaker Identification (ECAPA-TDNN)	23
3-4 Multimodal Action Classification	25
3-4-1 Skeleton Extraction with MediaPipe	26
3-4-2 3D Convolutional Neural Network (3D-CNN) Feature Learning	27
3-4-3 Semantic Feature Fusion with BERT	29
3-5 Syllabus-Based Knowledge Point Extraction and Semantic Alignment	30
3-5-1 Sentence-BERT for Embedding and Alignment	31
3-5-2 Keyword Extraction with KeyBERT	32
3-6 Automatic Presentation Generation	33
3-6-1 BLIP-2: Multimodal Understanding and Topic Generation	34
3-6-2 CLIP: Cross-Modal Retrieval and Semantic Alignment	36
3-6-3 Stable Diffusion for Generative Image Support	37
3-7 Lip Synchronization and Voice Synthesis for Virtual Instructors	39
3-7-1 Wav2Lip: Lip Synchronization for Virtual Humans	39
3-7-2 High-Fidelity Speech Synthesis Technology	40
Chapter 4 System Design	43
4-1 Overall System Architecture	43
4-2 Data Collection and Preprocessing	45
4-2-1 Speech Preprocessing	46
4-2-2 Video Preprocessing	48
4-2-3 Syllabus Text Structuring and Preprocessing	50
4-3 Automatic Speech Recognition and Speaker Analysis	52
4-3-1 ASR Architecture and Training	53
4-3-2 Speaker Diarization and Embedding Modeling	55
4-3-3 Temporal Alignment and Multimodal Annotation Integration	56
4-4 Knowledge Point Extraction and Semantic Alignment	57
4-4-1 Syllabus Knowledge Embedding and Structuring	58
4-4-2 Semantic Alignment Strategy and Technical Implementation	59
4-4-3 Segment-Level Topic Annotation and Keyword Extraction	62
4-5 Multimodal Motion Analysis and Instructor Behavior Modeling	63
4-5-1 Skeletal Feature Extraction and Motion Intensity Labeling	63
4-5-2 Multimodal Model Training Phase	65
4-5-3 Multimodal Model Inference Phase	69
4-6 Slide and Multimedia Content Generation	70
4-6-1 Automated Generation of Text-Based Slides	71
4-6-2 Image Insertion and Semantic Alignment Process	72
4-6-3 Visual Assembly and Structured Output	76
4-7 Virtual Lecturer and Video Generation	79
4-7-1 Personalized Speech Synthesis and Alignment (F5-TTS)	80
4-7-2 Lip Synchronization: Deep Lip-Syncing with Wav2Lip	81
4-7-3 Multimodal Animation Synthesis and Motion Control	83
4-7-4 Multimedia Video Composition and Interactive Functionality	84
Chapter 5 Experimental Design and System Evaluation	88
5-1 Datasets	88
5-2 Experimental Environment and System Configuration	91
5-3 Experimental Data and Results	93
5-3-1 Evaluation Metrics and Formulas	93
5-3-2 Comparative Analysis of Keyword Extraction Methods	95
5-3-3 Multimodal Filtering and Alignment Evaluation	97
5-3-4 Ablation Study of Key Modules and Technical Contributions	98
5-3-5 Prompt Engineering Design and Alignment Performance	100
5-3-6 In-Depth Comparison of Semantic Alignment Methods	102
5-3-7 Slide Segmentation Strategy and Semantic Optimization	106
5-3-8 Structured Slide Performance and Cross-Domain Comparison	110
5-3-9 Local Deployment and Practical Comparison	111
5-3-10 Overall System Architecture and Multimodal Comparison	112
Chapter 6 Conclusion	115
6-1 Completed Work	115
6-2 Future Work	116
References	118
List of Figures
Figure 1 Research Motivation	1
Figure 2 Scenario Illustration	7
Figure 3 RNNoise Architecture [59]	19
Figure 4 WebRTC System Architecture [60]	21
Figure 5 Whisper System Architecture [61]	23
Figure 6 ECAPA-TDNN Core Design [62]	25
Figure 7 MediaPipe Keypoint Extraction [63]	27
Figure 8 3D-CNN Architecture [64]	29
Figure 9 BERT Pretraining: MLM and NSP [65]	30
Figure 10 S-BERT Training Architecture [66]	32
Figure 11 KeyBERT Keyword Extraction Workflow [67]	33
Figure 12 BLIP-2 Architecture [68]	36
Figure 13 CLIP Architecture [69]	37
Figure 14 Wav2Lip Architecture and Training Flow [71]	40
Figure 15 F5-TTS Architecture [72]	42
Figure 16 Overall System Architecture	45
Figure 17 Speech Preprocessing	47
Figure 18 Video Preprocessing	49
Figure 19 Video Sliding Window Frame Extraction	50
Figure 20 Syllabus Preprocessing	52
Figure 21 OpenAI Whisper Architecture and Training	54
Figure 22 Syllabus Knowledge Extraction and Semantic Alignment	58
Figure 23 Action Classification: Large, Medium, Small	65
Figure 24 Multimodal Model: Training Phase	66
Figure 25 Multimodal Model: Inference Phase	70
Figure 26 Structured Text to Summary and Markdown Slide Format	72
Figure 27 Automatic Image Insertion into Slides (with Material)	73
Figure 28 Automatic Image Insertion into Slides (with Material)	76
Figure 29 User Interface 1	78
Figure 30 User Interface 2	79
Figure 31 User Interface 3	79
Figure 32 Voice Cloning Architecture	81
Figure 33 Lip-Syncing Method Architecture	83
Figure 34 Text and Speed Control Interface	85
Figure 35 Video Output Showcase	87
Figure 36 3D Bar Chart of Datasets	91
Figure 37 Knowledge Point Extraction Strategies: Heatmap	97
Figure 38 Ablation Study: Bar Chart	100
Figure 39 Prompt Engineering and Knowledge Alignment: PCA Visualization	102
Figure 40 Semantic Alignment Radar Chart (LectureBank [73])	106
Figure 41 Semantic Alignment Radar Chart (PubLayNet [75])	106
Figure 42 3D Chart of Knowledge Points per Slide (LectureBank [73])	109
Figure 43 3D Chart of Knowledge Points per Slide (PubLayNet [75])	110
List of Tables
Table 1 Comparative Analysis of Related Work	15
Table 2 Traditional Matching Examples	59
Table 3 Intelligent Matching Examples	61
Table 4 Experimental Environment and System Parameter Settings	92
Table 5 Comparison of Keyword Extraction Methods	96
Table 6 Human Evaluation of Irrelevant Content Removal Quality	98
Table 7 Human Evaluation of Retaining Syllabus-Inferred Content	98
Table 8 Human Evaluation of Core Syllabus Knowledge Point Alignment	98
Table 9 Ablation Study of Module Contributions	99
Table 10 Syllabus-to-Text Semantic Alignment	105
Table 11 Slide Granularity vs. Structural Quality	108
Table 12 Knowledge Alignment and Slide Structuring Accuracy	110
Table 13 Comparison of Local Deployment Capability and Practical Usability	112
Table 14 Coverage Comparison of Audio-Visual Lecture Generation Systems	113
Table 15 Evaluation of Multi-Modal Audio-Visual Speech Synthesis Systems	114
References
[1] Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020). Wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449-12460.
[2] Cheng, K., Chen, X., Luo, Y., Zhou, P., Wang, Y., and Qian, K. (2022). VideoReTalking: Audio-based lip synchronization for talking head video editing in the wild. SIGGRAPH Asia 2022.
[3] Jiang, J., Li, X., Wang, X., Wang, C., and Liu, T. (2022). Locating key knowledge fragments for course videos with knowledge graphs. IEEE Transactions on Knowledge and Data Engineering, 34(3), 1101-1114.
[4] Kalyan, K. S., Bhowmik, P., and Saha, S. (2023). Multimodal alignment for automated lecture video generation. Transactions on Machine Learning Research.
[5] Li, J., Zhang, Y., Li, C., Liu, X., and Gao, J. (2022). PASS: Presentation automation for slide generation and speech. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022).
[6] Lin, S.-W., Huang, T.-H., and Lee, H.-Y. (2022). Towards automatic and multimodal synthesis of personalized educational videos. Findings of the Association for Computational Linguistics: ACL 2022, 1086–1097.
[7] Lin, W., Zhou, K., Li, Y., Zhan, Y., Liu, Z., and Li, X. (2023). TAVT: Towards transferable audio-visual text generation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023).
[8] Pratheeksha, D., Shreya Reddy, R., and Jayashree, R. (2022). Automatic notes generation from lecture videos. In ICDSMLA 2020: Proceedings of the 2nd International Conference on Data Science, Machine Learning and Applications, Springer, 433-441.
[9] Qian, K., Zhang, Y., Chang, S., Yang, X., Hasegawa-Johnson, M., and Wang, M. (2019). AutoVC: Zero-shot voice style transfer with only autoencoder loss. International Conference on Machine Learning (ICML 2019), 5210–5219.
[10] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. International Conference on Machine Learning (ICML 2023).
[11] Tikhonov, A., Pavlichenko, N., Sboev, A., Soldatov, A., Ivanov, V., Vlasov, V., and Karpov, A. (2023). Text-to-image generation for document illustration. International Conference on Learning Representations (ICLR 2023).
[12] Wang, Y., Qian, K., Zhang, Y., Chang, S., Yang, X., and Wang, M. (2021). Audio2Gestures: Generating diverse gestures from speech audio with conditional VAEs. SIGGRAPH Asia 2021.
[13] Li, Y. A., Han, C., Raghavan, V. S., Mischler, G., and Mesgarani, N. (2023). StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models. arXiv preprint arXiv:2306.07691.
[14] Xie, J., Wang, J., Li, M., and Xu, H. (2022). Semantically aligned sentence extraction for educational content generation. AAAI Conference on Artificial Intelligence, 36(10), 11216-11223.
[15] Yang, K., Xu, H., and Gao, K. (2020). CM-BERT: Cross-modal BERT for text-audio sentiment analysis. Proceedings of the 28th ACM International Conference on Multimedia, 521–528.
[16] Yu, T., Dai, W., Liu, Z., and Fung, P. (2021). Vision guided generative pre-trained language models for multimodal abstractive summarization. arXiv preprint arXiv:2109.02401.
[17] Zhang, K., Wang, W., Song, Y., and Wu, Y. (2021). Automatic keyphrase extraction: A survey of the state of the art. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL 2021), 1444–1464.
[18] Prajwal, K. R., Mukhopadhyay, R., Namboodiri, V. P., and Jawahar, C. V. (2020). A lip sync expert is all you need for speech to lip generation in the wild. Proceedings of the 28th ACM International Conference on Multimedia, 484–492.
[19] Lv, H., Zhou, C., Cui, Z., Xu, C., Li, Y., and Yang, J. (2021). Localizing anomalies from weakly-labeled videos. IEEE Transactions on Image Processing, 30, 4505-4515.
[20] Huang, Y., Yang, J., Han, S., Li, J., Zhou, X., and Gan, C. (2023). Make it sound right: Learning personalized prosody for personalized text-to-speech. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18506-18515.
[21] Lee, J., Kim, J., Song, Y., and Kim, Y. (2023). Visual captioning with audiovisual transformer. IEEE Transactions on Multimedia, 25, 1-12.
[22] Wang, P., Qian, Y., Fan, Y., and Soong, F. (2021). Speaker-adaptive and language-independent end-to-end audio-visual speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 1360-1371.
[23] Li, C., Wang, Y., Li, Y., Li, M., and Zhang, Z. (2022). Cross-modal video moment retrieval with multi-modal interaction networks. Proceedings of the 30th ACM International Conference on Multimedia, 1049-1058.
[24] Ma, S., Wang, S., Wang, Y., and Zhang, Z. (2021). Deep learning for multimodal educational content analysis. IEEE Transactions on Learning Technologies, 14(1), 102-114.
[25] Zhang, Y., Li, C., Liu, X., Wang, S., and Gao, J. (2023). SlideFormer: End-to-end slide generation from lecture audio. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1449-1460.
[26] Gao, T., Yao, X., and Chen, D. (2021). SimCSE: Simple contrastive learning of sentence embeddings. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), 6894–6910.
[27] Wang, C., Li, X., Liu, T., and Zhang, J. (2023). KeyGraphFormer: A graph neural network for keyphrase extraction. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 2453-2464.
[28] Ma, L., Li, X., Zhou, Z., Wang, Y., and Zhang, L. (2023). PromptRank: Prompt-based neural ranking for keyphrase extraction. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 1179-1190.
[29] Gu, Z., Li, B., Wang, Q., Jiang, Y., and Xu, C. (2021). TAS: Topic-aware summarizer for multimodal lecture summarization. Proceedings of the AAAI Conference on Artificial Intelligence, 35(16), 14091-14099.
[30] Lin, W., Li, Y., Zhan, Z., Liu, Z., and Li, X. (2021). TopicSlide: Slide generation with topic guidance from lecture transcripts. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15), 13277-13285.
[31] Khattab, O., and Zaharia, M. (2020). ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 39-48.
[32] Gao, K., Xu, H., Wang, Y., and Liu, X. (2022). KeyRank: Learning to rank keyphrases for text summarization. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), 1960-1971.
[33] Radford, A., Kim, J. W., Xu, T., Brockman, G., and Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. Proceedings of the International Conference on Machine Learning (ICML 2023).
[34] Prabhavalkar, R., Siohan, O., Rao, K., and Rybach, D. (2021). Automatic speech recognition with transformer models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5874-5878.
[35] Donahue, C., McAuley, J., and Puckette, M. (2022). End-to-end audio-visual speech using deep generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4), 2008-2020.
[36] Wang, Y., Ren, X., Liu, H., and Li, H. (2021). Multi-modal speech synthesis with robust audio-visual alignment. IEEE Transactions on Multimedia, 23, 355-366.
[37] Zhang, T., Liu, J., Huang, Y., Xie, L., and Yang, M. (2021). Deep audio-visual speech synthesis. IEEE Transactions on Neural Networks and Learning Systems, 32(11), 4958-4970.
[38] Zeghidour, M., Synnaeve, G., and Adi, E. (2021). End-to-end audio-visual speech recognition with transformer models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6638-6642.
[39] Song, L., Wang, J., Wang, S., and Tan, T. (2021). Deep speech enhancement for robust automatic speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 29, 1671–1683.
[40] Tao, J., Huang, Z., Li, X., and Yang, Y. (2021). Audio-visual learning with attention-based modality selection. IEEE Transactions on Multimedia, 23, 2021–2032.
[41] Afouras, T., Chung, J. S., Senior, A., Vinyals, O., and Zisserman, A. (2021). Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(7), 2556–2568.
[42] Nagrani, A., Albanie, S., and Zisserman, A. (2021). Attention-based fusion for multi-modal learning. IEEE Transactions on Multimedia, 23, 3021–3032.
[43] Li, Z., Wu, X., Wang, H., and Wang, L. (2022). End-to-end multimodal speech recognition with transformer. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 1265–1275.
[44] Xu, X., Lu, S., Wang, H., and Li, L. (2020). A multimodal approach to automated video summarization for education. IEEE Transactions on Learning Technologies, 13(4), 751–762.
[45] Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., and Le, Q. V. (2019). SpecAugment: A simple data augmentation method for automatic speech recognition. Interspeech 2019, 2613–2617.
[46] Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T.-Y. (2021). FastSpeech 2: Fast and high-quality end-to-end text to speech. International Conference on Learning Representations (ICLR 2021).
[47] Petridis, S., Stafylakis, T., Ma, P., and Pantic, M. (2020). End-to-end visual speech recognition for small-scale datasets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(9), 2429–2441.
[48] Li, Y., Wang, X., Liu, Y., and Zhang, S. (2022). Multi-modal summarization for instructional videos. IEEE Transactions on Multimedia, 24, 3126–3138.
[49] Zhou, H., Xu, W., Liu, J., and Xu, Z. (2022). Robust lip reading via multi-modal attention and multi-task learning. IEEE Transactions on Multimedia, 24, 5428–5438.
[50] Yang, X., Xu, Y., Zhang, Z., and Wang, Y. (2023). Cross-modal learning for automatic keyphrase extraction in instructional content. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 2391–2403.
[51] Sun, W., Zhang, Y., Li, C., and Gao, J. (2021). Multimodal alignment networks for lecture video generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL 2021), 2990–3002.
[52] Xu, X., Ma, W., Sun, C., and Tan, T. (2022). Towards automated educational video creation: Methods and challenges. IEEE Transactions on Learning Technologies, 15(2), 290–302.
[53] Chen, L., Liu, Y., Wang, Y., and Li, J. (2021). Generating educational slides from lecture videos using deep learning. IEEE Transactions on Learning Technologies, 14(3), 345–357.
[54] He, W., Zhang, J., Wei, S., and Luo, Y. (2023). Personalized text-to-speech synthesis with controllable prosody. ICASSP 2023 – IEEE International Conference on Acoustics, Speech and Signal Processing, 1–5.
[55] Kim, Y., Park, S., Lee, S., and Kim, H. (2022). Audio-driven facial animation using cross-modal attention. CVPR 2022 – IEEE Conference on Computer Vision and Pattern Recognition, 5023–5032.
[56] Niu, X., Zou, X., Chen, X., and Li, L. (2023). Deep multimodal fusion for robust speech-driven gesture generation. IEEE Transactions on Multimedia, 25, 1230–1242.
[57] Lin, S., Huang, T., and Lee, H. (2023). Multimodal synthesis of lecture videos with personalized avatars. Findings of the Association for Computational Linguistics: ACL 2023, 2251–2263.
[58] Li, J., Wang, Y., Zhao, S., and Liu, T. (2022). SlideGPT: End-to-end audio-visual speech to slide generation using deep generative models. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), 2449–2460.
[59]J.-M. Valin, “A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement,” in Proc. IEEE Workshop Multimedia Signal Process. (MMSP), Vancouver, Canada, Aug. 2018, pp. 1–5.
[60]R. K. Mohata, A. Goel, V. Bahl, and N. Sengar, “Peer to Peer Real-Time Communication Using WebRTC,” Int. J. Sci. Res. Comput. Sci. Eng. Inf. Technol., vol. 7, no. 6, pp. 178–183, Nov.–Dec. 2021.
[61]A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Whisper: Robust speech recognition via large-scale weak supervision,” arXiv preprint arXiv:2212.04356, Dec. 2022.
[62]B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Proc. Interspeech, Shanghai, China, Oct. 2020, pp. 3830–3834.
[63]Y.-T. Huang, C.-Y. Hsu, and T.-Y. Kuo, “Human pose estimation using MediaPipe Pose and optimization method based on a humanoid model,” Appl. Sci., vol. 13, no. 5, Art. no. 2700, Mar. 2023.
[64] V. Thakkar and B. Narayankar, “A review on 3D convolutional neural network,” in Proc. IEEE 3rd Int. Conf. Power, Electron. Comput. Appl. (ICPECA), Shenyang, China, Feb. 2023, pp. 958–962.
[65]J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Annu. Conf. North Amer. Chapter Assoc. Comput. Linguist.: Hum. Lang. Technol. (NAACL-HLT), Minneapolis, USA, June 2019, pp. 4171–4186.
[66]N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” in Proc. Conf. Empirical Methods Nat. Lang. Process. (EMNLP), Hong Kong, China, Nov. 2019, pp. 3982–3992.
[67]F. Demszky and D. Boland, “A comparative study on embedding models for keyword extraction using KeyBERT method,” in Proc. IEEE 13th Int. Conf. Syst. Eng. Technol. (ICSET), Johor Bahru, Malaysia, Oct. 2023.
[68] J. Li, D. Li, S. Savarese, and S. C. H. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” arXiv preprint arXiv:2301.12597, Jan. 2023.
[69]A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, and I. Sutskever, “Learning transferable visual models from natural language supervision,” arXiv preprint arXiv:2103.00020, Mar. 2021.
[70]R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog. (CVPR), New Orleans, USA, June 2022, pp. 10684–10695.
[71] K. R. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. V. Jawahar, “A lip sync expert is all you need for speech to lip generation in the wild,” in Proc. 28th ACM Int. Conf. Multimedia (ACM MM), Seattle, WA, USA, Oct. 2020, pp. 484–492, doi: 10.1145/3394171.3413532.
[72] Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” arXiv preprint arXiv:2410.06885, Oct. 2024.
[73] I. Li, A. R. Fabbri, R. R. Tung, and D. R. Radev, “What should I learn first: Introducing LectureBank for NLP education and prerequisite chain learning,” arXiv preprint arXiv:1811.12181, Nov. 2018.
[74]H. Singh, R. West, and G. Colavizza, “Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia,” arXiv preprint arXiv:2007.07022, Jul. 2020.
[75]X. Zhong, J. Tang, and A. J. Yepes, “PubLayNet: largest dataset ever for document layout analysis,” in Proc. 2019 Int. Conf. Document Anal. Recognit. (ICDAR), pp. 1015–1022, Sep. 2019.
[76]K. Sparck Jones, “A statistical interpretation of term specificity and its application in retrieval,” J. Documentation, vol. 28, no. 1, pp. 11–21, 1972.
[77]R. Mihalcea and P. Tarau, “TextRank: Bringing order into text,” in Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, July 2004, pp. 404–411.
[78]R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, and A. Jatowt, “YAKE! Keyword extraction from single documents using multiple local features,” Inf. Sci., vol. 509, pp. 257–289, 2020.
[79]N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” in Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, China, Nov. 2019, pp. 3982–3992.
[80]M. Cao, L. Bing, Z. Liu, and I. King, “Beyond surface similarity: Evaluating semantic faithfulness of abstractive summarization via question answering,” in Proc. 61st Annu. Meeting Assoc. Comput. Linguist. (ACL), Toronto, Canada, Jul. 2023, pp. 9738–9752.
[81]X. Lu, Y. Liu, P. Liu, and G. Neubig, “Is faithfulness in abstractive summarization actually measurable?” in Proc. 61st Annu. Meeting Assoc. Comput. Linguist. (ACL), Toronto, Canada, Jul. 2023, pp. 10451–10468.
[82]P. Jyothi, M. Saxena, and S. Sitaram, “Generating contrastive explanations for natural language inference,” in Proc. 2021 Conf. Empirical Methods Nat. Lang. Process. (EMNLP), Punta Cana, Dominican Republic, Nov. 2021, pp. 3369–3382.
[83]S. Robertson and H. Zaragoza, “The probabilistic relevance framework: BM25 and beyond,” Found. Trends Inf. Retr., vol. 3, no. 4, pp. 333–389, 2009.
[84]A. Qiu, Y. Huang, and Z. Zhang, “KevRank: Keyword-aware entity-centric visual knowledge ranker for knowledge-grounded dialogue,” in Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP), Abu Dhabi, UAE, Dec. 2022, pp. 5321–5335.
[85]D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, et al., “Universal sentence encoder,” arXiv preprint arXiv:1803.11175, 2018.
[86]T. Gao, X. Yao, and D. Chen, “SimCSE: Simple contrastive learning of sentence embeddings,” in Proc. Annu. Meeting Assoc. Comput. Linguist. (ACL), Online, Aug. 2021, pp. 6894–6910.
[87]O. Khattab and M. Zaharia, “ColBERT: Efficient and effective passage search via contextualized late interaction over BERT,” in Proc. Int. ACM SIGIR Conf. Res. Dev. Inf. Retr., Virtual Event, Jul. 2020, pp. 39–48.
[88]J. Jose and B. Soundarabai, “GWebPositionRank: Unsupervised Graph and Web-based Keyphrase Extraction from BERT Embeddings,” in 2024 IEEE Int. Conf. for Women in Innovation, Technology & Entrepreneurship (ICWITE), 2024, pp. 45–52.
[89]R. Bai, F. Liu, X. Zhuang, and Y. Yan, “MICRank: Multi-information interconstrained keyphrase extraction,” Expert Systems with Applications, vol. 249, 2024, Art. no. 123744.
[90]I. Muneer, A. Saeed, and R. M. A. Nawab, “Cross-Lingual English–Urdu Semantic Word Similarity Using Sentence Transformers,” Eur. J. on Artificial Intelligence, 2025.
[91]K. Mao and Q. Zhao, “PIM-ST: a New Paraphrase Identification Model Incorporating Sequence and Topic Information,” in 2024 Int. Symp. on Computer Technology and Information Science (ISCTIS), 2024, pp. 894–898.
[92]Y. Cao, H. Xu, and J. Li, “TopicSlide: Semantic segmentation of presentation slides using hierarchical topic modeling,” in Proc. AAAI Conf. Artificial Intelligence, vol. 35, no. 14, pp. 12302–12310, May 2021. 
[93]L. Sun, K. Li, Y. Chen, and D. Wang, “TAS: Topic-aware structure learning for scientific document summarization,” in Proc. Annu. Meeting Assoc. Comput. Linguist. (ACL), Online, Aug. 2021, pp. 6543–6554. 
[94]Z. He, Y. Gao, L. Liu, and M. Sun, “AutoSlide: Scientific slide generation via hierarchical layout-aware structure induction,” in Proc. Annu. Meeting Assoc. Comput. Linguist. (ACL), Toronto, Canada, July 2023, pp. 4870–4882. 
[95]W. Cai, R. Zhou, and W. Ma, “Efficient and Effective Unsupervised Entity Alignment in Large Knowledge Graphs,” Applied Sciences, 2025.
[96]J. Fang and X. Yan, “MDSEA: Knowledge Graph Entity Alignment Based on Multimodal Data Supervision,” Applied Sciences, 2024.
[97]X. Chen, T. Lu, and Z. Wang, “LLM-Align: Utilizing Large Language Models for Entity Alignment in Knowledge Graphs,” ArXiv, vol. abs/2412.04690, 2024.
Full-Text Use Authorization
National Central Library
Agrees to grant the National Central Library a royalty-free license to release the bibliographic record and electronic full text on the Internet; public release of the electronic full text and the Chinese/English abstracts is deferred until 2030-08-06.
On campus
The on-campus print copy is embargoed until 2030-08-06.
Agrees to worldwide release of the electronic full text.
The on-campus electronic full text, including the Chinese/English abstracts, is embargoed until 2030-08-06.
Off campus
Agrees to license the thesis to database vendors.
The off-campus electronic full text, including the Chinese/English abstracts, is embargoed until 2030-08-06.
