| System ID | U0002-0508202523465400 |
|---|---|
| DOI | 10.6846/tku202500671 |
| Thesis Title (Chinese) | 基於語音之MOOCs自動生成系統 |
| Thesis Title (English) | Automatic MOOCs Generation System Based on Speech |
| Thesis Title (Third Language) | |
| University | 淡江大學 (Tamkang University) |
| Department (Chinese) | 資訊工程學系全英語碩士班 |
| Department (English) | Master's Program, Department of Computer Science and Information Engineering (English-taught program) |
| Foreign Degree University | |
| Foreign Degree College | |
| Foreign Degree Institute | |
| Academic Year | 113 (ROC; 2024–2025) |
| Semester | 2 |
| Year of Publication | 114 (ROC; 2025) |
| Graduate Student (Chinese) | 郭雅馨 |
| Graduate Student (English) | Ya-Sin Guo |
| Student ID | 613780112 |
| Degree | Master's |
| Language | English |
| Second Language | |
| Date of Oral Defense | 2025-06-08 |
| Number of Pages | 130 pages |
| Committee Members | Advisor: 張志勇 (cychang@mail.tku.edu.tw); Committee Member: 張榮貴; Committee Member: 張義雄; Co-advisor: 武士戎 (wushihjung@mail.tku.edu.tw) |
| Keywords (Chinese) | 知識點語意對齊、多模態講師生成、嘴型同步、教學影片自動生成、AI虛擬講師、智慧教育科技 |
| Keywords (English) | Knowledge Point Semantic Alignment; Multimodal Instructor Generation; Lip-sync; Automated Educational Video Generation; AI Virtual Instructor; Intelligent Educational Technology |
| Keywords (Third Language) | |
| Subject Classification | |
| Abstract (Third Language) | |
| Table of Contents |
Table of Contents IX
List of Figures XIII
List of Tables XV
Chapter 1 Introduction 1
Chapter 2 Related Work 8
  2.1 Speech Recognition and Semantic Knowledge Alignment 8
    2.1.1 Self-Supervised ASR and Transcription 8
    2.1.2 Knowledge Point Extraction and Semantic Alignment 9
  2.2 Multimodal Lip-Sync and Personalized Gesture Generation 10
    2.2.1 Lip Synchronization and Audio-Visual Coordination 10
    2.2.2 Gesture Generation and Personalized Instructor Modeling 10
  2.3 Multimodal Structured Content Generation and Visual Coordination 11
    2.3.1 Automatic Slide Generation and Structured Layout Design 12
    2.3.2 Multimodal Slide Pagination and Topic-Oriented Structuring 12
    2.3.3 Automated Illustration and Visual Content Generation 13
  2.4 Summary 13
Chapter 3 Background Knowledge 16
  3-1 Speech Preprocessing and Noise Suppression Techniques 16
    3-1-1 RNNoise [59] 17
    3-1-2 WebRTC Voice Activity Detection (VAD) 19
  3-2 Automatic Speech Recognition (ASR) Model: OpenAI Whisper 21
  3-3 Audio Feature Extraction and Speaker Identification (ECAPA-TDNN) 23
  3-4 Multimodal Action Classification 25
    3-4-1 Skeleton Extraction with MediaPipe 26
    3-4-2 3D Convolutional Neural Network (3D-CNN) Feature Learning 27
    3-4-3 Semantic Feature Fusion with BERT 29
  3-5 Syllabus-Based Knowledge Point Extraction and Semantic Alignment 30
    3-5-1 Sentence-BERT for Embedding and Alignment 31
    3-5-2 Keyword Extraction with KeyBERT 32
  3-6 Automatic Presentation Generation 33
    3-6-1 BLIP-2: Multimodal Understanding and Topic Generation 34
    3-6-2 CLIP: Cross-Modal Retrieval and Semantic Alignment 36
    3-6-3 Stable Diffusion for Generative Image Support 37
  3-7 Lip Synchronization and Voice Synthesis for Virtual Instructors 39
    3-7-1 Wav2Lip: Lip Synchronization for Virtual Humans 39
    3-7-2 High-Fidelity Speech Synthesis Technology 40
Chapter 4 System Design 43
  4-1 Overall System Architecture 43
  4-2 Data Collection and Preprocessing 45
    4-2-1 Speech Preprocessing 46
    4-2-2 Video Preprocessing 48
    4-2-3 Syllabus Text Structuring and Preprocessing 50
  4-3 Automatic Speech Recognition and Speaker Analysis 52
    4-3-1 ASR Architecture and Training 53
    4-3-2 Speaker Diarization and Embedding Modeling 55
    4-3-3 Temporal Alignment and Multimodal Annotation Integration 56
  4-4 Knowledge Point Extraction and Semantic Alignment 57
    4-4-1 Syllabus Knowledge Embedding and Structuring 58
    4-4-2 Semantic Alignment Strategy and Technical Implementation 59
    4-4-3 Segment-Level Topic Annotation and Keyword Extraction 62
  4-5 Multimodal Motion Analysis and Instructor Behavior Modeling 63
    4-5-1 Skeletal Feature Extraction and Motion Intensity Labeling 63
    4-5-2 Multimodal Model Training Phase 65
    4-5-3 Multimodal Model Inference Phase 69
  4-6 Slide and Multimedia Content Generation 70
    4-6-1 Automated Generation of Text-Based Slides 71
    4-6-2 Image Insertion and Semantic Alignment Process 72
    4-6-3 Visual Assembly and Structured Output 76
  4-7 Virtual Lecturer and Video Generation 79
    4-7-1 Personalized Speech Synthesis and Alignment (F5-TTS) 80
    4-7-2 Lip Synchronization: Deep Lip-Syncing with Wav2Lip 81
    4-7-3 Multimodal Animation Synthesis and Motion Control 83
    4-7-4 Multimedia Video Composition and Interactive Functionality 84
Chapter 5 Experimental Design and System Evaluation 88
  5-1 Datasets 88
  5-2 Experimental Environment and System Configuration 91
  5-3 Experimental Data and Results 93
    5-3-1 Evaluation Metrics and Formulas 93
    5-3-2 Comparative Analysis of Keyword Extraction Methods 95
    5-3-3 Multimodal Filtering and Alignment Evaluation 97
    5-3-4 Ablation Study of Key Modules and Technical Contributions 98
    5-3-5 Prompt Engineering Design and Alignment Performance 100
    5-3-6 In-Depth Comparison of Semantic Alignment Methods 102
    5-3-7 Slide Segmentation Strategy and Semantic Optimization 106
    5-3-8 Structured Slide Performance and Cross-Domain Comparison 110
    5-3-9 Local Deployment and Practical Comparison 111
    5-3-10 Overall System Architecture and Multimodal Comparison 112
Chapter 6 Conclusion 115
  6-1 Completed Work 115
  6-2 Future Work 116
References 118
List of Figures
  Figure 1 Research Motivation 1
  Figure 2 Scenario Illustration 7
  Figure 3 RNNoise Architecture [59] 19
  Figure 4 WebRTC System Architecture [60] 21
  Figure 5 Whisper System Architecture [61] 23
  Figure 6 ECAPA-TDNN Core Design [62] 25
  Figure 7 MediaPipe Keypoint Extraction [63] 27
  Figure 8 3D-CNN Architecture [64] 29
  Figure 9 BERT Pretraining: MLM and NSP [65] 30
  Figure 10 S-BERT Training Architecture [66] 32
  Figure 11 KeyBERT Keyword Extraction Workflow [67] 33
  Figure 12 BLIP-2 Architecture [68] 36
  Figure 13 CLIP Architecture [69] 37
  Figure 14 Wav2Lip Architecture and Training Flow [71] 40
  Figure 15 F5-TTS Architecture [72] 42
  Figure 16 Overall System Architecture 45
  Figure 17 Speech Preprocessing 47
  Figure 18 Video Preprocessing 49
  Figure 19 Video Sliding Window Frame Extraction 50
  Figure 20 Syllabus Preprocessing 52
  Figure 21 OpenAI Whisper Architecture and Training 54
  Figure 22 Syllabus Knowledge Extraction and Semantic Alignment 58
  Figure 23 Action Classification: Large, Medium, Small 65
  Figure 24 Multimodal Model: Training Phase 66
  Figure 25 Multimodal Model: Inference Phase 70
  Figure 26 Structured Text to Summary and Markdown Slide Format 72
  Figure 27 Automatic Image Insertion into Slides (with Material) 73
  Figure 28 Automatic Image Insertion into Slides (with Material) 76
  Figure 29 User Interface 1 78
  Figure 30 User Interface 2 79
  Figure 31 User Interface 3 79
  Figure 32 Voice Cloning Architecture 81
  Figure 33 Lip-Syncing Method Architecture 83
  Figure 34 Text and Speed Control Interface 85
  Figure 35 Video Output Showcase 87
  Figure 36 3D Bar Chart of Datasets 91
  Figure 37 Knowledge Point Extraction Strategies: Heatmap 97
  Figure 38 Ablation Study: Bar Chart 100
  Figure 39 Prompt Engineering and Knowledge Alignment: PCA Visualization 102
  Figure 40 Semantic Alignment Radar Chart (LectureBank [73]) 106
  Figure 41 Semantic Alignment Radar Chart (PubLayNet [75]) 106
  Figure 42 3D Chart of Knowledge Points per Slide (LectureBank [73]) 109
  Figure 43 3D Chart of Knowledge Points per Slide (PubLayNet [75]) 110
List of Tables
  Table 1 Comparative Analysis of Related Work 15
  Table 2 Traditional Matching Examples 59
  Table 3 Intelligent Matching Examples 61
  Table 4 Experimental Environment and System Parameter Settings 92
  Table 5 Comparison with Keyword Extraction Methods 96
  Table 6 Human Evaluation of Irrelevant Content Removal Quality 98
  Table 7 Human Evaluation of Retaining Syllabus-Inferred Content 98
  Table 8 Human Evaluation of Core Syllabus Knowledge Point Alignment 98
  Table 9 Ablation Study of Module Contributions 99
  Table 10 Syllabus-to-Text Semantic Alignment 105
  Table 11 Slide Granularity vs. Structural Quality 108
  Table 12 Knowledge Alignment and Slide Structuring Accuracy 110
  Table 13 Comparison of Local Deployment Capability and Practical Usability 112
  Table 14 Coverage Comparison of Audio-Visual Lecture Generation Systems 113
  Table 15 Evaluation of Multi-Modal Audio-Visual Speech Synthesis Systems 114 |
| References |
[1] Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020). Wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449–12460.
[2] Cheng, K., Chen, X., Luo, Y., Zhou, P., Wang, Y., and Qian, K. (2022). VideoReTalking: Audio-based lip synchronization for talking head video editing in the wild. SIGGRAPH Asia 2022.
[3] Jiang, J., Li, X., Wang, X., Wang, C., and Liu, T. (2022). Locating key knowledge fragments for course videos with knowledge graphs. IEEE Transactions on Knowledge and Data Engineering, 34(3), 1101–1114.
[4] Kalyan, K. S., Bhowmik, P., and Saha, S. (2023). Multimodal alignment for automated lecture video generation. Transactions on Machine Learning Research.
[5] Li, J., Zhang, Y., Li, C., Liu, X., and Gao, J. (2022). PASS: Presentation automation for slide generation and speech. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022).
[6] Lin, S.-W., Huang, T.-H., and Lee, H.-Y. (2022). Towards automatic and multimodal synthesis of personalized educational videos. Findings of the Association for Computational Linguistics: ACL 2022, 1086–1097.
[7] Lin, W., Zhou, K., Li, Y., Zhan, Y., Liu, Z., and Li, X. (2023). TAVT: Towards transferable audio-visual text generation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023).
[8] Pratheeksha, D., Shreya Reddy, R., and Jayashree, R. (2022). Automatic notes generation from lecture videos. In ICDSMLA 2020: Proceedings of the 2nd International Conference on Data Science, Machine Learning and Applications, Springer, 433–441.
[9] Qian, K., Zhang, Y., Chang, S., Yang, X., Hasegawa-Johnson, M., and Wang, M. (2019). AutoVC: Zero-shot voice style transfer with only autoencoder loss. International Conference on Machine Learning (ICML 2019), 5210–5219.
[10] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. International Conference on Machine Learning (ICML 2023).
[11] Tikhonov, A., Pavlichenko, N., Sboev, A., Soldatov, A., Ivanov, V., Vlasov, V., and Karpov, A. (2023). Text-to-image generation for document illustration. International Conference on Learning Representations (ICLR 2023).
[12] Wang, Y., Qian, K., Zhang, Y., Chang, S., Yang, X., and Wang, M. (2021). Audio2Gestures: Generating diverse gestures from speech audio with conditional VAEs. SIGGRAPH Asia 2021.
[13] Wang, Y., Ren, Y., Tan, X., He, D., Qin, T., Zhao, S., and Liu, T. Y. (2023). StyleTTS 2: Towards real-time zero-shot text-to-speech. arXiv preprint arXiv:2305.02891.
[14] Xie, J., Wang, J., Li, M., and Xu, H. (2022). Semantically aligned sentence extraction for educational content generation. AAAI Conference on Artificial Intelligence, 36(10), 11216–11223.
[15] Yang, K., Xu, H., and Gao, K. (2020). CM-BERT: Cross-modal BERT for text-audio sentiment analysis. Proceedings of the 28th ACM International Conference on Multimedia, 521–528.
[16] Yu, T., Dai, W., Liu, Z., and Fung, P. (2021). Vision guided generative pre-trained language models for multimodal abstractive summarization. arXiv preprint arXiv:2106.14874.
[17] Zhang, K., Wang, W., Song, Y., and Wu, Y. (2021). Automatic keyphrase extraction: A survey of the state of the art. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL 2021), 1444–1464.
[18] Prajwal, K. R., Mukhopadhyay, R., Namboodiri, V. P., and Jawahar, C. V. (2020). Wav2Lip: Accurately lip-syncing videos in the wild. Proceedings of the 28th ACM International Conference on Multimedia, 3504–3512.
[19] Lv, H., Zhou, C., Cui, Z., Xu, C., Li, Y., and Yang, J. (2021). Localizing anomalies from weakly-labeled videos. IEEE Transactions on Image Processing, 30, 4505–4515.
[20] Huang, Y., Yang, J., Han, S., Li, J., Zhou, X., and Gan, C. (2023). Make it sound right: Learning personalized prosody for personalized text-to-speech. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18506–18515.
[21] Lee, J., Kim, J., Song, Y., and Kim, Y. (2023). Visual captioning with audiovisual transformer. IEEE Transactions on Multimedia, 25, 1–12.
[22] Wang, P., Qian, Y., Fan, Y., and Soong, F. (2021). Speaker-adaptive and language-independent end-to-end audio-visual speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 1360–1371.
[23] Li, C., Wang, Y., Li, Y., Li, M., and Zhang, Z. (2022). Cross-modal video moment retrieval with multi-modal interaction networks. Proceedings of the 30th ACM International Conference on Multimedia, 1049–1058.
[24] Ma, S., Wang, S., Wang, Y., and Zhang, Z. (2021). Deep learning for multimodal educational content analysis. IEEE Transactions on Learning Technologies, 14(1), 102–114.
[25] Zhang, Y., Li, C., Liu, X., Wang, S., and Gao, J. (2023). SlideFormer: End-to-end slide generation from lecture audio. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1449–1460.
[26] Wu, S., Sun, H., Liu, X., Wei, F., and Zhou, M. (2021). SimCSE: Simple contrastive learning of sentence embeddings. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL 2021), 689–705.
[27] Wang, C., Li, X., Liu, T., and Zhang, J. (2023). KeyGraphFormer: A graph neural network for keyphrase extraction. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 2453–2464.
[28] Ma, L., Li, X., Zhou, Z., Wang, Y., and Zhang, L. (2023). PromptRank: Prompt-based neural ranking for keyphrase extraction. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 1179–1190.
[29] Gu, Z., Li, B., Wang, Q., Jiang, Y., and Xu, C. (2021). TAS: Topic-aware summarizer for multimodal lecture summarization. Proceedings of the AAAI Conference on Artificial Intelligence, 35(16), 14091–14099.
[30] Lin, W., Li, Y., Zhan, Z., Liu, Z., and Li, X. (2021). TopicSlide: Slide generation with topic guidance from lecture transcripts. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15), 13277–13285.
[31] Khattab, O., Zaharia, M., and Gionis, A. (2020). ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 39–48.
[32] Gao, K., Xu, H., Wang, Y., and Liu, X. (2022). KeyRank: Learning to rank keyphrases for text summarization. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), 1960–1971.
[33] Radford, A., Kim, J. W., Xu, T., Brockman, G., and Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. Proceedings of the International Conference on Machine Learning (ICML 2023).
[34] Prabhavalkar, R., Siohan, O., Rao, K., and Rybach, D. (2021). Automatic speech recognition with transformer models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5874–5878.
[35] Donahue, C., McAuley, J., and Puckette, M. (2022). End-to-end audio-visual speech using deep generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4), 2008–2020.
[36] Wang, Y., Ren, X., Liu, H., and Li, H. (2021). Multi-modal speech synthesis with robust audio-visual alignment. IEEE Transactions on Multimedia, 23, 355–366.
[37] Zhang, T., Liu, J., Huang, Y., Xie, L., and Yang, M. (2021). Deep audio-visual speech synthesis. IEEE Transactions on Neural Networks and Learning Systems, 32(11), 4958–4970.
[38] Zeghidour, M., Synnaeve, G., and Adi, E. (2021). End-to-end audio-visual speech recognition with transformer models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6638–6642.
[39] Song, L., Wang, J., Wang, S., and Tan, T. (2021). Deep speech enhancement for robust automatic speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 29, 1671–1683.
[40] Tao, J., Huang, Z., Li, X., and Yang, Y. (2021). Audio-visual learning with attention-based modality selection. IEEE Transactions on Multimedia, 23, 2021–2032.
[41] Afouras, T., Chung, J. S., Senior, A., Vinyals, O., and Zisserman, A. (2021). Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(7), 2556–2568.
[42] Nagrani, A., Albanie, S., and Zisserman, A. (2021). Attention-based fusion for multi-modal learning. IEEE Transactions on Multimedia, 23, 3021–3032.
[43] Li, Z., Wu, X., Wang, H., and Wang, L. (2022). End-to-end multimodal speech recognition with transformer. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 1265–1275.
[44] Xu, X., Lu, S., Wang, H., and Li, L. (2020). A multimodal approach to automated video summarization for education. IEEE Transactions on Learning Technologies, 13(4), 751–762.
[45] Ko, T., Peddinti, V., Povey, D., and Khudanpur, S. (2021). SpecAugment: A simple data augmentation method for automatic speech recognition. ICASSP 2021 – IEEE International Conference on Acoustics, Speech and Signal Processing, 7009–7013.
[46] Ren, Y., Hu, C., Tan, X., and Qin, T. (2022). FastSpeech 2: Fast and high-quality end-to-end text to speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 421–434.
[47] Petridis, S., Stafylakis, T., Ma, P., and Pantic, M. (2020). End-to-end visual speech recognition for small-scale datasets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(9), 2429–2441.
[48] Li, Y., Wang, X., Liu, Y., and Zhang, S. (2022). Multi-modal summarization for instructional videos. IEEE Transactions on Multimedia, 24, 3126–3138.
[49] Zhou, H., Xu, W., Liu, J., and Xu, Z. (2022). Robust lip reading via multi-modal attention and multi-task learning. IEEE Transactions on Multimedia, 24, 5428–5438.
[50] Yang, X., Xu, Y., Zhang, Z., and Wang, Y. (2023). Cross-modal learning for automatic keyphrase extraction in instructional content. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 2391–2403.
[51] Sun, W., Zhang, Y., Li, C., and Gao, J. (2021). Multimodal alignment networks for lecture video generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL 2021), 2990–3002.
[52] Xu, X., Ma, W., Sun, C., and Tan, T. (2022). Towards automated educational video creation: Methods and challenges. IEEE Transactions on Learning Technologies, 15(2), 290–302.
[53] Chen, L., Liu, Y., Wang, Y., and Li, J. (2021). Generating educational slides from lecture videos using deep learning. IEEE Transactions on Learning Technologies, 14(3), 345–357.
[54] He, W., Zhang, J., Wei, S., and Luo, Y. (2023). Personalized text-to-speech synthesis with controllable prosody. ICASSP 2023 – IEEE International Conference on Acoustics, Speech and Signal Processing, 1–5.
[55] Kim, Y., Park, S., Lee, S., and Kim, H. (2022). Audio-driven facial animation using cross-modal attention. CVPR 2022 – IEEE Conference on Computer Vision and Pattern Recognition, 5023–5032.
[56] Niu, X., Zou, X., Chen, X., and Li, L. (2023). Deep multimodal fusion for robust speech-driven gesture generation. IEEE Transactions on Multimedia, 25, 1230–1242.
[57] Lin, S., Huang, T., and Lee, H. (2023). Multimodal synthesis of lecture videos with personalized avatars. Findings of the Association for Computational Linguistics: ACL 2023, 2251–2263.
[58] Li, J., Wang, Y., Zhao, S., and Liu, T. (2022). SlideGPT: End-to-end audio-visual speech to slide generation using deep generative models. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), 2449–2460.
[59] J.-M. Valin, "A hybrid DSP/deep learning approach to real-time full-band speech enhancement," in Proc. IEEE Workshop Multimedia Signal Process. (MMSP), Vancouver, Canada, Aug. 2018, pp. 1–5.
[60] R. K. Mohata, A. Goel, V. Bahl, and N. Sengar, "Peer to peer real-time communication using WebRTC," Int. J. Sci. Res. Comput. Sci. Eng. Inf. Technol., vol. 7, no. 6, pp. 178–183, Nov.–Dec. 2021.
[61] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," arXiv preprint arXiv:2212.04356, Dec. 2022.
[62] B. Desplanques, J. Thienpondt, and K. Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," in Proc. Interspeech, Shanghai, China, Oct. 2020, pp. 3830–3834.
[63] Y.-T. Huang, C.-Y. Hsu, and T.-Y. Kuo, "Human pose estimation using MediaPipe Pose and optimization method based on a humanoid model," Appl. Sci., vol. 13, no. 5, Art. no. 2700, Mar. 2023.
[64] V. Thakkar and B. Narayankar, "A review on 3D convolutional neural network," in Proc. IEEE 3rd Int. Conf. Power, Electron. Comput. Appl. (ICPECA), Shenyang, China, Feb. 2023, pp. 958–962.
[65] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Annu. Conf. North Amer. Chapter Assoc. Comput. Linguist.: Hum. Lang. Technol. (NAACL-HLT), Minneapolis, USA, June 2019, pp. 4171–4186.
[66] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Proc. Conf. Empirical Methods Nat. Lang. Process. (EMNLP), Hong Kong, China, Nov. 2019, pp. 3982–3992.
[67] F. Demszky and D. Boland, "A comparative study on embedding models for keyword extraction using KeyBERT method," in Proc. IEEE 13th Int. Conf. Syst. Eng. Technol. (ICSET), Johor Bahru, Malaysia, Oct. 2023.
[68] J. Li, D. Li, S. Savarese, and S. C. H. Hoi, "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," arXiv preprint arXiv:2301.12597, Jan. 2023.
[69] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, and I. Sutskever, "Learning transferable visual models from natural language supervision," arXiv preprint arXiv:2103.00020, Mar. 2021.
[70] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog. (CVPR), New Orleans, USA, June 2022, pp. 10684–10695.
[71] K. R. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. V. Jawahar, "A lip sync expert is all you need for speech to lip generation in the wild," in Proc. ACM Int. Conf. Multimedia (ACM MM), Seattle, WA, USA, Oct. 2020, pp. 484–492, doi: 10.1145/3394171.3413532.
[72] Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, "F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching," arXiv preprint arXiv:2410.06885, Oct. 2024.
[73] I. Li, A. R. Fabbri, R. R. Tung, and D. R. Radev, "LectureBank: A dataset of prerequisite structures in NLP education," arXiv preprint arXiv:1811.12181, Nov. 2018.
[74] H. Singh, R. West, and G. Colavizza, "Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia," arXiv preprint arXiv:2007.07022, Jul. 2020.
[75] X. Zhong, J. Tang, and A. J. Yepes, "PubLayNet: Largest dataset ever for document layout analysis," in Proc. Int. Conf. Document Anal. Recognit. (ICDAR), Sep. 2019, pp. 1015–1022.
[76] K. Spärck Jones, "A statistical interpretation of term specificity and its application in retrieval," J. Documentation, vol. 28, no. 1, pp. 11–21, 1972.
[77] R. Mihalcea and P. Tarau, "TextRank: Bringing order into text," in Proc. Conf. Empirical Methods Nat. Lang. Process. (EMNLP), Barcelona, Spain, July 2004, pp. 404–411.
[78] R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, and A. Jatowt, "YAKE! Keyword extraction from single documents using multiple local features," Inf. Sci., vol. 509, pp. 257–289, 2020.
[79] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Proc. Conf. Empirical Methods Nat. Lang. Process. (EMNLP), Hong Kong, China, Nov. 2019, pp. 3982–3992.
[80] M. Cao, L. Bing, Z. Liu, and I. King, "Beyond surface similarity: Evaluating semantic faithfulness of abstractive summarization via question answering," in Proc. 61st Annu. Meeting Assoc. Comput. Linguist. (ACL), Toronto, Canada, Jul. 2023, pp. 9738–9752.
[81] X. Lu, Y. Liu, P. Liu, and G. Neubig, "Is faithfulness in abstractive summarization actually measurable?" in Proc. 61st Annu. Meeting Assoc. Comput. Linguist. (ACL), Toronto, Canada, Jul. 2023, pp. 10451–10468.
[82] P. Jyothi, M. Saxena, and S. Sitaram, "Generating contrastive explanations for natural language inference," in Proc. Conf. Empirical Methods Nat. Lang. Process. (EMNLP), Punta Cana, Dominican Republic, Nov. 2021, pp. 3369–3382.
[83] S. Robertson and H. Zaragoza, "The probabilistic relevance framework: BM25 and beyond," Found. Trends Inf. Retr., vol. 3, no. 4, pp. 333–389, 2009.
[84] A. Qiu, Y. Huang, and Z. Zhang, "KevRank: Keyword-aware entity-centric visual knowledge ranker for knowledge-grounded dialogue," in Proc. Conf. Empirical Methods Nat. Lang. Process. (EMNLP), Abu Dhabi, UAE, Dec. 2022, pp. 5321–5335.
[85] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, et al., "Universal sentence encoder," arXiv preprint arXiv:1803.11175, 2018.
[86] T. Gao, X. Yao, and D. Chen, "SimCSE: Simple contrastive learning of sentence embeddings," in Proc. Annu. Meeting Assoc. Comput. Linguist. (ACL), Online, Aug. 2021, pp. 6894–6910.
[87] O. Khattab and M. Zaharia, "ColBERT: Efficient and effective passage search via contextualized late interaction over BERT," in Proc. Int. ACM SIGIR Conf. Res. Dev. Inf. Retr., Virtual Event, Jul. 2020, pp. 39–48.
[88] J. Jose and B. Soundarabai, "GWebPositionRank: Unsupervised graph and web-based keyphrase extraction from BERT embeddings," in Proc. IEEE Int. Conf. for Women in Innovation, Technology & Entrepreneurship (ICWITE), 2024, pp. 45–52.
[89] R. Bai, F. Liu, X. Zhuang, and Y. Yan, "MICRank: Multi-information interconstrained keyphrase extraction," Expert Systems with Applications, vol. 249, Art. no. 123744, 2024.
[90] I. Muneer, A. Saeed, and R. M. A. Nawab, "Cross-lingual English–Urdu semantic word similarity using sentence transformers," Eur. J. on Artificial Intelligence, 2025.
[91] K. Mao and Q. Zhao, "PIM-ST: A new paraphrase identification model incorporating sequence and topic information," in Proc. Int. Symp. Computer Technology and Information Science (ISCTIS), 2024, pp. 894–898.
[92] Y. Cao, H. Xu, and J. Li, "TopicSlide: Semantic segmentation of presentation slides using hierarchical topic modeling," in Proc. AAAI Conf. Artificial Intelligence, vol. 35, no. 14, May 2021, pp. 12302–12310.
[93] L. Sun, K. Li, Y. Chen, and D. Wang, "TAS: Topic-aware structure learning for scientific document summarization," in Proc. Annu. Meeting Assoc. Comput. Linguist. (ACL), Online, Aug. 2021, pp. 6543–6554.
[94] Z. He, Y. Gao, L. Liu, and M. Sun, "AutoSlide: Scientific slide generation via hierarchical layout-aware structure induction," in Proc. Annu. Meeting Assoc. Comput. Linguist. (ACL), Toronto, Canada, July 2023, pp. 4870–4882.
[95] W. Cai, R. Zhou, and W. Ma, "Efficient and effective unsupervised entity alignment in large knowledge graphs," Applied Sciences, 2025.
[96] J. Fang and X. Yan, "MDSEA: Knowledge graph entity alignment based on multimodal data supervision," Applied Sciences, 2024.
[97] X. Chen, T. Lu, and Z. Wang, "LLM-Align: Utilizing large language models for entity alignment in knowledge graphs," arXiv preprint arXiv:2412.04690, 2024. |
| Full-Text Access Rights | |