| System ID | U0002-0608202501311900 |
|---|---|
| DOI | 10.6846/tku202500672 |
| Title (Chinese) | 虛擬兒女聊天系統 |
| Title (English) | Virtual Child Conversational System for Elderly Companionship |
| Title (third language) | |
| University | 淡江大學 (Tamkang University) |
| Department (Chinese) | 資訊工程學系全英語碩士班 |
| Department (English) | Master's Program, Department of Computer Science and Information Engineering (English-taught program) |
| Foreign degree: university | |
| Foreign degree: college | |
| Foreign degree: graduate institute | |
| Academic year | 113 |
| Semester | 2 |
| Year of publication | 114 |
| Author (Chinese) | 蔡馥璟 |
| Author (English) | Fu-Ching Tsai |
| Student ID | 612780063 |
| Degree | Master's |
| Language | English |
| Second language | |
| Date of oral defense | 2025-06-28 |
| Number of pages | 109 |
| Thesis committee | Advisor: 張志勇 (cychang@mail.tku.edu.tw); Committee members: 張義雄, 張榮貴 |
| Keywords (Chinese) | 引導式對話; 話題來源選擇; 偏離度控制; 增強式學習; 槽位填充; Dueling DQN |
| Keywords (English) | Guided Conversation; Topic Source Selection; Deviation Degree Control; Reinforcement Learning; Slot Filling; Dueling DQN |
| Keywords (third language) | |
| Subject classification | |
| Abstract (Chinese) |
隨著台灣步入超高齡社會,如何有效且溫和地蒐集獨居長者的健康資訊成為迫切議題。傳統封閉式提問常引發防衛心理,亦缺乏對語境偏離的容錯能力。為此,本研究提出一套名為「虛擬兒女聊天系統」之創新對話架構,透過引導式對話、偏離度建模與增強學習決策等模組,提升對話自然度與健康資料蒐集效率。
系統設計分為四大核心模組:
(1) 引導式聊天劇本:根據背景、作息與興趣三種話題來源,結合目標槽位(用藥、飲食、睡眠、活動),生成具情境性的多版本對話劇本;
(2) 偏離度控制:建構三維偏離度空間,用以量化話題來源、槽位目標與語義偏離程度,實現 dead-end rescue 話題補救機制;
(3) 雙大型語言模型對弈:使用 ChatGPT 與 Gemini 模擬兒女與長者對話,快速產出高擬真訓練資料;
(4) 增強式學習決策:以 Dueling DQN 架構學習對話策略,根據槽位填充狀況與偏離度,選擇最佳劇本以提升多輪任務完成率。
實驗以 MultiWOZ 2.0/2.1、ChatGPT-Gemini 模擬語料與偏題擾動版本進行分析,結果顯示本系統在資訊填槽率、偏離容錯力與互動輪數等指標上皆優於傳統方法與消融版本。綜合而言,本研究在多輪健康對話中有效融合情境適應性與策略彈性,為虛擬陪伴型對話系統於高齡照護場域提供具體實作與發展潛力。 |
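The deviation-degree idea in the abstract can be sketched minimally. The thesis scores deviation with SBERT sentence embeddings; the two-dimensional toy vectors and the 0.5 rescue threshold below are illustrative assumptions only, not values from the thesis:

```python
import numpy as np

def deviation_degree(script_vec: np.ndarray, response_vec: np.ndarray) -> float:
    """Deviation as 1 - cosine similarity between the guided-script embedding
    and the elder's reply embedding (SBERT vectors in the thesis; toy vectors here)."""
    cos = float(np.dot(script_vec, response_vec)
                / (np.linalg.norm(script_vec) * np.linalg.norm(response_vec)))
    return 1.0 - cos

RESCUE_THRESHOLD = 0.5  # hypothetical cutoff for triggering dead-end rescue

on_topic = deviation_degree(np.array([1.0, 0.2]), np.array([0.9, 0.3]))
off_topic = deviation_degree(np.array([1.0, 0.2]), np.array([-0.1, 1.0]))
needs_rescue = off_topic > RESCUE_THRESHOLD  # regenerate scripts when too far off
```

A reply close to the script's topic yields a small deviation, while an unrelated reply crosses the threshold and would trigger the rescue mechanism (requesting new scripts).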
| Abstract (English) |
As Taiwan officially enters a super-aged society, collecting health information from elderly individuals, especially those living alone, has become increasingly critical. Traditional closed-ended questioning often induces resistance and lacks the flexibility to handle conversational deviations. This study proposes a novel system named the "Virtual Child Conversational System," designed to improve natural interaction and data collection through guided dialogue, deviation modeling, and reinforcement learning-based decision making.
The system architecture consists of four key components:
(1) Guided Conversation Scripts: twelve scenario-based dialogues generated from three topic sources (background, routines, and interests) combined with four target slots (medication, diet, sleep, and activity);
(2) Deviation Control: a 3D deviation space quantifies topic origin, slot type, and deviation level, enabling dead-end rescue strategies;
(3) Dual Large Language Model Simulation: ChatGPT and Gemini simulate multi-round elderly-child dialogues, generating diverse and realistic training samples;
(4) Reinforcement Learning with Dueling DQN: slot-filling status and deviation degree inform the selection of optimal scripts to accelerate task completion across dialogue turns.
Experiments using MultiWOZ 2.0/2.1, ChatGPT-Gemini simulated datasets, and off-topic variants demonstrate superior performance in slot filling, deviation tolerance, and dialogue efficiency compared to baseline and ablated models. Overall, the proposed system effectively integrates adaptability and strategic flexibility, offering a practical and scalable solution for elderly care through virtual companionship dialogues. |
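The Dueling DQN decision step in component (4) can be illustrated with a small forward pass. Only the aggregation rule Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a') comes from the dueling architecture [20]; the layer sizes, random weights, and the 5-dimensional state encoding (four slot flags plus a deviation degree) are illustrative assumptions, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# State: four slot-filling flags (medication, diet, sleep, activity) plus a
# deviation degree. Actions: candidate guided scripts (twelve variants in the
# thesis). Weights are random stand-ins, not the trained policy.
STATE_DIM, N_ACTIONS, HIDDEN = 5, 12, 16
W1 = rng.normal(scale=0.5, size=(STATE_DIM, HIDDEN))
W_value = rng.normal(scale=0.5, size=(HIDDEN, 1))        # value stream V(s)
W_adv = rng.normal(scale=0.5, size=(HIDDEN, N_ACTIONS))  # advantage stream A(s, a)

def q_values(state: np.ndarray) -> np.ndarray:
    h = np.tanh(state @ W1)   # shared feature layer
    value = h @ W_value       # scalar state value
    adv = h @ W_adv           # per-script advantages
    # Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
    return (value + adv - adv.mean(axis=-1, keepdims=True)).ravel()

state = np.array([1.0, 0.0, 1.0, 0.0, 0.3])  # diet/activity slots still empty
q = q_values(state)
best_script = int(np.argmax(q))  # script index the greedy policy would pick
```

Subtracting the mean advantage makes the value/advantage split identifiable, so the mean of the Q-values equals V(s); the greedy policy then picks the script with the highest Q-value given the current slot-filling progress and deviation.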
| Abstract (third language) | |
| Table of Contents |
Table of Contents VI
List of Figures IX
List of Tables XII
Chapter 1 Introduction 1
Chapter 2 Related Work 14
  2.1 Applications of Guided Dialogue Systems 14
  2.2 Handling Dialogue Deviation and Topic Recovery Strategies 17
  2.3 Reinforcement Learning for Multi-Slot Information Collection 18
  2.4 Overview and Comparative Analysis 20
Chapter 3 Background Knowledge 22
  3.1 SBERT 23
  3.2 Dueling DQN 24
  3.3 Wav2Lip 26
  3.4 F5TTS 28
  3.5 ChatGPT 29
  3.6 Gemini 32
Chapter 4 System Design 35
  4.1 Overall System Architecture 35
  4.2 Data Collection and Preprocessing 39
    4.2.1 Children's Voice and Visual Data 39
    4.2.2 Elder Information 41
  4.3 Script Design and Generation for Guided Conversation 42
    4.3.1 Target Slots 43
    4.3.2 Guided Chat Script Design 44
    4.3.3 Content of Guided Chat Scripts 45
  4.4 Duel of Dual Large Language Models 46
    4.4.1 Role Assignment of Large Language Models 47
  4.5 Reinforcement Learning Framework and Reward Design 53
    4.5.1 State 54
    4.5.2 Action 55
    4.5.3 Reward Function 56
  4.6 Model Training Process 60
    4.6.1 Source of Input Samples 60
    4.6.2 Strategy Learning and Loss Function Design 61
    4.6.3 Parameter Update and Optimizer 63
  4.7 Deployment Phase 63
    4.7.1 Virtual Child Video Generation 64
    4.7.2 Virtual Child Dialogue System Application 65
Chapter 5 Experimental Analysis 68
  5.1 Environment and System Configuration 68
  5.2 Datasets 69
    5.2.1 Public Dataset 69
    5.2.2 Simulated Dialogue Dataset 71
  5.3 Experimental Results 72
    5.3.1 Analysis of Existing Model Comparisons 73
    5.3.2 Module Ablation Study 86
    5.3.3 Model Stability and Learning Efficiency 91
Chapter 6 Conclusion 102
  6.1 Completed Work of This Study 102
  6.2 Future Work 104
References 106

List of Figures
Figure 1. Taiwan Becomes a Super-Aged Society in 2025 1
Figure 2. Top 5 Questions for Elderly People and Young People 2
Figure 3. System Architecture Diagram 7
Figure 4. Research Contributions 10
Figure 5. Architecture of SBERT (Sentence-BERT) Model [29] 24
Figure 6. Reinforcement Learning Framework 25
Figure 7. Architecture of Wav2Lip Model [32] 27
Figure 8. Architecture of F5TTS Model [33] 29
Figure 9. Training Process of GPT [43] 30
Figure 10. Application Scope of GPT Models [43] 31
Figure 11. Gemini's Long-Context Comprehension Illustration [22] 32
Figure 12. Gemini's Multimodal Processing Capabilities [22] 33
Figure 13. Overall System Architecture 37
Figure 14. Overall System Architecture 38
Figure 15. Audio Preprocessing Pipeline 40
Figure 16. Multi-Emotion Facial Expression Clips 40
Figure 17. Image Preprocessing Workflow 41
Figure 18. Lip-Syncing Alignment Process 41
Figure 19. Description of Four Target Slots 43
Figure 20. ChatGPT-Based Generation of 12 Script Variants 44
Figure 21. Example of Guided Conversational Script 45
Figure 22. Prompt Engineering for Script Generation 46
Figure 23. Role Allocation: ChatGPT [21], System, and Gemini [22] 47
Figure 24. Script Selection by the System 48
Figure 25. Script and Random Deviation Sent to Gemini [22] 48
Figure 26. Simulated Elderly Response Generated by Gemini [22] 49
Figure 27. Slot Filling Detection by ChatGPT [21] 50
Figure 28. Deviation Scoring by SBERT [29] 50
Figure 29. Low Deviation: Continue Dialogue with Gemini [22] 51
Figure 30. High Deviation: System Requests New Scripts from ChatGPT [21] 52
Figure 31. Continues Dialogue with Gemini [22] Using Regenerated Scripts 52
Figure 32. Example of Dialogue Record with Annotations 53
Figure 33. State Design in Reinforcement Learning Framework 55
Figure 34. Action Design and Script Selection Strategy 56
Figure 35. Input Data Format for Dueling DQN Model [20] 61
Figure 36. Scenario Overview of System Deployment Phase 64
Figure 37. Virtual Child Video Generation Process 65
Figure 38. Functional Modules of the Virtual Child Dialogue System 66
Figure 39. Sample Dialogue Interface on LINE Platform 66
Figure 40. Action Selection via Trained Dueling DQN Policy 67
Figure 41. Response Generation from the Virtual Child 67
Figure 42. Health Report Generation Example 67
Figure 43. Performance heatmap of slot-filling tasks 76
Figure 44. Evaluation line chart on the MultiWOZ 2.0 [36] dataset 80
Figure 45. Evaluation line chart on the MultiWOZ 2.1 [37] dataset 81
Figure 46. Performance on MultiWOZ 2.0 [36] with 30% off-topic utterances 84
Figure 47. Radar chart on the ChatGPT × Gemini dataset 87
Figure 48. Radar chart on the MultiWOZ 2.0 [36] dataset 88
Figure 49. Radar chart on the MultiWOZ 2.1 [37] dataset 89
Figure 50. Different ablated variants on the MultiWOZ 2.0 [36] dataset 91
Figure 51. Different variants on MultiWOZ 2.0 [36] with 30% off-topic utterances 92
Figure 52. Different ablated variants on the ChatGPT × Gemini dataset 93
Figure 53. Training curve of the full model 94
Figure 54. Stability evaluation on the MultiWOZ 2.0 [36] dataset 98
Figure 55. Stability evaluation on the MultiWOZ 2.0 [36] + off-topic dataset 99
Figure 56. Stability evaluation on the ChatGPT × Gemini dataset 100

List of Tables
Table 1. Comparative Summary of Related Work 21
Table 2. Deviation Penalty Matrix 58
Table 3. Experimental Environment of the Proposed System 68
Table 4. Performance evaluation of slot-filling tasks 76
Table 5. Performance evaluation on the MultiWOZ 2.0 [36] dataset 80
Table 6. Performance evaluation on the MultiWOZ 2.1 [37] dataset 81
Table 7. Evaluation on MultiWOZ 2.0 [36] with 30% off-topic utterances 84
Table 8. Ablation study on the ChatGPT × Gemini dataset 87
Table 9. Ablation study on the MultiWOZ 2.0 [36] dataset 88
Table 10. Ablation study on the MultiWOZ 2.1 [37] dataset 89 |
| References |
[1] A. Algherairy and M. Ahmed, “A review of dialogue systems: current trends and future directions,” Neural Computing and Applications, vol. 36, no. 12, pp. 6325–6351, 2024.
[2] W. He et al., “Unified dialog model pre-training for task-oriented dialog understanding and generation,” in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 187–200, 2022.
[3] W. He et al., “GALAXY: A generative pre-trained model for task-oriented dialog with semi-supervised learning and explicit policy injection,” in Proceedings of the AAAI Conference on Artificial Intelligence, pp. 10749–10757, 2022.
[4] Y. Jang, J. Lee, and K.-E. Kim, “GPT-Critic: Offline reinforcement learning for end-to-end task-oriented dialogue systems,” in Proceedings of the International Conference on Learning Representations (ICLR), 2022.
[5] H. Jeon and G. G. Lee, “Domain state tracking for a simplified dialogue system,” arXiv preprint arXiv:2103.06648, 2021.
[6] Z. Lin, A. Madotto, G. I. Winata, and P. Fung, “MinTL: Minimalist transfer learning for task-oriented dialogue systems,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
[7] A. Alessa and H. Al-Khalifa, “Towards designing a ChatGPT conversational companion for elderly people,” in Proceedings of the 16th International Conference on Pervasive Technologies Related to Assistive Environments, pp. 667–674, 2023.
[8] N. Gasteiger, K. Loveys, M. Law, and E. Broadbent, “Friends from the future: a scoping review of research into robots and computer agents to combat loneliness in older people,” Clinical Interventions in Aging, pp. 941–971, 2021.
[9] R. Higashinaka, T. Minato, H. Nishizaki, and T. Nagai, “Proceedings of the Dialogue Robot Competition 2023,” arXiv preprint arXiv:2312.14430, 2023.
[10] K. McNamara and E. Rudy, “Companionship to address quality of life and loneliness among older adults with severe loneliness,” Innovation in Aging, vol. 6, no. Suppl 1, p. 714, 2022.
[11] E. Rudy, K. McNamara, R. Patel, and C. Sturm, “A virtual companionship intervention reduces loneliness during the COVID-19 pandemic,” Innovation in Aging, vol. 5, no. Suppl 1, p. 958, 2021.
[12] S. Tokunaga, K. Tamura, and M. Otake-Matsuura, “A dialogue-based system with photo and storytelling for older adults: toward daily cognitive training,” Frontiers in Robotics and AI, vol. 8, p. 644964, 2021.
[13] T. Nishio et al., “The effects of physically embodied multiple conversation robots on the elderly,” Frontiers in Robotics and AI, vol. 8, p. 633045, 2021.
[14] N. Shikha, K. Naidu, A. R. Choudhury, and N. Kayarvizhy, “Smart memory companion for elderly,” in Proceedings of the 2022 4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), IEEE, pp. 1497–1502, 2022.
[15] A. Kiran, A. Balaram, P. Parshapu, S. L. Naik, P. Purushotham, and M. Silparaj, “AI-enhanced elderly care companion,” in Proceedings of the 2024 International Conference on Science Technology Engineering and Management (ICSTEM), IEEE, pp. 1–5, 2024.
[16] N. Matsumoto and K. Ando, “An active listening dialogue model focused on ‘open questions’ using reinforcement learning,” in Proceedings of the 2024 16th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI), IEEE, pp. 499–504, 2024.
[17] S. Z. Razavi, L. K. Schubert, K. Van Orden, M. R. Ali, B. Kane, and E. Hoque, “Discourse behavior of older adults interacting with a dialogue agent competent in multiple topics,” ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 12, no. 2, pp. 1–21, 2022.
[18] C. Zhai and S. Wibowo, “A WGAN-based dialogue system for embedding humor, empathy, and cultural aspects in education,” IEEE Access, vol. 11, pp. 79706–79717, July 2023.
[19] Y. Zhao, M. Dastani, J. Long, Z. Wang, and S. Wang, “Rescue conversations from dead-ends: Efficient exploration for task-oriented dialogue policy optimization,” Transactions of the Association for Computational Linguistics, vol. 12, pp. 1578–1596, 2024.
[20] M. Sewak, “Deep Q network (DQN), double DQN, and dueling DQN: A step towards general artificial intelligence,” in Deep Reinforcement Learning: Frontiers of Artificial Intelligence, Springer, pp. 95–108, 2019.
[21] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” 2018.
[22] G. Team et al., “Gemini: A family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
[23] Y. Liu et al., “A review of reinforcement learning for natural language processing and applications in healthcare,” Journal of the American Medical Informatics Association, vol. 31, no. 10, pp. 2379–2393, 2024.
[24] K. Lu, S. Zhang, and X. Chen, “Goal-oriented dialogue policy learning from failures,” in Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2596–2603, 2019.
[25] Y.-C. Wu and C. E. Rasmussen, “Clipping loops for sample-efficient dialogue policy optimisation,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3420–3428, 2021.
[26] T. H. Bui, M. Rajman, and M. Melichar, “Rapid dialogue prototyping methodology,” in Proceedings of the International Conference on Text, Speech and Dialogue, Springer, pp. 579–586, 2004.
[27] Y. Feng et al., “Fantastic rewards and how to tame them: A case study on reward learning for task-oriented dialogue systems,” in Proceedings of the 11th International Conference on Learning Representations (ICLR), 2024.
[28] H. Du, S. Li, M. Wu, X. Feng, Y.-F. Li, and H. Wang, “Rewarding what matters: Step-by-step reinforcement learning for task-oriented dialogue,” in Findings of the Association for Computational Linguistics: EMNLP, 2024.
[29] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3982–3992, 2019.
[30] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3, pp. 279–292, 1992.
[31] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[32] K. R. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. V. Jawahar, “A lip sync expert is all you need for speech to lip generation in the wild,” in Proceedings of the 28th ACM International Conference on Multimedia, pp. 484–492, 2020.
[33] S. E. Eskimez et al., “E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS,” in Proceedings of the 2024 IEEE Spoken Language Technology Workshop (SLT), IEEE, pp. 682–689, 2024.
[34] A. Vaswani et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[35] L. Liu et al., “Improving alignment of text-to-image diffusion models with reinforcement learning from human feedback,” in Proceedings of the 31st ACM International Conference on Multimedia (ACM MM), Ottawa, ON, Canada, pp. 222–231, 2023.
[36] P. Budzianowski et al., “MultiWOZ - A large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.
[37] M. Eric et al., “MultiWOZ 2.1: A consolidated multi-domain dialogue dataset with state corrections and state tracking baselines,” in Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC 2020), 2020.
[38] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
[39] E. Hosseini-Asl, B. McCann, C.-S. Wu, S. Yavuz, and R. Socher, “A simple language model for task-oriented dialogue,” Advances in Neural Information Processing Systems, vol. 33, pp. 20179–20191, 2020.
[40] Y. Su et al., “Multi-task pre-training for plug-and-play task-oriented dialogue system,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
[41] Y. Yang, Y. Li, and X. Quan, “UBAR: Towards fully end-to-end task-oriented dialog system with GPT-2,” in Proceedings of the AAAI Conference on Artificial Intelligence, pp. 14230–14238, 2021.
[42] H. Touvron et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[43] G. Yenduri et al., “GPT (Generative Pre-trained Transformer): A comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions,” IEEE Access, vol. 12, pp. 438849–438897, Apr. 2024.
[44] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional Transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, pp. 4171–4186, 2019. |
| Full-text access rights | |