| System ID | U0002-0508202521095000 |
|---|---|
| DOI | 10.6846/tku202500669 |
| Title (Chinese) | 短影音行銷影片自動生成系統之設計與實作 |
| Title (English) | Design and Implementation of an Automated Short Marketing Video Generation System |
| Title (Third Language) | |
| University | 淡江大學 (Tamkang University) |
| Department (Chinese) | 資訊工程學系全英語碩士班 |
| Department (English) | Master's Program, Department of Computer Science and Information Engineering (English-taught program) |
| Foreign Degree School Name | |
| Foreign Degree College Name | |
| Foreign Degree Institute Name | |
| Academic Year | 113 |
| Semester | 2 |
| Year of Publication | 114 (2025) |
| Author (Chinese) | 陳芃諭 |
| Author (English) | Peng-Yu Chen |
| Student ID | 613780120 |
| Degree | Master's |
| Language | English |
| Second Language | |
| Date of Oral Defense | 2025-06-08 |
| Number of Pages | 99 |
| Oral Defense Committee | Advisor: 張志勇 (cychang@mail.tku.edu.tw); Co-advisor: 武士戎 (wushihjung@mail.tku.edu.tw); Committee Member: 張榮貴; Committee Member: 張義雄 |
| Keywords (Chinese) | 多模態生成、提示工程、生成式AI、短影音行銷、影片生成 |
| Keywords (English) | Multimodal Generation; Prompt Engineering; Generative AI; Short-form Video Marketing; Video Generation |
| Keywords (Third Language) | |
| Subject Classification | |
| Abstract (Chinese) |
本研究聚焦於「跨模態自動生成行銷影片」系統設計與實作,提出一套整合語言複製、畫面構建與音樂生成的完整流程。隨著多模態生成技術的快速發展,越來越多企業或平台希望藉由自動化方式將產品文案轉化為視覺行銷內容,提升內容產出效率與風格一致性。然而,目前多數影片生成方法在語意控制、模組整合與產業應用層面仍存有諸多挑戰,缺乏系統化的解法與可擴展的應用架構。本研究的核心目的即在於建立一個具語境一致性、多段結構控制能力,且適用於商業場景的影片生成系統。

目前相關研究雖涵蓋提示工程、文本檢索生成圖片與影片生成等三大模態技術,但仍面臨三項主要挑戰:(1)提示設計多聚焦於單一指令型或範例型輸入,缺乏段落結構與語境控制能力;(2)圖片生成模型普遍偏重單張畫面美感,無法針對商業視覺進行控制式生成;(3)影片生成方法則侷限於短秒數、單句輸入任務,無法處理腳本式敘事或多段落提示。此外,現有技術也多未能整合真實企業資料與產業需求,導致產出內容缺乏真實性與落地可行性。

為解決上述問題,本研究提出一套跨模態自動生成影片系統,分為三大模組:(1)腳本生成模組採用改良提示工程,整合CoT思維鏈、角色扮演與語意欄位格式化提示;(2)畫面構建模組中導入圖文檢索模型,透過自製資料集進行微調訓練,且使用SDXL圖片生成模型導入Food-IAC資料,以LoRA進行企業視覺微調;(3)影片畫面透過尾幀連續生成延長影片秒數。

本研究貢獻主要有五項:(1)建立具語意欄位與段落邏輯控制的提示工程架構,有效改善腳本生成的一致性與創造力;(2)設計多模態腳本至影片轉換流程,支援多段、語氣與畫面一致控制;(3)提出結合圖文檢索與SDXL微調之畫面檢索生成流程,強化商業視覺對應度;(4)動畫生成模組中設計「動畫尾幀連續生成」策略,延長影片秒數同時保持畫面順暢度;(5)提出影片整合對齊策略,使影片可以實際達成「全自動生成」之目的。

實驗設計方面,影片模組透過FVD、CLIPSIM與Content Consistency三項指標進行量化評估,在MSR-VTT與WebVid10M資料集上均優於現有T2V模型,展現跨段落、語氣與畫面一致性。進一步之消融分析亦指出,畫面生成模組若缺少特定呈現策略(如文字稿對應、動態畫面、品牌風格等),腳本內容評分將顯著下降,顯示本研究之整合模組設計為提升內容精準度之關鍵。

提示工程模組透過三種大型語言模型(GPT-4o、Claude Sonnet 4.0、Gemini 1.5)進行雙面向主觀評估,分別為「內容豐富度」與「劇情流暢度」,並呈現顯著優於基準方法之表現。在腳本提示消融實驗中,整合CoT與角色情境控制的提示設計取得高分。靜態畫面模組方面,圖文檢索與圖片生成模組皆證明為畫面豐富度的關鍵因子,缺一將導致視覺品質大幅下降。
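To make the script-generation module described above concrete, the following Python sketch shows one possible way to combine a role-playing system prompt, semantic-field formatting, and a Chain-of-Thought instruction in a single request. The field names, the example restaurant content, and the OpenAI chat client are illustrative assumptions, not the implementation used in the thesis.

```python
# Minimal sketch of a CoT + role-playing + semantic-field prompt for
# marketing-script generation. All field names and the example values
# are hypothetical; any capable chat LLM could replace the model below.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_ROLE = (
    "You are a senior short-form video marketing copywriter for a restaurant "
    "brand. Stay in this role for the whole conversation."
)

def build_script_prompt(product: str, tone: str, audience: str, segments: int) -> str:
    """Format the request as labelled semantic fields plus a CoT instruction."""
    return (
        f"[Product]: {product}\n"
        f"[Brand tone]: {tone}\n"
        f"[Target audience]: {audience}\n"
        f"[Segments]: {segments}\n\n"
        "Think step by step: first outline the storyboard segment by segment, "
        "then write the voice-over line and the visual description for each "
        "segment. Return numbered segments with 'Narration:' and 'Visual:' fields."
    )

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative choice of chat model
    messages=[
        {"role": "system", "content": SYSTEM_ROLE},
        {"role": "user", "content": build_script_prompt(
            product="signature beef noodle soup",
            tone="warm, family-style",
            audience="office workers looking for lunch",
            segments=4,
        )},
    ],
)
print(response.choices[0].message.content)
```

The same prompt layout (role, labelled fields, step-by-step instruction) can be reused with other chat LLMs; only the client call would change.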
| Abstract (English) |
This study focuses on the design and implementation of a cross-modal automatic marketing video generation system, proposing a complete pipeline that integrates language scripting, image construction, and music generation. With the rapid development of multimodal generation technologies, an increasing number of enterprises and platforms are seeking automated solutions that convert product descriptions into visual marketing content, aiming to enhance production efficiency and stylistic consistency. However, most current video generation approaches still face challenges in semantic control, module integration, and industrial applicability, and lack a systematic, scalable solution. The core objective of this research is to construct a video generation system that supports contextual coherence and multi-paragraph structural control and is suitable for commercial scenarios.

Although existing studies cover three major modalities (prompt engineering, text-conditioned image generation, and video generation), they still face three major limitations: (1) prompt design often focuses on single instruction-based or few-shot inputs, lacking paragraph-level structural and contextual control; (2) image generation models typically emphasize the aesthetics of a single frame, making it difficult to perform controllable generation aligned with commercial visual requirements; and (3) video generation methods are constrained to short durations and single-sentence inputs, making them unsuitable for structured scripting or multi-paragraph storytelling. Furthermore, existing techniques rarely incorporate real enterprise data or industrial demands, so their outputs lack authenticity and practical feasibility.

To address these challenges, this research proposes a cross-modal automatic video generation system consisting of three main modules: (1) a script generation module that adopts improved prompt engineering, integrating Chain-of-Thought (CoT) reasoning, role-playing, and semantic-field formatting; (2) an image construction module that incorporates an image-text retrieval model fine-tuned on a custom dataset, together with an SDXL image generation model refined on the Food-IAC dataset using LoRA for enterprise-specific visual adaptation; and (3) a video generation strategy that extends video duration through a tail-frame recursive generation technique.
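As an illustration of the tail-frame recursive generation technique in module (3), the sketch below chains short Stable Video Diffusion clips by reusing the last frame of each clip as the conditioning image for the next one. The checkpoint name, clip count, resolution, and frame counts are assumptions for demonstration, not the configuration used in the thesis.

```python
# Minimal sketch of "tail-frame recursive generation": each new clip is
# conditioned on the last frame of the previous clip to extend duration.
# Model id and parameters are assumptions, not the thesis's exact setup.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")  # requires a GPU

# Hypothetical keyframe produced by the image construction module.
image = load_image("segment_keyframe.png").resize((1024, 576))
all_frames = []

for _ in range(3):  # chain three short clips back to back
    clip = pipe(image, num_frames=25, decode_chunk_size=8).frames[0]
    all_frames.extend(clip)
    image = clip[-1]  # the tail frame conditions the next clip

export_to_video(all_frames, "segment_animation.mp4", fps=8)
```

Because each clip starts from the previous clip's last frame, the concatenated result stays visually continuous while the total duration grows linearly with the number of iterations.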
This study makes five key contributions: (1) it establishes a prompt engineering framework with semantic fields and paragraph-level logic control, improving consistency and creativity in script generation; (2) it designs a multimodal script-to-video pipeline that supports consistency across paragraphs, tone, and visuals; (3) it proposes a combined retrieval-generation process using CLIP and fine-tuned SDXL to strengthen visual alignment with commercial requirements; (4) it introduces a tail-frame recursive generation strategy that extends video duration while maintaining visual coherence; and (5) it develops a video alignment strategy that enables fully automatic generation across modalities.

In the experiments, the video module is evaluated with three quantitative metrics (FVD, CLIPSIM, and Content Consistency) on the MSR-VTT and WebVid-10M datasets, outperforming existing text-to-video (T2V) models in paragraph alignment, tonal consistency, and visual coherence. Ablation studies further show that omitting key strategies (e.g., subtitle matching, dynamic visuals, or brand style) significantly lowers content quality, confirming the importance of the integrated module design for semantic precision. The prompt engineering module was evaluated subjectively by three large language models (GPT-4o, Claude Sonnet 4.0, and Gemini 1.5) on two dimensions, content richness and narrative fluency; the proposed design, which integrates CoT reasoning with contextual role prompts, achieved significantly higher scores than baseline methods. For static visual quality, both the image retrieval and image generation modules proved essential: removing either leads to a substantial decline in visual diversity and quality. Together, these results confirm that the system's integrated architecture is critical to producing coherent, high-quality, and commercially aligned short-form marketing videos.
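For reference, a CLIPSIM-style score such as the one cited above is commonly computed as the mean CLIP similarity between each generated frame and the conditioning text. The sketch below follows that common formulation with a public CLIP checkpoint; it is an assumed reconstruction, not the thesis's evaluation code.

```python
# Hedged sketch of a CLIPSIM-style score: mean cosine similarity between
# CLIP embeddings of the generated frames and the prompt text. The
# checkpoint and the averaging follow common practice, not necessarily
# the evaluation script used in the thesis.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clipsim(frames: list[Image.Image], prompt: str) -> float:
    """Average frame-text cosine similarity over a list of video frames."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

# Example (hypothetical paths): score frames produced by the recursive
# generation sketch above against the segment's visual description.
# frames = [Image.open(p) for p in sorted(glob.glob("frames/*.png"))]
# print(clipsim(frames, "a steaming bowl of beef noodle soup on a wooden table"))
```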
| Abstract (Third Language) | |
| Table of Contents |
LIST OF CONTENT VIII
LIST OF FIGURE XI
LIST OF TABLE XIII
Chapter 1 Introduction 1
  1-1 Background Introduction 1
  1-2 Positioning of This Thesis 2
    1-2-1 Gap Between Generated Visuals and Business Needs 3
    1-2-2 Difficulty Maintaining Visual Continuity 3
    1-2-3 Difficulty in Semantic Alignment Between Script and Visuals 4
    1-2-4 Complexity of Multimodal Integration 4
  1-3 Research Methodology 4
  1-4 Research Contributions 6
    1-4-1 Combining Image-Text Retrieval and Generative Models to Balance Corporate Authenticity and Visual Quality 6
    1-4-2 Proposing an Animation End-Frame Recursive Generation Mechanism to Overcome Generation Time Limitations 6
    1-4-3 Enhancing Script Semantic Controllability through Chain-of-Thought Reasoning and Role-Playing Prompts 7
    1-4-4 Introducing a Voice-Guided Multimodal Alignment Strategy 7
Chapter 2 Related Work 9
  2-1 Multimodal Content Generation and Integration 9
    2-1-1 Prompt Engineering and Language Generation Control Technology 9
    2-1-2 Static Image Generation Technology 13
    2-1-3 Dynamic Video Generation Technology 17
  2-2 Overview 20
Chapter 3 Background Knowledge 22
  3-1 CLIP (Contrastive Language-Image Pretraining) 22
  3-2 LLM (Large Language Model) 24
  3-3 SDXL (Stable Diffusion XL) 25
  3-4 LoRA (Low-Rank Adaptation) 26
  3-5 SVD (Stable Video Diffusion) 28
  3-6 InstructPix2Pix 29
  3-7 F5-TTS (Fakes Fluent and Faithful Speech) 31
  3-8 MusicGen 32
Chapter 4 System Design 34
  4-1 Data Collection and Preprocessing 35
  4-2 Image-Text Retrieval Training 42
    4-2-1 Secondary Data Augmentation 42
    4-2-2 Fine-Tuning Phase of the Chinese CLIP Model 43
  4-3 Chain-of-Thought Storyboard Design for Script Generation 44
  4-4 Extracting Voice Features and Generating Speech 47
  4-5 Background Music Generation 48
  4-6 Image Generation and Fine-tuning Module 49
    4-6-1 Dish CLIP Image-Text Retrieval Application Phase 50
    4-6-2 SDXL + LoRA Image Generation Training Phase 52
    4-6-3 SDXL + LoRA Image Generation (Inference Phase) 54
  4-7 Dynamic Scene Generation 55
    4-7-1 Animation Generation Technique 56
    4-7-2 Animation Tail-Frame Recursive Generation Design 57
  4-8 Video Alignment Strategy 59
Chapter 5 Experimental Analysis 63
  5-1 Dataset 63
  5-2 Environment and System Parameter Settings 65
  5-3 Experimental Results 65
    5-3-1 Video Frame Retrieval, Generation, and Temporal Fluency 66
    5-3-2 Evaluation of Prompt Engineering Using LLM 74
    5-3-3 Ablation Study 85
Chapter 6 Conclusion 93
REFERENCE 96

LIST OF FIGURE
Figure 1. CLIP [20] Model Architecture Diagram 23
Figure 2. SDXL [23] Model Architecture Diagram 26
Figure 3. LoRA [24] Model Architecture Diagram 27
Figure 4. Architecture of the Stable Video Diffusion (SVD) [25] Model 28
Figure 5. InstructPix2Pix [26] Training Pipeline 30
Figure 6. F5-TTS [27] Model Architecture Diagram 32
Figure 7. System Architecture Diagram of the Proposed Framework 34
Figure 8. Matplotlib Annotation of Dish Name Location in Image 37
Figure 9. Preprocessing of Dish Images 38
Figure 10. Generation of Company Information Document 40
Figure 11. User Input Data Preprocessing 41
Figure 12. Training Phase of CLIP-based Text-Image Retrieval for Dishes 44
Figure 13. Chain-of-Thought-Based Prompt Design for Script Generation 45
Figure 14. Voiceover Speech Generation Module 48
Figure 15. CLIP-Based Image-Text Retrieval Phase for Dish Selection 51
Figure 16. Structure of the Food-IAC Dataset [31] 53
Figure 17. SDXL + LoRA Fine-tuning Workflow 54
Figure 18. Workflow of Image Generation Using Fine-tuned SDXL Model 55
Figure 19. Initial Animation Video Generation Design 58
Figure 20. Process Flow of Recursive Last-Frame Video Generation 59
Figure 21. Voice-over Segment Generation 60
Figure 22. Background Music Alignment 60
Figure 23. Static Image Generation per Segment 61
Figure 24. Animation Alignment Based on Narration Duration 62
Figure 25. Structure of the MSR-VTT Dataset [29] 64
Figure 26. Structure of the WebVid-10M Dataset [30] 64
Figure 27. Loss Curve of CLIP Text-Image Retrieval Model During Training 69
Figure 28. Loss Variation During SDXL Text-to-Image Model Training 70
Figure 29. Visualization Results of Different Prompt Engineering Strategies 77
Figure 30. Box Plot of Content Richness 81
Figure 31. Heatmap of Content Richness 83
Figure 32. Box Plot of Narrative Coherence 84
Figure 33. Heatmap of Narrative Coherence 85
Figure 34. Box Plot of Visual Richness in Ablation Study 90
Figure 35. Bar Chart of Script Content Richness in Ablation Study 92

LIST OF TABLE
Table 1. Comparison of Related Studies on Prompt Engineering 20
Table 2. Comparison of Related Studies on Static Image Generation 21
Table 3. Comparison of Related Studies on Dynamic Video Generation 21
Table 4. System Experimental Environment 65
Table 5. Configuration of Parameters for Marketing Video Frame Generation 67
Table 6. Comparison of Visual Quality and Temporal Smoothness 73
Table 7. Comparative Evaluation of Prompt Engineering Strategies by LLM 78
Table 8. Statistical Summary of Content Richness 80
Table 9. Pearson Correlation Coefficients for Content Richness 82
Table 10. Statistical Summary of Narrative Coherence 83
Table 11. Statistical Summary of Narrative Coherence 85
Table 12. Ablation Study on System Techniques by LLMs 86
Table 13. Statistics of Ablation Study (Visual Richness) 88
Table 14. Statistics of Ablation Study (Script Content Richness) 90
| References |
[1] J. Wei, X. Wang et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022.
[2] S. Diao, P. Wang et al., "Active Prompting with Chain-of-Thought for Large Language Models," in Proc. 62nd Annual Meeting of the Association for Computational Linguistics (ACL), vol. 1: Long Papers, pp. 1330–1350, 2024.
[3] J. Kim et al., "Persona is a Double-edged Sword: Mitigating the Negative Impact of Role-playing Prompts in Zero-shot Reasoning Tasks," ACL ARR, 2024.
[4] X. Amatriain, "Prompt Design and Engineering: Introduction and Advanced Methods," arXiv preprint, arXiv:2401.14423, 2024.
[5] A. Ramesh et al., "Zero-Shot Text-to-Image Generation," in Proc. Int. Conf. Machine Learning (ICML), vol. 139, pp. 8821–8831, 2021.
[6] C. Saharia, W. Chan et al., "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding," in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022.
[7] H. Chefer, Y. Alaluf et al., "Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models," ACM Trans. Graph., vol. 42, no. 4, 2023.
[8] Z. Yang, J. Wang, Z. Gan et al., "ReCo: Region-Controlled Text-to-Image Generation," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), pp. 14246–14255, 2023.
[9] L. Han, Y. Li et al., "SVDiff: Compact Parameter Space for Diffusion Fine-Tuning," in Proc. IEEE/CVF Int. Conf. Computer Vision (ICCV), Paris, France, pp. 12678–12687, 2023.
[10] U. Singer et al., "Make-A-Video: Text-to-Video Generation without Text-Video Data," presented at Int. Conf. Learning Representations (ICLR), 2023.
[11] C. Wu, L. Li et al., "GODIVA: Generating Open-Domain Videos from Natural Descriptions," in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022.
[12] W. Hong, M. Ding et al., "CogVideo: Large-Scale Pretraining for Text-to-Video Generation via Transformers," in Proc. Int. Conf. Learning Representations (ICLR), 2023.
[13] M. Ding, Z. Yang et al., "CogView: Mastering Text-to-Image Generation via Transformers," in Advances in Neural Information Processing Systems (NeurIPS), 2021.
[14] Y. He, T. Yang et al., "Latent Video Diffusion Models for High-Fidelity Long Video Generation," arXiv preprint, arXiv:2211.13221, 2022.
[15] R. Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), pp. 10674–10685, 2022.
[16] J. Wang, H. Yuan et al., "ModelScope Text-to-Video Technical Report," arXiv preprint, arXiv:2308.06571, 2023.
[17] J. Li, D. Li et al., "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models," in Proc. 40th Int. Conf. Machine Learning (ICML), Honolulu, HI, pp. 120–135, 2023.
[18] Z. Xing, Q. Dai et al., "SimDA: Simple Diffusion Adapter for Efficient Video Generation," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), pp. 7827–7839, 2024.
[19] P. Ruan, P. Wang et al., "DEMO: Enhancing Motion in Text-to-Video Generation with Decomposed Encoder and Conditioning," presented at NeurIPS 2024 Dynamo Demonstrations, 2024.
[20] A. Radford, J. W. Kim et al., "Learning Transferable Visual Models From Natural Language Supervision," arXiv preprint, arXiv:2103.00020, 2021.
[21] T. B. Brown, B. Mann et al., "Language Models are Few-Shot Learners," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 1877–1901, 2020.
[22] M. Xia, T. Gao et al., "Sheared LLaMA: Accelerating Language Model Pre-Training via Structured Pruning," in Proc. 12th Int. Conf. Learning Representations (ICLR), Vienna, Austria, 2024.
[23] D. Podell et al., "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis," presented at Int. Conf. Learning Representations (ICLR), 2024.
[24] E. J. Hu, Y. Shen et al., "LoRA: Low-Rank Adaptation of Large Language Models," in Proc. Int. Conf. Learning Representations (ICLR), 2022.
[25] A. Blattmann et al., "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets," Stability AI technical report, 2023.
[26] T. Brooks, A. Holynski et al., "InstructPix2Pix: Learning to Follow Image Editing Instructions," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2023.
[27] Y. Chen, Z. Niu et al., "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching," arXiv preprint, arXiv:2410.06885, 2024.
[28] J. Copet, F. Kreuk et al., "Simple and Controllable Music Generation," arXiv preprint, arXiv:2306.05284, 2023.
[29] J. Xu, T. Mei et al., "MSR-VTT: A Large Video Description Dataset for Bridging Video and Language," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 5288–5296, 2016.
[30] M. Bain, A. Nagrani et al., "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval," in Proc. IEEE/CVF Int. Conf. Computer Vision (ICCV), 2021.
[31] Renovamen et al., "To be an Artist: Automatic Generation on Food Image Aesthetic Captioning," in Proc. IEEE Int. Conf. Tools with Artificial Intelligence (ICTAI), 2020.
[32] T. Unterthiner, S. van Steenkiste et al., "FVD: A New Metric for Video Generation," ICLR Workshop, 2019.
[33] T. Kou, X. Liu et al., "Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment," in Proc. ACM Int. Conf. Multimedia (ACM MM), Melbourne, Australia, 2024.
[34] Y. Liu, X. Cun et al., "EvalCrafter: Benchmarking and Evaluating Large Video Generation Models," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2024.
[35] T. Lee et al., "Grid Diffusion Models for Text-to-Video Generation," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2024.
[36] Y. Guo, C. Yang et al., "AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning," in Proc. 12th Int. Conf. Learning Representations (ICLR), 2024.
[37] Z. Luo et al., "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2023.
[38] X. Wang, S. Zhang et al., "A Recipe for Scaling up Text-to-Video Generation with Text-free Videos," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2024.
[39] J. Ho, W. Chan et al., "Imagen Video: High Definition Video Generation with Diffusion Models," in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022.
[40] S. Mittal, S. C. Raparthy et al., "Compositional Attention: Disentangling Search and Retrieval in Transformers," in Proc. Int. Conf. Learning Representations (ICLR), 2022.
[41] M. Li et al., "TrOCR: Transformer-Based Optical Character Recognition," in Proc. AAAI Conf. Artificial Intelligence, 2023.
[42] A. Chowdhery et al., "PaLM: Scaling Language Modeling with Pathways," arXiv preprint, arXiv:2204.02311, 2022.
[43] C. Raffel, N. Shazeer et al., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer," J. Mach. Learn. Res., vol. 21, no. 140, pp. 1–67, 2020.
[44] B. Peng, M. Galley et al., "Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback," Microsoft Research technical report, Mar. 2023.
[45] A. Vaswani, N. Shazeer et al., "Attention Is All You Need," in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017.
[46] A. Ramesh, P. Dhariwal et al., "Hierarchical Text-Conditional Image Generation with CLIP Latents," arXiv preprint, arXiv:2204.06125, 2022.
[47] D. P. Kingma and M. Welling, "Auto-Encoding Variational Bayes," in Proc. Int. Conf. Learning Representations (ICLR), 2014.
[48] Wyzowl, "Video Marketing Statistics 2023," Wyzowl, 2023.
[49] K. Tan, "8-Second Attention Spans and Short Videos: The Future of Video-First Marketing," Retail TouchPoints, Oct. 13, 2023.
[50] A. Dosovitskiy, L. Beyer et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," in Proc. Int. Conf. Learning Representations (ICLR), 2021.
[51] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
[52] OpenAI, "GPT-4o System Card," OpenAI system card, Aug. 8, 2024.
[53] Gemini Team, "Gemini 1.5: Unlocking Multimodal Understanding across Millions of Tokens of Context," Google DeepMind technical report, Feb. 2024.
[54] Anthropic, "System Card: Claude Opus 4 & Claude Sonnet 4," Anthropic system card, May 22, 2025.
[55] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," in Proc. Int. Conf. Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241, 2015.
[56] C. Raffel, N. Shazeer et al., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer," J. Mach. Learn. Res., vol. 21, no. 140, pp. 1–67, 2020.
[57] Y. LeCun, L. Bottou, Y. Bengio et al., "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[58] J. Ho, A. Jain, and P. Abbeel, "Denoising Diffusion Probabilistic Models," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 6840–6851, 2020.
[59] P. Isola, J.-Y. Zhu et al., "Image-to-Image Translation with Conditional Adversarial Networks," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 5967–5976, 2017.
[60] A. Hertz, R. Mokady et al., "Prompt-to-Prompt Image Editing with Cross-Attention Control," ACM Trans. Graph., vol. 42, no. 4, Art. no. 99, 2023.
[61] W. Peebles, J. T. Barron, A. Dosovitskiy, A. Nichol, and P. Dhariwal, "DiT: Diffusion Models Beat GANs on Image Synthesis," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.
[62] J. Lu, H. Yu et al., "A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding," in Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp. 7252–7273, Jul. 2025.
[63] Microsoft, "Recognize printed and handwritten text with OCR," Microsoft Learn, May 2024.
| Full-Text Use Permission | |