§ Thesis Bibliographic Record
System ID	U0002-2708202414162700
DOI	10.6846/tku202400724
Title (Chinese)	生成式AI驅動的ESG永續企業行銷短影片自動生成系統
Title (English)	Generative AI-Driven Automated Marketing Short Video Production System for ESG Sustainable Enterprises
University	Tamkang University
Department (Chinese)	資訊工程學系碩士班
Department (English)	Master's Program, Department of Computer Science and Information Engineering
Academic Year	112 (2023–24)
Semester	2
Publication Year	113 (2024)
Author (Chinese)	林婕涵
Author (English)	JIE-HAN LIN
Student ID	612410034
Degree	Master's
Language	Traditional Chinese
Defense Date	2024-07-02
Number of Pages	86
Committee	Advisor - 張志勇 (cychang@mail.tku.edu.tw)
Committee Member - 廖文華
Committee Member - 蒯思齊
Co-advisor - 郭經華 (chkuo@mail.tku.edu.tw)
Keywords (Chinese)	ESG
生成式AI
影片生成
多模態生成
檢索增強生成
短影音行銷
Keywords (English)	ESG
Generative AI
Video Generation
Multimodal Generation
Retrieval-Augmented Generation
Short Video Marketing
Chinese Abstract
隨著企業對環境、社會及公司治理(ESG)議題的重視日益增加,如何有效傳遞ESG理念並提升永續企業的市場競爭力已成為一項緊迫的課題。近年來,行銷模式經歷了顯著變革,消費者不僅僅是被動地按讚和分享,更能夠透過商家的獎勵機制,主動參與行銷影片的製作。這種由消費者自發性為綠色商家行銷的模式,能夠以更快的速度和更廣的範圍觸及目標受眾。然而,消費者在嘗試製作精準傳達ESG理念的行銷影片時,往往面臨語境資料不足、影片故事性欠缺,以及視覺內容無法充分體現ESG概念等挑戰。為因應這些挑戰,本論文旨在開發一個「生成式AI驅動的ESG永續企業行銷短影片自動生成系統」,藉由生成式AI技術在行銷與數位內容創作中的潛力,協助廣大消費者輕鬆參與綠色商家的ESG行銷影片創作。該系統包含六個主要階段:檢索增強生成(RAG)、行銷影片腳本生成、ESG關鍵詞標註、影片畫面生成、旁白語音生成以及背景音樂生成。
首先,本系統將優化行銷影片腳本的生成過程。透過微調語言模型,系統強化了行銷影片的故事情境、製作理論及ESG專業知識。接著,系統透過檢索增強生成技術(RAG),動態檢索企業的ESG報告書內容作為參考資料,並將其融入影片腳本的生成過程中,使最終生成的腳本包含詳細的故事情節描述、逐字稿及分鏡圖。在影片畫面生成部分,本研究訓練基於CLIP模型架構的ESG關鍵詞標註器,對消費者輸入的圖片進行描述,並對報告書及消費者提供的圖片和影片進行半自動化標註,這些標註資料將用於微調SDXL模型,以生成符合ESG相關聯的精緻圖片。圖片將進一步透過AnimateDiff技術製作成動畫,形成連貫的動態視覺內容。在旁白語音與背景音樂的生成方面,本研究使用Transformer TTS技術,不僅能生成自然流暢的旁白語音,確保影片內容能夠清晰、準確地傳遞資訊,還能根據影片的主題和故事情境生成風格高度契合的背景音樂。
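The script-generation stage builds on LoRA fine-tuning of the TAIDE LLM (per the Chapter 4 outline). The core idea of LoRA — freezing the pretrained weight and training only a low-rank additive update — can be sketched as follows; the NumPy stand-in and toy dimensions are illustrative assumptions, not the thesis's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained weight (toy size; real LLM layers are far larger).
d, r, alpha = 6, 2, 4
W = rng.normal(size=(d, d))

# LoRA adds a trainable low-rank update (alpha/r) * B @ A; B starts at
# zero, so the adapted layer initially matches the pretrained one exactly.
A = rng.normal(size=(r, d))
B = np.zeros((d, r))

def adapted_forward(x):
    """Forward pass through the LoRA-adapted layer: (W + (alpha/r) B A) x."""
    return (W + (alpha / r) * B @ A) @ x

x = rng.normal(size=d)
same_as_pretrained = np.allclose(adapted_forward(x), W @ x)
```

Only `A` and `B` (here 2 x 6 and 6 x 2) would receive gradients, which is why LoRA makes fine-tuning a 7B-parameter model tractable.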
在影片質量評估方面,本研究在Zero-Shot環境下引入Fréchet Video Distance (FVD)和CLIPSIM指標,分別量化影片品質及畫面與文本描述的語義匹配度。實驗結果顯示,所提出的系統在Zero-Shot環境下的表現優於現有同性質研究,表明本研究的系統能夠生成質感逼真且語義匹配度高的影片。本研究透過在系統設計、微調策略及訓練資料選擇上的創新與優化,顯示出生成式AI在ESG行銷影片創作中的巨大潛力,為未來的ESG行銷影片生成研究提供了實踐指引。
English Abstract
As companies increasingly emphasize Environmental, Social, and Governance (ESG) issues, effectively conveying ESG principles and enhancing the market competitiveness of sustainable enterprises have become pressing challenges. In recent years, marketing strategies have undergone a significant transformation: consumers are no longer merely passive participants who like and share content, but can actively take part in creating marketing videos through incentive mechanisms offered by businesses. This consumer-driven approach, in which users voluntarily promote green businesses, can reach target audiences more quickly and broadly. However, when consumers attempt to create marketing videos that accurately convey a green business's ESG principles, they often face challenges such as insufficient contextual data, weak narratives, and visual content that fails to adequately represent ESG concepts. To address these challenges, this thesis develops a Generative AI-Driven Automated Marketing Short Video Production System for ESG Sustainable Enterprises. By leveraging the potential of generative AI in marketing and digital content creation, the system helps consumers easily participate in creating ESG marketing videos for green businesses. The system comprises six major phases: Retrieval-Augmented Generation (RAG), marketing video script generation, ESG keyword annotation (ESG Keyword CLIP), video screen generation, narration voice generation, and background music generation.
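The RAG phase retrieves relevant ESG-report passages to ground the script generator. A minimal sketch of the retrieval step, assuming dense passage embeddings (the thesis uses SBERT with a vector database, per Chapter 4; the hand-made vectors below merely stand in for encoder output):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_top_k(query_vec: np.ndarray, passage_vecs: list, k: int = 2) -> list:
    """Return indices of the k passages most similar to the query."""
    scores = [cosine_sim(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy stand-ins for SBERT embeddings of ESG-report passages.
passages = [np.array([1.0, 0.1, 0.0]),   # e.g. a carbon-reduction paragraph
            np.array([0.0, 1.0, 0.2]),   # e.g. a governance paragraph
            np.array([0.9, 0.2, 0.1])]   # another environmental paragraph
query = np.array([1.0, 0.0, 0.0])        # e.g. a query about carbon emissions

top = retrieve_top_k(query, passages, k=2)  # → [0, 2]
```

The retrieved passages are then placed in the LLM prompt, which is how the generated script comes to cite details from a specific company's ESG report.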
Initially, the system focuses on optimizing the generation of marketing video scripts. By fine-tuning language models, the system enhances the narrative context, theoretical knowledge of video production, and expertise in ESG. Subsequently, the system utilizes Retrieval-Augmented Generation (RAG) to dynamically retrieve ESG report content from companies as reference material, integrating it into the video script generation process. This ensures that the final script includes detailed story descriptions, transcripts, and storyboards. In the video content generation phase, this study trains an ESG keyword annotator based on the CLIP model architecture to describe images provided by consumers and to semi-automatically annotate images and videos from reports and consumer submissions. These annotated data are then used to fine-tune the SDXL model, generating high-quality images aligned with ESG themes. These images are further transformed into coherent dynamic visual content using AnimateDiff technology. For voiceover and background music generation, this study employs Transformer TTS technology to produce natural and fluent voiceovers, ensuring that the video content is conveyed clearly and accurately. Additionally, background music is generated to match the video's theme and narrative context, ensuring a high degree of stylistic coherence.
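The ESG keyword annotator follows CLIP's zero-shot labeling recipe: embed the image and each candidate keyword, then compare them by cosine similarity and normalize with a softmax. A toy sketch, in which hand-made vectors stand in for CLIP's encoders and the keyword list is an illustrative assumption:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - np.max(x))
    return e / e.sum()

def tag_image(image_vec, keyword_vecs, keywords):
    """Score each ESG keyword against the image embedding (CLIP-style:
    cosine similarity, then a softmax over the candidate labels)."""
    sims = np.array([
        np.dot(image_vec, k) / (np.linalg.norm(image_vec) * np.linalg.norm(k))
        for k in keyword_vecs
    ])
    probs = softmax(sims * 100.0)  # CLIP applies a learned logit scale (~100)
    return keywords[int(np.argmax(probs))], probs

keywords = ["renewable energy", "employee welfare", "board governance"]
# Toy embeddings standing in for the image and text encoders.
keyword_vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
image_vec = np.array([0.95, 0.1])  # say, a photo of solar panels

label, probs = tag_image(image_vec, keyword_vecs, keywords)  # label → "renewable energy"
```

Captions produced this way can then serve as semi-automatic annotations for fine-tuning the text-to-image model, as the abstract describes.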
In terms of video quality assessment, this study introduces the Fréchet Video Distance (FVD) and CLIPSIM metrics in a Zero-Shot environment to quantify video quality and assess the semantic alignment between the visual content and its textual description. Experimental results demonstrate that the proposed system outperforms existing comparable studies in a Zero-Shot environment, indicating that the system can generate videos with realistic texture and high semantic alignment. Through innovations and optimizations in system design, fine-tuning strategies, and training data selection, this study showcases the immense potential of generative AI in the creation of ESG marketing videos, providing practical guidance for future research in ESG marketing video generation.
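CLIPSIM is commonly computed as the average CLIP-space cosine similarity between each generated frame and the text prompt. A minimal sketch, with toy embeddings in place of CLIP's image and text encoders:

```python
import numpy as np

def clipsim(frame_embs, text_emb) -> float:
    """CLIPSIM: mean cosine similarity between each video-frame
    embedding and the text-prompt embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    sims = [np.dot(f / np.linalg.norm(f), t) for f in frame_embs]
    return float(np.mean(sims))

# Toy embeddings: one prompt, three generated frames.
text = np.array([1.0, 0.0])
frames = [np.array([1.0, 0.0]), np.array([0.8, 0.6]), np.array([0.6, 0.8])]

score = clipsim(frames, text)  # → 0.8
```

A higher score indicates frames that stay semantically closer to the prompt; FVD, by contrast, compares feature distributions of generated and real videos to judge visual quality.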
Table of Contents
Acknowledgments	I
Table of Contents	VII
List of Figures	X
List of Tables	XII
Chapter 1  Introduction	1
1-1 Background	1
1-2 Review of Related Work	2
1-3 Positioning of This Thesis	6
1-4 Research Methods	8
1-5 Research Contributions	10
Chapter 2  Related Work	12
2-1 Generation and Integration of Multimodal Content	12
2-1-1 Natural Language Generation (NLG)	12
2-1-2 Visual Content Generation	14
2-1-3 Audio Generation	17
2-1-4 Multimodal Generation	19
Chapter 3  Background Knowledge	24
Transformer [3]	24
SBERT (Sentence-BERT) [34]	26
GPT-4 [35]	27
RAG [45]	29
CLIP [25]	30
SDXL (Stable Diffusion XL) [15]	31
AnimateDiff [36]	32
Transformer TTS [24]	33
Chapter 4  System Design	35
4-1 Data Collection and Preprocessing	36
4-1-1 Data Collection	36
4-1-2 Data Preprocessing	38
4-2 Retrieval-Augmented Generation (RAG)	44
4-2-1 Building the SBERT Vector Encoder	45
4-2-2 Building the Vector Database	47
4-3 Marketing Video Script Generation	48
4-3-1 Constructing the LLM Instruction Fine-tuning Dataset	49
4-3-2 LoRA Fine-Tuning of the TAIDE LLM	55
4-3-3 Chain-of-Thought Prompting Generation	57
4-4 ESG Keyword Annotator (ESG Keyword CLIP)	58
4-5 Video Screen Generation	60
4-5-1 Text-to-Image Generation System	60
4-5-2 Image–Text Matching System	62
4-5-3 Dynamic Frame Generation	63
4-6 Narration Voice Generation	65
4-7 Background Music Generation	66
4-7-1 Music Style Description	66
4-7-2 Background Music Generation	66
Chapter 5  Experimental Analysis	68
5-1 Datasets	68
5-2 Environment and System Parameter Settings	69
5-3 Experimental Results	69
5-3-1 Marketing Video Script Generation Capability	69
5-3-2 Realization of ESG Themes in Video Frames	75
5-3-3 Video Quality Assessment	77
Chapter 6  Conclusion	80
References	82

List of Figures
Figure 1. System architecture of this thesis	8
Figure 2. Transformer model architecture	25
Figure 3. Sentence-BERT model architecture	26
Figure 4. GPT-4 internal factuality evaluation results across nine categories	28
Figure 5. RAG method architecture	29
Figure 6. How CLIP works	30
Figure 7. SDXL model architecture	31
Figure 8. AnimateDiff training pipeline	33
Figure 9. Transformer TTS model architecture	34
Figure 10. System architecture of this thesis	35
Figure 11. Text normalization	39
Figure 12. Table data extraction and analysis	41
Figure 13. Image data extraction and description	42
Figure 14. ESG keyword lexicon	44
Figure 15. SBERT vector encoder training	45
Figure 16. Label mechanism for strengthening report-content relevance	46
Figure 17. Building the vector database	47
Figure 18. Consumer usage scenario	48
Figure 19. Construction of fine-tuning data for marketing video production theory	51
Figure 20. Construction of fine-tuning data for YouTube marketing video narrative design	53
Figure 21. Construction of fine-tuning data for ESG domain knowledge	54
Figure 22. LoRA fine-tuning architecture for the TAIDE-LX-7B LLM	55
Figure 23. Training architecture of the marketing video script generation model	56
Figure 24. Extraction of corporate ESG content	57
Figure 25. Marketing strategy formulation	58
Figure 26. Step-by-step generation of the marketing video script	58
Figure 27. Training architecture of the ESG keyword annotator	59
Figure 28. LoHa fine-tuning architecture for the SDXL model	61
Figure 29. Training architecture of the text-to-image generation system	62
Figure 30. Training architecture of the image–text matching system	63
Figure 31. Training architecture for dynamic frame generation	64
Figure 32. Example music style description	66
Figure 33. Loss curves of the marketing video script generation model	71
Figure 34. Three-axis GPT-4 scoring chart	73
Figure 35. Comparison before and after model fine-tuning	74
Figure 36. Video generation samples combining WebVid-10M text with ESG keywords	76

List of Tables
Table 1. Comparison of related work on marketing video script generation	22
Table 2. Comparison of related work on visual content generation	23
Table 3. Comparison of related work on multimodal generation	23
Table 4. System experimental environment	69
Table 5. Parameter settings for marketing video script generation	70
Table 6. Industry classification of Taiwan listed and OTC companies	72
Table 7. LoHa fine-tuning parameters for Stable-Diffusion-XL-Base-1.0	75
Table 8. Comparison of FVD and CLIPSIM metrics in the Zero-Shot setting	79
References
[1]	Rumelhart, David E. et al. “Learning representations by back-propagating errors.” Nature 323 (1986): 533-536.
[2]	Hochreiter, Sepp and Jürgen Schmidhuber. “Long Short-Term Memory.” Neural Computation 9 (1997): 1735-1780.
[3]	Vaswani, Ashish et al. “Attention is All you Need.” Neural Information Processing Systems (2017).
[4]	Radford, Alec and Karthik Narasimhan. “Improving Language Understanding by Generative Pre-Training.” (2018).
[5]	Devlin, Jacob et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” North American Chapter of the Association for Computational Linguistics (2019).
[6]	Radford, Alec et al. “Language Models are Unsupervised Multitask Learners.” (2019).
[7]	Brown, Tom B. et al. “Language Models are Few-Shot Learners.” ArXiv abs/2005.14165 (2020).
[8]	Kingma, Diederik P. and Max Welling. “Auto-Encoding Variational Bayes.” CoRR abs/1312.6114 (2013).
[9]	Goodfellow, Ian J. et al. “Generative adversarial networks.” Communications of the ACM 63 (2014): 139 - 144.
[10]	Karras, Tero et al. “Progressive Growing of GANs for Improved Quality, Stability, and Variation.” ArXiv abs/1710.10196 (2017).
[11]	Brock, Andrew et al. “Large Scale GAN Training for High Fidelity Natural Image Synthesis.” ArXiv abs/1809.11096 (2018).
[12]	Karras, Tero et al. “A Style-Based Generator Architecture for Generative Adversarial Networks.” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018): 4396-4405.
[13]	Ho, Jonathan et al. “Denoising Diffusion Probabilistic Models.” ArXiv abs/2006.11239 (2020).
[14]	Rombach, Robin et al. “High-Resolution Image Synthesis with Latent Diffusion Models.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021): 10674-10685.
[15]	Podell, Dustin et al. “SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis.” ArXiv abs/2307.01952 (2023).
[16]	Peebles, William S. and Saining Xie. “Scalable Diffusion Models with Transformers.” 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2022): 4172-4182.
[17]	Liu, Yixin et al. “Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models.” ArXiv abs/2402.17177 (2024).
[18]	Dosovitskiy, Alexey et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” ArXiv abs/2010.11929 (2020).
[19]	Black, Alan W. and Paul A. Taylor. “Automatically clustering similar units for unit selection in speech synthesis.” EUROSPEECH (1997).
[20]	Rabiner, Lawrence R.. “A tutorial on hidden Markov models and selected applications in speech recognition.” Proc. IEEE 77 (1989): 257-286.
[21]	Oord, Aäron van den et al. “WaveNet: A Generative Model for Raw Audio.” Speech Synthesis Workshop (2016).
[22]	Donahue, Chris et al. “Adversarial Audio Synthesis.” International Conference on Learning Representations (2018).
[23]	Kaneko, Takuhiro et al. “Cyclegan-VC2: Improved Cyclegan-based Non-parallel Voice Conversion.” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019): 6820-6824.
[24]	Li, Naihan et al. “Neural Speech Synthesis with Transformer Network.” AAAI Conference on Artificial Intelligence (2018).
[25]	Radford, Alec et al. “Learning Transferable Visual Models From Natural Language Supervision.” International Conference on Machine Learning (2021).
[26]	Li, Junnan et al. “Align before Fuse: Vision and Language Representation Learning with Momentum Distillation.” Neural Information Processing Systems (2021).
[27]	Wang, Wenhui et al. “VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts.” ArXiv abs/2111.02358 (2021).
[28]	Li, Junnan et al. “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation.” International Conference on Machine Learning (2022).
[29]	Zheng, Kaizhi et al. “MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens.” ArXiv abs/2310.02239 (2023).
[30]	Zhang, Yiyuan et al. “Meta-Transformer: A Unified Framework for Multimodal Learning.” ArXiv abs/2307.10802 (2023).
[31]	Hu, J. Edward et al. “LoRA: Low-Rank Adaptation of Large Language Models.” ArXiv abs/2106.09685 (2021).
[32]	“Taide/Taide-LX-7B · Hugging Face.” Taide/TAIDE-LX-7B · Hugging Face, huggingface.co/taide/TAIDE-LX-7B.
[33]	He, Kaiming et al. “Deep Residual Learning for Image Recognition.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015): 770-778.
[34]	Reimers, Nils and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” Conference on Empirical Methods in Natural Language Processing (2019).
[35]	Achiam, OpenAI Josh et al. “GPT-4 Technical Report.” (2023).
[36]	Guo, Yuwei et al. “AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning.” ArXiv abs/2307.04725 (2023).
[37]	“Twse 公司治理中心.” TWSE 公司治理中心, cgc.twse.com.tw/.
[38]	“上市櫃公司永續報告書.” TWSE 公司治理中心, cgc.twse.com.tw/front/chPage.
[39]	Google Maps, Google, maps.google.com/.
[40]	“The Global Leader for Impact Reporting.” GRI - Home, www.globalreporting.org/. 
[41]	“SASB Standards Overview.” SASB, 15 Aug. 2023, sasb.ifrs.org/standards.
[42]	PatrickFarley. “OCR - Optical Character Recognition - Azure AI Services.” Azure AI Services | Microsoft Learn, learn.microsoft.com/en-us/azure/ai-services/computer-vision/overview-ocr. Accessed 18 Aug. 2024. 
[43]	Jsvine. “Jsvine/Pdfplumber: Plumb a PDF for Detailed Information about Each Char, Rectangle, Line, et Cetera - and Easily Extract Text and Tables.” GitHub, github.com/jsvine/pdfplumber.
[44]	Betker, James et al. “Improving Image Generation with Better Captions.” (2023).
[45]	Lewis, Patrick et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” ArXiv abs/2005.11401 (2020).
[46]	Draganov, Andrew et al. “The Hidden Pitfalls of the Cosine Similarity Loss.” ArXiv abs/2406.16468 (2024).
[47]	“Vector Database.” Qdrant, qdrant.tech/.
[48]	Peng, Baolin et al. “Instruction Tuning with GPT-4.” ArXiv abs/2304.03277 (2023).
[49]	Instruction-Tuning-with-GPT-4. “Instruction-Tuning-with-GPT-4/GPT-4-LLM: Instruction Tuning with GPT-4.” , 2023, GitHub, github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. 
[50]	Rohan Taori, et al. “Stanford Alpaca: An Instruction-following LLaMA model”, 2023, Github, https://github.com/tatsu-lab/stanford_alpaca.
[51]	Kelley, L. D., Jugenheimer, D. W., & Sheehan, K. (2012). Advertising media planning: A brand management approach. Armonk, N.Y.: M.E. Sharpe.
[52]	陳萬達。媒體企劃:跨媒體行銷趨勢與傳播策略。新北市:威仕曼文化事業股份有限公司, 2012。
[53]	戴國良。圖解式成功撰寫行銷企劃案(三版)。台北市:書泉出版社, 2012。
[54]	王彩雲。行銷傳播產業100問。台北市:動腦傳播股份有限公司, 2012。
[55]	明石岳人。短影音聖經: 社群行銷100鐵則, 絕對瘋傳又賣爆! 野人文化出版 : 遠足文化發行, 2024。
[56]	Lessard, Tyler. 最強影片行銷71堂課: 紐約時報讚譽的網路行銷大師, 教你完美運用影片拓展銷售藍海. 城邦文化事業股份有限公司-商業周刊, 2021.
[57]	台積電。“台積電 - 創造一個更智慧,更美麗的新世界” YouTube , uploaded by 台積電, 12 June 2024, https://youtu.be/EQ2k9QTE-Nk?si=phNCIaH6rAOBXmSt.
[58]	鴻海。“ESG回顧/鴻海永續經營發展 邁向未來無限可能” YouTube, uploaded by 鴻海, 28 June 2023, https://youtu.be/rNEbEWsusMo?si=QJhtk9XquCo29tfy.
[59]	中華電信。“中華電信 |永續形象影片 攜手篇” YouTube , uploaded by 中華電信, 22 April 2024, https://youtu.be/7dgOOVa1-ME?si=hVCwM1tuPh6wKUFP.
[60]	王品集團。“王品集團|2024年集團歌” YouTube , uploaded by 王品集團, 4 June 2024, https://youtu.be/faEmLaRNy4s?si=J6L0t8NE_zx01vbg.
[61]	經貿!了解一下。“不減碳就沒訂單!「碳盤查」檢視公司碳排放熱點【ESG永續台灣】EP01” YouTube , uploaded by 經貿!了解一下, 20 January 2022, https://youtu.be/lJ_oJPsZQnY?si=fItJBjfv942p3K5i.
[62]	經貿!了解一下。“全球碳排壁壘四起! 台灣「氣候變遷因應法」來了!【ESG永續台灣】EP02” YouTube , uploaded by 經貿!了解一下, 14 April 2022, https://youtu.be/SrA900WEFxE?si=Ehxp368i0l-qiOO9.
[63]	經貿!了解一下。“還在睡? 企業主的緊急要事!台灣綠電交易正在發生中【ESG永續台灣】EP03” YouTube , uploaded by 經貿!了解一下, 23 June 2022, https://youtu.be/EqSKJDohVzE?si=LVe0FVUg7B2s4McC.
[64]	經貿!了解一下。“碳排變綠金!正港欸台灣技術現在國外搶著問-揭秘「成大負碳工廠」【ESG永續台灣】EP04” YouTube , uploaded by 經貿!了解一下, 27 October 2022, https://youtu.be/HKiO0we__qs?si=m_DMxOBNqJ2sPnpX.
[65]	經貿!了解一下。“ESG 沒有想像中燒錢,執行 ESG 過程中可能帶來更多利益 【ESG永續台灣】EP05” YouTube , uploaded by 經貿!了解一下, 24 November 2022, https://youtu.be/CHNGCb1RH4Q?si=_CcVcW3IpIGsb_cL.
[66]	經貿!了解一下。“ESG人才需求大爆發!斜槓的你! 會不會剛好就是ESG人才?【ESG永續台灣】EP06” YouTube , uploaded by 經貿!了解一下, 16 March 2023, https://youtu.be/pxQDvhucYOs?si=riYcQrNl6FUzhBDo.
[67]	經貿!了解一下。“還不知道臺灣碳權交易所就落伍了!必須知道的碳權知識總整理 【ESG永續台灣】EP07” YouTube , uploaded by 經貿!了解一下, 26 October 2023, https://youtu.be/MWNnAVvfXjI?si=zt7BWZgObi-PgEjb.
[68]	經貿!了解一下。“全民減碳可行嗎?從陽光伏特加看綠電商機的潛力和發展性【ESG永續台灣】EP08” YouTube , uploaded by 經貿!了解一下, 14 December 2023, https://youtu.be/H7eNXoIYdRQ?si=91546QDjFQcSUXlk.
[69]	經貿!了解一下。“一杯水剝離金、錫,躍升科技大廠愛用綠色製造鏈-ESG永續台灣 EP10” YouTube , uploaded by 經貿!了解一下, 13 June 2024, https://youtu.be/xj8iTRfP8BI?si=k__WQsahEG8r4Emw.
[70]	“Common Crawl - Open Repository of Web Crawl Data.” Common Crawl - Open Repository of Web Crawl Data, commoncrawl.org/.
[71]	Wei, Jason et al. “Chain of Thought Prompting Elicits Reasoning in Large Language Models.” ArXiv abs/2201.11903 (2022).
[72]	Kojima, Takeshi et al. “Large Language Models are Zero-Shot Reasoners.” ArXiv abs/2205.11916 (2022).
[73]	Yao, Shunyu et al. “ReAct: Synergizing Reasoning and Acting in Language Models.” ArXiv abs/2210.03629 (2022).
[74]	“Loha.” LoHa, huggingface.co/docs/peft/v0.7.1/en/package_reference/loha.
[75]	Hyeon-Woo, Nam et al. “FedPara: Low-rank Hadamard Product for Communication-Efficient Federated Learning.” International Conference on Learning Representations (2021).
[76]	Zhang, Lvmin et al. “Adding Conditional Control to Text-to-Image Diffusion Models.” 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023): 3813-3824.
[77]	RVC-Boss. “RVC-Boss/GPT-Sovits: 1 Min Voice Data Can Also Be Used to Train a Good TTS Model! (Few Shot Voice Cloning).” GitHub, github.com/RVC-Boss/GPT-SoVITS. Accessed 22 Aug. 2024.
[78]	Suno-Ai. “Suno-Ai/Bark: Text-Prompted Generative Audio Model.” GitHub, github.com/suno-ai/bark. Accessed 22 Aug. 2024.
[79]	Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. In ICCV.
[80]	Unterthiner, Thomas et al. “Towards Accurate Generative Models of Video: A New Metric & Challenges.” ArXiv abs/1812.01717 (2018).
[81]	Radford, Alec et al. “Learning Transferable Visual Models From Natural Language Supervision.” International Conference on Machine Learning (2021).
[82]	He, Yin-Yin et al. “Latent Video Diffusion Models for High-Fidelity Long Video Generation.” (2022).
[83]	Luo, Zhengxiong et al. “VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation.” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023): 10209-10218.
[84]	Zhu, Junchen et al. “MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images.” Proceedings of the 31st ACM International Conference on Multimedia (2023).
[85]	Wu, Chenfei et al. “GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions.” ArXiv abs/2104.14806 (2021).
[86]	Singer, Uriel et al. “Make-A-Video: Text-to-Video Generation without Text-Video Data.” ArXiv abs/2209.14792 (2022).
Thesis Full-Text Access Permissions
National Central Library
Gratis license to the National Central Library not granted
On campus
Print copy released on campus immediately
License for electronic full text not granted
Bibliographic record released on campus immediately
Off campus
License to database vendors not granted
Bibliographic record released off campus immediately
