System ID | U0002-3007202009491100 |
---|---|
DOI | 10.6846/TKU.2020.00911 |
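Title (Chinese) | 基於注意力生成對抗網路的中文轉高解析度影像系統 |
Title (English) | A Chinese Text to High-Resolution Image Synthesis System Based on Attentional Generative Adversarial Networks |
Title (third language) | |
University | 淡江大學 (Tamkang University) |
Department (Chinese) | 電機工程學系碩士班 |
Department (English) | Department of Electrical and Computer Engineering |
Foreign degree: university | |
Foreign degree: college | |
Foreign degree: institute | |
Academic year | 108 (ROC calendar) |
Semester | 2 |
Year of publication | 109 (ROC calendar; 2020) |
Author (Chinese) | 張博涵 |
Author (English) | Bo-Han Zhang |
Student ID | 607450128 |
Degree | Master's |
Language | Traditional Chinese |
Second language | |
Oral defense date | 2020-07-11 |
Pages | 47 |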
Oral defense committee | Advisor - 李維聰 (wtlee@mail.tku.edu.tw); Member - 朱國志 (kcchu@mail.lhu.edu.tw); Member - 衛信文 (hwwei@mail.tku.edu.tw); Member - 李維聰 (wtlee@mail.tku.edu.tw) |
Keywords (Chinese) |
人工智慧; 生成對抗網路; 注意力機制; 高解析度 |
Keywords (English) |
artificial intelligence; generative adversarial networks; attention mechanism; high resolution |
Keywords (third language) | |
Subject classification | |
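Abstract (translated from the Chinese) |
In recent years, the development of artificial intelligence has produced many applications in image generation, such as image synthesis, image style transfer, and high-quality image generation, yet applications that generate images from Chinese sentences remain rare. In addition, conventional generative adversarial networks tend to produce blurry features in fine details. Adding an attention mechanism to a generative adversarial network lets each word in the sentence be matched to details in the image, such as color and body structure, so that the generated image is not merely a rough match but is clear and faithful to the sentence. This thesis therefore combines an attention mechanism with a generative adversarial network to convert sentences written in Chinese into high-resolution images. The attentional generative adversarial network in this thesis uses Google's InceptionResNetV2 as the architecture of the image encoder and a bidirectional GRU as the text encoder for training. The model has three generators and three discriminators. The three generators produce images of 64*64*3, 128*128*3, and 256*256*3 respectively, a design that uses lower-quality images to generate higher-quality ones. The three discriminators each score one of the generators, and each score has two parts: whether the image is real and whether the image matches the sentence description. The result is images with high resolution and fine detail. This thesis uses Inception scores for evaluation; the Inception score indicates whether a model generates images that are both diverse and realistic. The experimental results show that, with English sentences, the proposed model GraGAN achieves an Inception score of 4.35, compared with 4.33 for AttnGAN and 4.08 for StackGAN-v2; with Chinese sentences, GraGAN scores 4.31 and AttnGAN 4.26. After the text encoder and image encoder are adjusted, the Inception scores exceed those of AttnGAN in both cases. The scores for Chinese sentences are slightly lower because, unlike English words, individual Chinese characters do not always carry an independent meaning. For example, 螳螂 (mantis) is meaningful only when 螳 and 螂 are combined, and the two characters are meaningless when separated. English does not have this problem, so a score slightly below that of the English-sentence model is acceptable. StackGAN-v2 generates high-resolution images, but its fine details are somewhat inferior to those produced in this thesis, so its score is clearly lower. The contribution of this thesis is converting the dataset into Chinese sentences, preprocessing the Chinese sentences, and adding bidirectional GRU pre-training and InceptionResNetV2 encoding to reduce training time and computation. The experimental results demonstrate that a dataset of Chinese sentences can also achieve good results with models in the generative adversarial network family. |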
Abstract (English) |
In the era of artificial intelligence, many applications of image generation have been proposed, e.g., image synthesis, image style transfer, and high-quality image generation. However, applications that generate images from Chinese sentences are quite rare. In addition, conventional generative adversarial networks often produce blurry details when generating images. Adding an attention mechanism to a generative adversarial network allows each word in the sentence to be matched to details in the image, such as color and body structure. As a result, the system generates an image that does not merely conform roughly to the sentence but is clear and faithful to it. Therefore, this thesis combines the attention mechanism with a generative adversarial network to convert Chinese sentences into high-resolution images. A model called GraGAN is proposed in this thesis. GraGAN uses Google's InceptionResNetV2 as the image encoder architecture and a bidirectional GRU as the text encoder for training. The model has three generators and three discriminators. The three generators produce images of 64*64*3, 128*128*3, and 256*256*3 respectively, so that low-quality images are used to generate high-quality images. The three discriminators evaluate the three generators separately. Each discriminator's evaluation has two parts: whether the image is real and whether the image matches the sentence description. Finally, images with high resolution and fine detail can be generated. This thesis uses Inception scores to evaluate the quality of image generation. The Inception score reflects the diversity and realism of the generated images. The experimental results show that, when training with English sentences, the Inception score of GraGAN is 4.35, compared with 4.33 for AttnGAN and 4.08 for StackGAN-v2. When training with Chinese sentences, the Inception score of GraGAN is 4.31 and that of AttnGAN is 4.26. After the text encoder and image encoder are adjusted, the Inception scores of the proposed model GraGAN are higher than those of AttnGAN. The score for training with Chinese sentences is slightly lower than that for English sentences because Chinese and English words differ. Each English word has its own meaning and can be treated as a token, whereas in Chinese two or three characters may be needed to form a meaningful term. Moreover, StackGAN-v2 generates high-resolution images but lacks detail, so its score is clearly lower than that of this thesis. The contribution of this thesis is to convert the dataset into Chinese sentences, preprocess the Chinese sentences, and add bidirectional GRU pre-training and InceptionResNetV2 encoding to reduce the training time and computational cost of the proposed model GraGAN. The experimental results show that a Chinese dataset, like an English dataset, can achieve good results with generative adversarial networks. |
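Abstract (third language) | |
Table of contents |
Acknowledgments
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Foreword
1.2 Motivation and Objectives
1.3 Thesis Organization
Chapter 2 Background and Related Work
2.1 Generative Adversarial Networks
2.2 Attention Mechanism
2.3 Text to Image
2.3.1 Generative Adversarial Text to Image Synthesis
2.3.2 Stack GANv2
2.3.3 AttnGAN
2.4 InceptionV1~InceptionV3, ResNet, Inception-ResNetV2
Chapter 3 Chinese Text to High-Resolution Image Synthesis Based on Attentional Generative Adversarial Networks
3.1 Flowchart
3.2 Dataset
3.3 Jieba
3.4 From LSTM to GRU
3.5 From InceptionV3 to InceptionResNetV2
Chapter 4 Experimental Results
4.1 Experimental Environment
4.2 Attention on Chinese Sentences and Generated Images
4.3 Correspondence Between Generated Images and Text
4.4 Comparison of Images Generated from English Sentences by Each Model
Chapter 5 Conclusion
5.1 Main Contributions
5.2 Future Work
References
List of Figures:
Figure 2.1 Schematic of a generative adversarial network
Figure 2.2 Schematic of a variational autoencoder
Figure 2.3 (Left) Scaled Dot-Product Attention; (right) Multi-Head Attention
Figure 2.4 Attention Encoder model structure [2]
Figure 2.5 Text-to-image architecture [8]
Figure 2.6 Stack GANv2 architecture
Figure 2.7 Stack GANv2 discriminator network architecture
Figure 2.8 AttnGAN architecture
Figure 2.9 Basic Inception structure
Figure 2.10 Improved Inception architecture
Figure 2.11 InceptionV3 asymmetric convolution layer architecture
Figure 2.12 InceptionV3 architecture (1)
Figure 2.13 InceptionV3 architecture (2)
Figure 2.14 Residual network architecture
Figure 2.15 InceptionResnetV2 architecture
Figure 2.16 Stem network layer
Figure 2.17 Inception-Resnet-A [7]
Figure 2.18 Inception-Resnet-B [7]
Figure 2.19 Inception-Resnet-C [7]
Figure 2.20 Reduction-A network layer [7]
Figure 2.21 Reduction-B network layer [7]
Figure 3.1 Chinese sentence word-segmentation flow
Figure 3.2 Overall architecture of the thesis
Figure 3.3 Caltech-UCSD Birds-200-2011 [21]
Figure 3.4 Jieba word segmentation
Figure 3.5 Architectural differences between RNN and LSTM
Figure 3.6 Overall LSTM architecture
Figure 3.7 Overall GRU architecture
Figure 3.8 Bidirectional GRU
Figure 3.9 Schematic of the 17*17*1088 local features
Figure 4.1 Attention between text and image
Figure 4.2 Generating images from low quality (64*64*3) to high quality (256*256*3)
Figure 4.3 Sentences and their generated images
Figure 4.4 Images generated by GraGAN from Chinese sentences
List of Tables:
Table 2.1 Overall InceptionV3 architecture
Table 3.1 Comparison of model metrics [7]
Table 4.1 Inception Scores for English sentences
Table 4.2 Comparison of GraGAN and AttnGAN with LSTM and GRU in Inception Scores and training time
Table 4.3 Comparison of AttnGAN and this thesis on Chinese sentences |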
References |
[1] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, “Generative Adversarial Networks”, Neural Information Processing Systems, 2014
[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All You Need”, Neural Information Processing Systems, 2017
[3] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, Xiaodong He, “AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks”, Computer Vision and Pattern Recognition, 2017
[4] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, “Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies”, IEEE Press, 2001
[5] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio, “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling”, Neural Information Processing Systems, 2014
[6] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna, “Rethinking the Inception Architecture for Computer Vision”, Computer Vision and Pattern Recognition, 2015
[7] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi, “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning”, Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2016
[8] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee, “Generative Adversarial Text to Image Synthesis”, International Conference on Machine Learning, 2016
[9] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas, “StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018
[10] Diederik P Kingma, Max Welling, “Auto-Encoding Variational Bayes”, International Conference on Learning Representations, 2014
[11] Ilya Sutskever, Oriol Vinyals, Quoc V. Le, “Sequence to Sequence Learning with Neural Networks”, Neural Information Processing Systems, 2014
[12] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin, “Convolutional Sequence to Sequence Learning”, Computer Supported Collaborative Learning, 2017
[13] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas, “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks”, International Conference on Computer Vision, 2017
[14] Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, Tom M. Mitchell, “Zero-shot Learning with Semantic Output Codes”, Neural Information Processing Systems, 2009
[15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, “Going Deeper with Convolutions”, Computer Vision and Pattern Recognition, 2014
[16] Karen Simonyan, Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, Computer Vision and Pattern Recognition, 2015
[17] Xavier Glorot, Antoine Bordes, Yoshua Bengio, “Deep Sparse Rectifier Neural Networks (ReLU)”, Artificial Intelligence and Statistics, PMLR, 2011
[18] Sergey Ioffe, Christian Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, International Conference on Machine Learning, 2015
[19] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, Journal of Machine Learning Research, 2014
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep Residual Learning for Image Recognition”, Computer Vision and Pattern Recognition, 2015
[21] http://www.vision.caltech.edu/visipedia/CUB-200-2011.html
[22] Shane Barratt, Rishi Sharma, “A Note on the Inception Score”, International Conference on Machine Learning, 2018 |
Full-text access rights |
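The abstract compares models by Inception score. As a rough illustration of how that metric is commonly defined, IS = exp(E_x[KL(p(y|x) || p(y))]), the sketch below computes it from per-image class probabilities. The function name `inception_score` and the toy probability lists are assumptions made for illustration; in practice p(y|x) comes from a pretrained Inception classifier applied to generated images, and this sketch does not reproduce the thesis's evaluation code.

```python
import math


def inception_score(probs):
    """Compute exp(mean KL(p(y|x) || p(y))) from class probabilities.

    probs: list of per-image probability lists p(y|x), each summing to 1.
    (Hypothetical helper; a real evaluation would obtain these from a
    pretrained Inception classifier.)
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    total_kl = 0.0
    for p in probs:
        # KL divergence of this image's prediction from the marginal;
        # zero-probability classes contribute nothing to the sum.
        total_kl += sum(
            pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0
        )
    return math.exp(total_kl / n)


# Confident predictions spread evenly over two classes: high score.
diverse = [[1.0, 0.0], [0.0, 1.0]]
# Identical, uninformative predictions: the minimum score of 1.
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(inception_score(diverse), inception_score(uniform))
```

A higher value indicates predictions that are individually confident (realistic images) while covering many classes overall (diverse images), which is why the abstract reads the score as a measure of both properties.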
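The abstract notes that a Chinese term such as 螳螂 is meaningful only when its characters are kept together, which is why Chinese sentences need word segmentation before they reach the text encoder. The table of contents indicates that the thesis uses Jieba for this step. The sketch below is not Jieba's algorithm; it is a simple forward maximum-matching segmenter, with a hypothetical function name and a hypothetical toy vocabulary, intended only to show why dictionary-based grouping matters.

```python
def forward_max_match(sentence, vocab, max_len=4):
    """Greedy dictionary segmentation: at each position take the longest
    vocabulary entry that matches, falling back to a single character.

    This is an illustrative stand-in, not the segmenter used in the thesis.
    """
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then shrink toward one character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"螳螂", "綠色", "翅膀"}

print(forward_max_match("綠色的螳螂", TOY_VOCAB))
```

With the toy vocabulary, 螳螂 is kept as one token; with an empty vocabulary the same call would split it into 螳 and 螂, two tokens that carry no meaning on their own.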