§ Thesis Bibliographic Record
  
System ID U0002-2408202008465900
DOI 10.6846/TKU.2020.00708
Title (Chinese) 雙支架構應用區塊圖片於零次學習
Title (English) Two-Branch Net for Zero Shot Learning Using Patch Features
Title (third language)
University Tamkang University (淡江大學)
Department (Chinese) 資訊工程學系碩士班
Department (English) Department of Computer Science and Information Engineering
Foreign degree school
Foreign degree college
Foreign degree institute
Academic year 108
Semester 2
Publication year 109
Student (Chinese) 許家豪
Student (English) Chia-Hao Hsu
Student ID 607410023
Degree Master's
Language Traditional Chinese
Second language English
Oral defense date 2020-07-07
Pages 42
Committee Advisor - 顏淑惠
Member - 廖弘源
Member - 林慧珍
Keywords (Chinese) 零次學習
區塊圖片特徵
Keywords (English) Zero shot learning
Patch features
Keywords (third language)
Subject classification
Abstract (Chinese)
This thesis uses VGG19 as the backbone model and adds a Squeeze-and-Excitation net and a PatchNet to form a two-branch network. The Squeeze-and-Excitation module strengthens the features channel-wise, while PatchNet is meant to extract different features from different locations of the image through patches, such as region features of an object (e.g., ears, nose, mouth, body). These patch features are combined with the features extracted by VGG19 through a bilinear operation, which relates the features from the two networks to each other, so that a more complete representation of the object is extracted rather than features of only the head or the body. Finally, the method is tested on the AwA2, CUB, and SUN datasets. Although only the AwA2 results are comparable to, and slightly exceed, those of other papers, our main goal is to improve the completeness of object features, and at the end of the thesis heat maps are used to visualize the extracted features and confirm that the proposed method is effective.
Abstract (English)
We propose a new patch net structure for zero-shot learning (ZSL). In addition to the global features extracted by VGG19, the patch net features are intended to capture the overall region of interest. These two features are fused via a bilinear operation, and the fused image feature is mapped to the semantic space by a fully connected layer. The structure adopts only a single cross-entropy loss function, so it is easy to train. According to the experiments, this method extracts more complete features than well-known backbones do on some images, and on a specific dataset it is competitive with other state-of-the-art methods.
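To make the pipeline described in the abstracts concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation. It assumes 224×224 inputs, the 512×7×7 VGG19 feature map, a 4×4 patch grid, and invented sizes for the patch branch (patch_channels=32) and the attribute space (attr_dim); the SE block follows the standard form of [13], and the way patch features are merged and fused bilinearly is a simplification of what Figures 2, 8, and 10 describe.

```python
# Minimal sketch of the two-branch idea in the abstract (assumptions noted above).
import torch
import torch.nn as nn
import torchvision.models as models


class SEBlock(nn.Module):
    """Channel-wise attention in the standard Squeeze-and-Excitation form of [13]."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                            # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))              # squeeze: global average pooling
        return x * w.view(x.size(0), -1, 1, 1)       # excite: rescale each channel


class TBPN(nn.Module):
    """Global VGG19+SE branch and a patch branch, fused bilinearly, mapped to attributes."""
    def __init__(self, attr_dim, patch_grid=4, patch_channels=32):
        super().__init__()
        self.backbone = models.vgg19(weights="IMAGENET1K_V1").features   # (B, 512, 7, 7)
        self.se = SEBlock(512)
        self.patch_grid = patch_grid
        self.patch_net = nn.Sequential(              # small shared conv stack per patch
            nn.Conv2d(3, patch_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(patch_channels, patch_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(7))                 # align spatial size with the global branch
        self.fc = nn.Linear(512 * patch_channels, attr_dim)   # fused feature -> attributes

    def forward(self, x):                            # x: (B, 3, 224, 224)
        g = self.se(self.backbone(x))                # global branch: (B, 512, 7, 7)
        B, _, H, W = x.shape
        ph, pw = H // self.patch_grid, W // self.patch_grid
        patches = (x.unfold(2, ph, ph).unfold(3, pw, pw)       # cut into a 4x4 grid of patches
                     .permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, ph, pw))
        p = self.patch_net(patches)                            # (B*16, C, 7, 7)
        p = p.view(B, -1, p.size(1), 7, 7).mean(dim=1)         # merge patches: (B, C, 7, 7)
        # Bilinear fusion in the spirit of bilinear CNNs [16]: outer products summed over locations.
        fused = torch.einsum("bchw,bdhw->bcd", g, p).flatten(1)
        return self.fc(fused)                                  # predicted attribute vector


# Usage sketch: score each class by similarity to its attribute vector and train with a
# single cross-entropy loss, as the abstract states.
#   model = TBPN(attr_dim=85)                        # e.g. 85 attributes for AwA2
#   scores = model(images) @ class_attributes.t()    # (B, num_classes)
#   loss = torch.nn.functional.cross_entropy(scores, labels)
```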
Abstract (third language)
Thesis contents
Table of Contents
Chapter 1  Introduction  1
1.1  Introduction to zero-shot learning  1
1.2  Method overview  4
1.3  Thesis organization  4
Chapter 2  Literature Review  6
2.1  Related work  6
Chapter 3  Methodology  10
3.1  Problem definition  10
3.2  TBPN architecture  10
3.2.1  Feature Extraction  11
3.2.2  PatchNet  11
3.2.3  Bilinear Operation  11
3.3  Optimizing the TBPN network  12
Chapter 4  Experiments  15
4.1  Datasets  15
4.2  Evaluation metrics  15
4.3  Training details  16
4.4  Experimental results  16
Chapter 5  Conclusion and Future Work  28
5.1  Conclusion  28
5.2  Future work  28
References  29
Appendix: English version of the thesis  33

 
List of Figures
Figure 1. Illustration of bias: blue circles denote seen classes and red circles denote unseen classes; the bias formulation follows [1].  4
Figure 2. The TBPN architecture. The upper branch extracts features with a pre-trained VGG19 followed by an SE-Net (squeeze-and-excitation net) for feature selection; the lower branch cuts the image into 4×4 patches, extracts features with a network composed of several convolutional layers, and concatenates them; the features from the two branches are then combined by a bilinear operation and fed to a fully connected layer that produces the semantic attributes.  5
Figure 3. Illustration of soul samples for the horse class: an image may contain horses facing several directions, so there are multiple soul samples and a feature only needs to resemble one of them; image taken from [4].  6
Figure 4. The episode-training algorithm: each batch samples a fixed number of classes and a fixed number of images per class, and the labels are remapped to that fixed set of class indices (a generic sketch of this sampling appears after this list).  7
Figure 5. Architecture of [10], consisting mainly of ARE and ACSE: ARE learns the attention masks, while ACSE applies second-order pooling to the mask-selected features and the compressed backbone features.  8
Figure 6. Features extracted by ResNet-101.  9
Figure 7. Main structure of the Squeeze-and-Excitation network: the input features are globally average-pooled, passed through a fully connected layer and a sigmoid activation, and then multiplied back onto the input features, giving channel-wise attention (see the sketch after this list).  9
Figure 8. The Patch Block, composed of two convolutional layers with kernel size 3, an SE-Net, and a max-pooling layer (see the sketch after this list).  11
Figure 9. The architecture annotated with the symbols used in the formulas.  12
Figure 10. Architectures of the VGG19 (top) and PatchNet (bottom) baselines. The VGG19 baseline is flattened after the SE layer and the PatchNet baseline is flattened before the dilated convolution, then embedded into the semantic space for classification. In the top network the 25,088-dimensional vector comes from flattening the 512×7×7 feature map; in the bottom network it comes from flattening the 32×28×28 feature map.  18
Figure 11. Heat maps of some seen-class images: blue frames are the original images, red frames are the features extracted by our method, and black frames are those extracted by VGG19.  22
Figure 12. Heat maps of unseen-class images: blue frames are the original images, red frames are the features extracted by our method, and black frames are those extracted by VGG19.  23
Figure 13. Misclassified seen-class images: the strongest responses fall on the background or on other objects rather than on the object to be recognized; blue frames are the original images and red frames are the features extracted by our method.  24
Figure 14. Misclassified unseen-class images: the strongest responses fall on the background or on other objects rather than on the object to be recognized; blue frames are the original images and red frames are the features extracted by our method.  25
Figure 15. Heat maps of unseen-class images extracted by PatchNet: red frames are our method and blue frames are the original images; the heat maps respond strongly on the object itself, but there is also considerable background noise.  26
Figure 16. The regions inside the red boxes are not captured by the VGG19 features but are found by our PatchNet features, so combining the two yields more discriminative features.  27
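The building blocks named in Figures 7, 8, and 10 can be written down concretely. Below is a rough PyTorch sketch under stated assumptions: the SE layer is taken literally from the Figure 7 caption (one fully connected layer followed by a sigmoid), the Patch Block follows the Figure 8 caption, the 32 channels and 56×56 patch size are assumptions consistent with the 32×28×28 shape quoted in Figure 10, and the dilated convolutions that Figure 10 mentions are omitted.

```python
# Rough sketch of the SE layer (Figure 7) and the Patch Block (Figure 8); not the thesis code.
import torch
import torch.nn as nn


class SELayer(nn.Module):
    """Figure 7: global average pool -> fully connected -> sigmoid -> rescale each channel."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):                              # x: (B, C, H, W)
        w = torch.sigmoid(self.fc(x.mean(dim=(2, 3))))
        return x * w.view(x.size(0), -1, 1, 1)         # channel-wise attention


class PatchBlock(nn.Module):
    """Figure 8: two 3x3 convolutions, an SE layer, and one max-pooling layer."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            SELayer(out_ch),
            nn.MaxPool2d(2))

    def forward(self, x):
        return self.block(x)


# Shape check against Figure 10: both branches flatten to the same 25,088 dimensions.
assert 512 * 7 * 7 == 32 * 28 * 28 == 25088
patch = torch.randn(1, 3, 56, 56)                      # one patch of a 224x224 image cut 4x4
print(PatchBlock(3, 32)(patch).shape)                  # torch.Size([1, 32, 28, 28])
```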
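Figure 4's episode training reduces to a simple sampling rule. The routine below is a generic sketch of that rule (a fixed number of classes per batch, a fixed number of images per class, labels remapped to consecutive indices), written for illustration rather than taken from the thesis; the names n_way and k_shot are our own.

```python
# Generic sketch of the episode sampling described in Figure 4 (not the thesis implementation).
import random
from collections import defaultdict


def sample_episode(samples, labels, n_way=5, k_shot=4):
    """Draw n_way classes and k_shot images per class; relabel classes as 0..n_way-1."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    chosen = random.sample(sorted(by_class), n_way)     # a fixed number of classes per batch
    episode, new_labels = [], []
    for new_id, cls in enumerate(chosen):               # remap labels to 0..n_way-1
        for x in random.sample(by_class[cls], k_shot):  # a fixed number of images per class
            episode.append(x)
            new_labels.append(new_id)
    return episode, new_labels


# Example: a 3-way, 2-shot episode drawn from a toy labelled set of 30 items in 5 classes.
xs = list(range(30))
ys = [i % 5 for i in range(30)]
ep_x, ep_y = sample_episode(xs, ys, n_way=3, k_shot=2)
print(ep_y)   # e.g. [0, 0, 1, 1, 2, 2]
```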

 
List of Tables
Table 1. Different training settings for zero-shot learning.  2
Table 2. Different testing settings for zero-shot learning.  3
Table 3. Number of seen and unseen classes and the attribute dimension for each dataset.  15
Table 4. Effect of different backbones on each dataset. Bold indicates the best accuracy; Ms stands for MCAs, Mus for MCAus, and Har for the harmonic mean (the metric definitions are restated after this list).  17
Table 5. Ablation results. TBPN is our two-branch method; the baselines are the single-branch networks obtained by separating the upper and lower branches as in Figure 10.  18
Table 6. Comparison with other papers. Bold indicates the best accuracy; the evaluation follows the protocol of [17].  19
Table 7. Differences in the training settings used with the proposed architecture.  20
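For reference alongside Tables 4 and 6: under the evaluation protocol of [17] that the thesis follows, Ms and Mus are the per-class mean accuracies over the seen and unseen test classes, and Har is their harmonic mean. A restatement of those standard definitions (the notation here is ours, not the thesis's):

```latex
% Per-class mean accuracy over a class set C and the harmonic mean used in generalized ZSL,
% following the protocol of [17]; MCA_s / MCA_us are computed over seen / unseen test classes.
\mathrm{MCA} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}}
    \frac{\#\{\text{correctly classified samples of class } c\}}{\#\{\text{samples of class } c\}},
\qquad
\mathrm{Har} = \frac{2\,\mathrm{MCA}_s\,\mathrm{MCA}_{us}}{\mathrm{MCA}_s + \mathrm{MCA}_{us}}
```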
References
[1] J. Song, C. Shen, Y. Yang, Y. Liu and M. Song, "Transductive Unbiased Embedding for Zero-Shot Learning," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 1024-1033, doi: 10.1109/CVPR.2018.00113.
[2] A. Paul, N. C. Krishnan and P. Munjal, "Semantically Aligned Bias Reducing Zero Shot Learning," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 7049-7058, doi: 10.1109/CVPR.2019.00722.
[3] H. Huang, C. Wang, P. S. Yu and C. Wang, "Generative Dual Adversarial Network for Generalized Zero-Shot Learning," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 801-810, doi: 10.1109/CVPR.2019.00089. 
[4] J. Li, M. Jing, K. Lu, Z. Ding, L. Zhu and Z. Huang, "Leveraging the Invariant Side of Generative Zero-Shot Learning," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 7394-7403, doi: 10.1109/CVPR.2019.00758. 
[5] M. B. Sariyildiz and R. G. Cinbis, "Gradient Matching Generative Networks for Zero-Shot Learning," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 2163-2173, doi: 10.1109/CVPR.2019.00227. 
[6] E. Kodirov, T. Xiang and S. Gong, "Semantic Autoencoder for Zero-Shot Learning," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 4447-4456, doi: 10.1109/CVPR.2017.473. 
[7] A. Frome et al., "DeViSE: A Deep Visual-Semantic Embedding Model," in NIPS, 2013, pp. 2121–2129. 
[8] C. H. Lampert, H. Nickisch and S. Harmeling, "Learning to detect unseen object classes by between-class attribute transfer," 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, 2009, pp. 951-958, doi: 10.1109/CVPR.2009.5206594. 
[9] C. H. Lampert, H. Nickisch and S. Harmeling, "Attribute-Based Classification for Zero-Shot Visual Object Categorization," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 453-465, March 2014, doi: 10.1109/TPAMI.2013.140. 
[10] G. Xie et al., "Attentive Region Embedding Network for Zero-Shot Learning," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 9376-9385, doi: 10.1109/CVPR.2019.00961. 
[11] D. C. Mocanu and E. Mocanu, "One-Shot Learning using Mixture of Variational Autoencoders: a Generalization Learning approach," arXiv:1804.07645, 2018.
[12] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr and T. M. Hospedales, "Learning to Compare: Relation Network for Few-Shot Learning," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 1199-1208, doi: 10.1109/CVPR.2018.00131. 
[13] J. Hu, L. Shen and G. Sun, "Squeeze-and-Excitation Networks," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 7132-7141, doi: 10.1109/CVPR.2018.00745. 
[14] Y. Liu, J. Guo, D. Cai and X. He, "Attribute Attention for Semantic Disambiguation in Zero-Shot Learning," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 6697-6706, doi: 10.1109/ICCV.2019.00680. 
[15] K. Li, M. R. Min and Y. Fu, "Rethinking Zero-Shot Learning: A Conditional Visual Classification Perspective," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 3582-3591, doi: 10.1109/ICCV.2019.00368.
[16] T. Lin, A. RoyChowdhury and S. Maji, "Bilinear CNN Models for Fine-Grained Visual Recognition," 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 2015, pp. 1449-1457, doi: 10.1109/ICCV.2015.170.
[17] Y. Xian, C. H. Lampert, B. Schiele and Z. Akata, "Zero-Shot Learning—A Comprehensive Evaluation of the Good, the Bad and the Ugly," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 9, pp. 2251-2265, 1 Sept. 2019, doi: 10.1109/TPAMI.2018.2857768.
[18] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie and P. Perona, "Caltech-UCSD Birds 200," California Institute of Technology, 2010.
[19] G. Patterson and J. Hays, "SUN attribute database: Discovering, annotating, and recognizing scene attributes," 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, 2012, pp. 2751-2758, doi: 10.1109/CVPR.2012.6247998.
[20] S. Biswas and Y. Annadani, "Preserving Semantic Relations for Zero-Shot Learning," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 7603-7612, doi: 10.1109/CVPR.2018.00793.
[21] Y. L. Cacheux, H. L. Borgne and M. Crucianu, "Modeling Inter and Intra-Class Relations in the Triplet Loss for Zero-Shot Learning," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 10332-10341, doi: 10.1109/ICCV.2019.01043.
[22] C. Szegedy et al., "Going deeper with convolutions," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 1-9, doi: 10.1109/CVPR.2015.7298594.
[23] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2015 International Conference on Learning Representations (ICLR), San Diego, CA, 2015, pp. 1–14.
[24] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.
[25] S. Ruder, "An overview of gradient descent optimization algorithms," arXiv:1609.04747, 2017.
[26] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh and D. Batra, "Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization," 2017 IEEE International Conference on Computer Vision (ICCV), Venice, 2017, pp. 618-626, doi: 10.1109/ICCV.2017.74.
Thesis full-text access permissions
On campus
Printed thesis available on campus immediately
Electronic full text authorized for release on campus
Electronic thesis available on campus immediately
Off campus
Authorization granted
Electronic thesis available off campus immediately
