| 系統識別號 | U0002-2907202511002000 |
|---|---|
| DOI | 10.6846/tku202500626 |
| 論文名稱(中文) | 基於 SAM 的語意切片分析:朝向優化 Vision Transformer 輸入表示之探討 |
| 論文名稱(英文) | Analyzing Semantic Patch Generation with SAM: Toward Improved ViT Input Representation |
| 第三語言論文名稱 | |
| 校院名稱 | 淡江大學 |
| 系所名稱(中文) | 資訊工程學系全英語碩士班 |
| 系所名稱(英文) | Master's Program, Department of Computer Science and Information Engineering (English-taught program) |
| 外國學位學校名稱 | |
| 外國學位學院名稱 | |
| 外國學位研究所名稱 | |
| 學年度 | 113 |
| 學期 | 2 |
| 出版年 | 114 |
| 研究生(中文) | 劉哲維 |
| 研究生(英文) | Zhe-Wei Liu |
| 學號 | 611780056 |
| 學位類別 | 碩士 |
| 語言別 | 英文 |
| 第二語言別 | |
| 口試日期 | 2025-07-01 |
| 論文頁數 | 43頁 |
| 口試委員 | 口試委員 - 武士戎(wushihjung@mail.tku.edu.tw)；指導教授 - 陳啓禎(cjchen@mail.tku.edu.tw)；口試委員 - 郭文嘉(wjkuo@saturn.yzu.edu.tw) |
| 關鍵字(中) | 語意切片；圖像分割；Vision Transformer；Segment Anything Model；語意切割 |
| 關鍵字(英) | Semantic Patch; Image Segmentation; Vision Transformer; Segment Anything Model; Semantic Segmentation |
| 第三語言關鍵字 | |
| 學科別分類 | |
| 中文摘要 |
近年來，Vision Transformer（ViT）在電腦視覺領域展現強大潛力，成為許多影像辨識任務的主流架構。然而，在主流 ViT 相關論文中，資料預處理方法對模型效能的影響往往未受到足夠關注。BEiT V1 首度將 BERT 於自然語言處理領域的遮蔽預訓練思想延伸至影像，提出遮蔽影像建模自監督方法，顯著提升 ViT 模型的特徵表達能力。隨後，BEiT V2 進一步強化資料語意標註於預處理流程中的應用，突顯語意資訊在輸入端的關鍵作用。2023 年，Meta 發表 Segment Anything Model（SAM），具備自動生成高品質語意遮罩的能力，大幅拓展語意分割與切片於各式影像場景的應用潛能。本研究受到上述成果啟發，提出以 SAM 產生之語意切片取代傳統固定切片，作為 ViT 輸入資料的新型預處理方式，系統性探討此法對 ViT 模型效能的實際影響。研究採用海洋動物影像資料集進行實驗，深入分析 SAM 分割結果在不同成像條件下的語意品質，並將其語意切片應用於 ViT 訓練及推論流程。實驗結果顯示，結合 SAM 語意切片的預處理框架可提升 ViT 於分類與分割任務上的表現，並展現資料預處理於 Transformer 架構下的重要性與創新價值。本論文之貢獻在於系統性驗證資料語意預處理策略於 ViT 應用中的影響，並為後續相關研究奠定基礎。 |
| 英文摘要 |
In recent years, Vision Transformers (ViTs) have demonstrated remarkable performance in computer vision, rapidly becoming the standard architecture for various image recognition tasks. However, the influence of data preprocessing on ViT model performance is frequently underexplored in existing literature. BEiT V1 introduced the BERT-style masked pretraining strategy from natural language processing into vision, establishing self-supervised masked image modeling that significantly improved ViT representations. BEiT V2 further underscored the value of semantic annotation in preprocessing, emphasizing the importance of semantic information at the input stage. In 2023, Meta's Segment Anything Model (SAM) enabled the automatic generation of high-quality semantic masks, broadening the potential for semantic segmentation and patch generation across diverse imaging contexts. Motivated by these advancements, this study presents a novel preprocessing approach that employs SAM-generated semantic patches instead of traditional fixed-size patches as ViT input, systematically assessing the effect on ViT model performance. Experiments on a marine animal image dataset evaluate the quality of SAM-generated masks under varied conditions and integrate these semantic patches into the ViT training and inference pipeline. Results show that SAM-based semantic patch preprocessing enhances ViT performance in classification and segmentation tasks, highlighting the critical role of data preprocessing in transformer-based vision models and providing a foundation for future work. |
| 第三語言摘要 | |
| 論文目次 |
Table of Contents
Chapter 1: Introduction 1
1.1 Background and Motivation 1
1.2 Research Objectives 2
1.3 Research Scope 3
1.4 Structure of the Thesis 4
Chapter 2: Related Works 5
2.1 Vision Transformer (ViT) 5
2.2 BEiT: BERT Pre-Training of Image Transformers 7
2.3 Segment Anything Model (SAM) 8
2.4 sViT: Semantic Vision Transformer 10
Chapter 3: Research Methodology 12
3.1 Semantic Segmentation with SAM 13
3.2 Qualitative Results: Successful and Challenging Cases 13
Chapter 4: Experimental Results and Discussions 19
4.1 Dataset and Preprocessing 19
4.2 Classification Results 21
4.3 Segmentation Results 24
4.3.1 Overview of the Experimental Setup 24
4.3.2 Performance Metrics and Comparative Analysis 25
4.3.3 Confusion Matrix Analysis 26
4.4 Discussions 32
4.4.1 Classification Experiment Results 32
4.4.2 Segmentation Experiment Results 33
Chapter 5: Conclusion and Future Works 38
5.1 Conclusion 38
5.2 Future Works 39
References 41

List of Figures
Figure 1. The structure of ViT [1]. 7
Figure 2. Segment Anything Model (SAM) overview [9]. 10
Figure 3. Proposed workflow integrating SAM-based semantic segmentation with the Vision Transformer input pipeline. 13
Figure 4. Successful examples of SAM segmentation. 16
Figure 5. Failed examples of SAM segmentation. 17
Figure 6. SAM segmentation pipeline with preprocessing. 18
Figure 7. Training loss and accuracy curves for ViT-B on the marine animal dataset. 23
Figure 8. Training loss and accuracy curves for ViT-L on the marine animal dataset. 23
Figure 9. Confusion matrix of ViT+SAM on ImageNet-C. 31

List of Tables
Table 1. Structure of the thesis. 4
Table 2. Label_map: the number of images for each label. 21
Table 3. Dataset classification results (Dataset: ImageNet-C). 23
Table 4. ViT training parameters (Dataset: ImageNet-C). 24
Table 5. Segmentation performance comparison between ViT+SAM and selected SAM-based methods (Dataset: ImageNet-C). 31 |
| 參考文獻 |
[1] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, ... and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, Oct. 2020.
[2] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in Proceedings of the International Conference on Machine Learning (ICML), vol. 139, pp. 10347–10357, Jul. 2021.
[3] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, ... and B. Guo, "Swin Transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10012–10022, Oct. 2021.
[4] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, ... and Y. Zhou, "TransUNet: Transformers make strong encoders for medical image segmentation," arXiv preprint arXiv:2102.04306, Feb. 2021.
[5] H. Bao, L. Dong, S. Piao, and F. Wei, "BEiT: BERT pre-training of image transformers," arXiv preprint arXiv:2106.08254, Jun. 2021.
[6] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16000–16009, Jun. 2022.
[7] Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei, "BEiT v2: Masked image modeling with vector-quantized visual tokenizers," arXiv preprint arXiv:2208.06366, Aug. 2022.
[8] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, ... and R. Girshick, "Segment anything," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4015–4026, Oct. 2023.
[9] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, ... and D. Tao, "A survey on vision transformer," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 87–110, Jan. 2023, doi: https://doi.org/10.1109/TPAMI.2022.3152247.
[10] H. Thisanke, A. S. I. Munasinghe, and D. Herath, "Semantic segmentation using Vision Transformers: A survey," Engineering Applications of Artificial Intelligence, vol. 123, p. 107357, Mar. 2023.
[11] X. Li, H. Zhou, and D. Yang, "Transformer-based visual segmentation: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 2, pp. 234–257, Feb. 2024.
[12] C. Chen et al., "MA-SAM: Modality-agnostic SAM adaptation for 3D medical image segmentation," arXiv preprint arXiv:2309.08842, Sep. 2023.
[13] K. Zhang and D. Liu, "SAMed: Customized Segment Anything Model for medical image segmentation," arXiv preprint arXiv:2304.13785, Apr. 2023.
[14] S. Roy, A. Banik, and A. Chakraborty, "Research on medical image segmentation based on SAM and its applicability," Biosensors and Bioelectronics, vol. 241, no. 2, pp. 608–621, Apr. 2024.
[15] S. Roy, M. Rahman, and A. Chakraborty, "SAM.MD: Zero-shot medical image segmentation capabilities of the Segment Anything Model," arXiv preprint arXiv:2304.05396, Apr. 2023.
[16] Y. Li, H. Zhao, and W. Liu, "Polyp-SAM: Transfer SAM for polyp segmentation," arXiv preprint arXiv:2305.00293, May 2023.
[17] S. Gong et al., "3DSAM-adapter: Holistic adaptation of SAM from 2D to 3D for promptable tumor segmentation," arXiv preprint arXiv:2306.13465, Jun. 2023.
[18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255, Jun. 2009.
[19] J. Zhu, X. Dong, Y. Dai, and L. Zhang, "Large-scale unsupervised semantic segmentation (ImageNet-S)," arXiv preprint arXiv:2106.03149, Jun. 2021.
[20] L. Wang, J. Zhu, et al., "PartImageNet: A large, high-quality dataset of parts," in Proceedings of the Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2022.
[21] J. Li, Y. Wang, and Z. Ji, "BigDatasetGAN: Synthesizing ImageNet with pixel-wise annotations," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14368–14377, Jun. 2022.
[22] V. Iglovikov and A. Shvets, "TernausNet: U-Net with VGG11 encoder pre-trained on ImageNet for image segmentation," arXiv preprint arXiv:1801.05746, Jan. 2018. |
| 論文全文使用權限 | |
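The abstract above summarizes the core preprocessing idea: SAM automatically produces semantic masks, and the masked regions replace the conventional fixed-size grid patches as ViT input. As a rough illustration of that idea only, and not a reproduction of the thesis's actual pipeline, the sketch below uses the public `segment_anything` package for automatic mask generation and a pretrained ViT from `timm`. The checkpoint path, the example image filename, the background-zeroing bounding-box crop, and the averaging of per-patch predictions are all illustrative assumptions.

```python
# A minimal sketch, assuming the segment_anything and timm packages, a locally
# downloaded SAM ViT-B checkpoint, and a hypothetical input image. The cropping
# and prediction-averaging strategy is an illustrative assumption, not the
# method used in the thesis.
import cv2
import torch
import timm
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# 1) Generate semantic masks with SAM's automatic mask generator.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)
image = cv2.cvtColor(cv2.imread("marine_animal.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # dicts with 'segmentation', 'bbox', 'area', ...

# 2) Turn each mask into a "semantic patch": zero the background, crop the
#    mask's bounding box, and resize to the ViT input resolution.
def to_patch(img, mask, size=224):
    x, y, w, h = map(int, mask["bbox"])              # bbox is in XYWH format
    region = img * mask["segmentation"][..., None]   # keep only the masked pixels
    return cv2.resize(region[y:y + h, x:x + w], (size, size))

patches = [to_patch(image, m) for m in sorted(masks, key=lambda m: m["area"], reverse=True)]

# 3) Feed the semantic patches to a pretrained ViT and average the per-patch
#    class probabilities (one simple aggregation choice among many).
vit = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
batch = torch.stack([
    (torch.from_numpy(p).permute(2, 0, 1).float() / 255.0 - mean) / std
    for p in patches
])
with torch.no_grad():
    probs = vit(batch).softmax(dim=-1).mean(dim=0)
print("Top-1 ImageNet class index:", int(probs.argmax()))
```

In this sketch the semantic patches are classified independently and their predictions averaged; the thesis instead integrates the SAM-derived patches into the ViT training and inference pipeline, so this code should be read only as a concrete rendering of the input-representation concept.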