| System ID | U0002-0809202311222000 |
|---|---|
| DOI | 10.6846/tku202300650 |
| Title (Chinese) | 基於音頻和視覺特徵的多模式相似影片檢測與定位 |
| Title (English) | Multimodal Video Similarity Detection and Localization Using Audio-Visual Features |
| Title (Third Language) | |
| University | Tamkang University (淡江大學) |
| Department (Chinese) | 資訊工程學系全英語碩士班 |
| Department (English) | Master's Program, Department of Computer Science and Information Engineering (English-taught program) |
| Foreign Degree School | |
| Foreign Degree College | |
| Foreign Degree Institute | |
| Academic Year | 111 |
| Semester | 2 |
| Publication Year | 112 |
| Author (Chinese) | 羅曼 |
| Author (English) | Roman Akchurin |
| ORCID | 0009-0005-4830-1763 |
| Student ID | 610785015 |
| Degree | Master's |
| Language | English |
| Second Language | |
| Defense Date | 2023-06-12 |
| Pages | 75 |
| Committee | Advisor - 武士戎 (wushihjung@mail.tku.edu.tw); Committee Member - 蒯思齊 (sckuai@ntub.edu.tw); Committee Member - 張志勇 (cychang@mail.tku.edu.tw) |
| Keywords (Chinese) | 音視覺特徵; 影片內容分析; 相似影片定位; 影片副本檢測 |
| Keywords (English) | Audio-Visual Features; Video Content Analysis; Video Similarity Localization; Video Copy Detection |
| Keywords (Third Language) | |
| Subject Classification | |
| Abstract (Chinese) | 在數位時代,隨著YouTube等平台上的影片內容激增,高效的影片相似性定位(VSL)系統的需求變得至關重要。VSL以其能夠在時間上對齊相似的影片片段的能力,應對了龐大的影片數據量和不斷增加的侵權情況所帶來的挑戰。當前的VSL方法通常專注於視覺或聽覺特徵之一,但很少充分發揮兩者之間的協同作用。本研究設計了一個多模式影片相似性定位(MVSL)管道,無縫集成了音頻和視覺特徵,以增強相似性定位任務的性能。本研究引入了一個新型的深度學習管道,並使用一個特別精選的影片相似性定位數據集(VSLD),透過VSLD強調了該方法的效能。MVSL系統從影片預處理、音視頻特徵提取開始,然後進行相似性映射和時間對齊。其中音頻特徵在VSL中發揮了決定性作用。這項工作為下一代影片內容分析工具奠定了基礎,實現了自動化、可擴展和精確的影片相似性檢測,對於內容創作者和數位媒體平台具有巨大價值。實驗結果與其他研究相比也展示了出色的性能。 |
| Abstract (English) | In the digital age, with the upsurge of video content on platforms like YouTube, the need for efficient video similarity localization (VSL) systems has become paramount. By temporally aligning similar video segments, VSL addresses the challenges posed by the sheer volume of video data and the increasing incidence of copyright infringement. Current VSL methods typically focus on either visual or auditory features, but rarely capitalize on the synergy between the two. This research introduces a Multimodal Video Similarity Localization (MVSL) pipeline that seamlessly integrates audio and visual features to enhance the performance of similarity localization tasks. The study contributes a novel deep learning pipeline and a specially curated Video Similarity Localization Dataset (VSLD), and demonstrates the effectiveness of the approach on this dataset. The MVSL system begins with video preprocessing and audio-visual feature extraction, then progresses to similarity mapping and temporal alignment. The results showcase outstanding performance, with auditory features playing a decisive role in VSL. This work lays a foundation for the next generation of video content analysis tools, enabling automated, scalable, and precise video similarity detection of immense value to content creators and digital media platforms. |
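The similarity-mapping step summarized in the abstract (and shown in the outline's Figure 4.10, where batch similarity is computed as a matrix product between two batches of frame descriptors) can be sketched as follows. This is a minimal NumPy illustration under the assumption that descriptors are L2-normalized so the matrix product yields cosine similarities; the function name and shapes are illustrative, not taken from the thesis.

```python
import numpy as np

def similarity_map(query_desc, ref_desc):
    """Cosine-similarity map between two batches of frame descriptors."""
    # L2-normalize each descriptor so a dot product equals cosine similarity.
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    r = ref_desc / np.linalg.norm(ref_desc, axis=1, keepdims=True)
    # Batch similarity as a matrix product: entry (i, j) scores
    # query frame i against reference frame j.
    return q @ r.T

# Toy usage: 4 query frames, 6 reference frames, 128-dim descriptors.
rng = np.random.default_rng(0)
smap = similarity_map(rng.normal(size=(4, 128)), rng.normal(size=(6, 128)))
print(smap.shape)  # (4, 6)
```

Each row of the resulting map scores one query frame against every reference frame; diagonal bands of high similarity then indicate temporally aligned segments.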
| Abstract (Third Language) | |
| Table of Contents |
Contents
1 Introduction 1
2 Related Work 4
2.1 Video Feature Representation 5
2.2 Video Temporal Alignment 12
3 Background 19
3.1 Maximum Activation of Convolutions 19
3.2 MobileNet 20
3.3 EfficientNet 22
3.4 ConvNeXt 25
3.5 High-Resolution Network 27
3.6 Connected Components Algorithm 28
3.7 RANSAC Regression 31
4 Proposed Method 33
4.1 Video Frame Extraction 35
4.2 Frame Stack Detection 36
4.3 Visual Feature Extraction 37
4.4 Visual Similarity Map Generation 39
4.5 Audio Feature Extraction 41
4.6 Audio Similarity Map Generation 43
4.7 Audio-Visual Similarity Map Generation 44
4.8 Similarity Filtering 44
4.9 Similar Segment Localization 46
5 Experiments 48
5.1 Dataset 48
5.2 Metrics 53
5.3 Evaluation Results 56
6 Conclusion 65
Bibliography 67

List of Figures
3.1 Maximum Activation of Convolutions feature extraction method 19
3.2 Depthwise Separable Convolutions 21
3.3 Inverted Residual with Linear Bottleneck block 22
3.4 Inverted Residual with Linear Bottleneck and Squeeze-and-Excitation block 22
3.5 EfficientNets scale the CNN with constant ratio 23
3.6 EfficientNetV2 convolutional blocks 24
3.7 Progressive learning process 24
3.8 Block architectures of Swin Transformer, ResNet, and ConvNeXt 26
3.9 Fully Convolutional Masked Autoencoder 26
3.10 ConvNeXtV2 block diagram with GRN layer 27
3.11 High-Resolution Network architecture 28
3.12 4-neighbors and 8-neighbors of a pixel P 29
3.13 4-connected and 8-connected components 29
3.14 The outcome of the Connected Components algorithm 30
3.15 A fitted line with RANSAC regression algorithm 32
4.1 Video Similarity Detection and Localization objective 33
4.2 Multimodal Video Similarity Localization pipeline 34
4.3 Illustrations of vertical, horizontal, and grid-of-four stack arrangements in query videos 36
4.4 ConvNeXtV2 model for frame stack classification 36
4.5 Query videos are split into corresponding scenes based on the results from the stack classification model 37
4.6 ISCNet feature extraction process 38
4.7 Computing the similarity score between a pair of frame descriptors 38
4.8 Temporal concatenation with Gaussian weighting 39
4.9 The formation of batches of frame descriptors 40
4.10 Batch similarity is calculated as a matrix product between two batches of frame descriptors 40
4.11 Audio Fast Fourier Transform 41
4.12 Audio CNN feature extraction model 42
4.13 Computation of the audio similarity score 42
4.14 The formation of batches of audio descriptors 43
4.15 Audio batch similarity generation 43
4.16 Audio-visual similarity map generation 44
4.17 The similarity filter effectively removes the majority of dissimilar similarity maps 45
4.18 Similarity evidence is influenced by both auditory and visual content 45
4.19 Similarity localization with HRNet 46
5.1 Examples of query (left) and corresponding reference (right) videos from the VSLD training set 49
5.2 VSLD directory structure 52
5.3 Calculation of the precision metric 54
5.4 Calculation of the recall metric 55
5.5 Frame stack detection model accuracy 57
5.6 Frame stack detection model loss 57
5.7 Similarity filtering model accuracy 58
5.8 Similarity filtering model loss 59
5.9 Precision metrics evaluated on the VSLD test set 60
5.10 Recall metrics evaluated on the VSLD test set 61
5.11 F1 score evaluation results 62
5.12 Precision-recall curve of MVSL evaluated on the VSLD test set 63

List of Tables
2.1 Deep learning-based methods for extracting video features 11
2.2 Video temporal alignment methods 18
5.1 VSLD statistics 51
5.2 Micro AP evaluation results on VSLD test data 64
5.3 Ablation study of MVSL localization performance 64 |
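The localization stage listed in the outline (Connected Components Algorithm, Sec. 3.6; RANSAC Regression, Sec. 3.7; Similar Segment Localization, Sec. 4.9) can be sketched in the same spirit: threshold the audio-visual similarity map, group high-similarity cells into connected components, and fit a line through each component with RANSAC to recover the aligned time spans. This is a hedged sketch using SciPy and scikit-learn; the threshold value, 8-connectivity choice, and function name are illustrative assumptions, not the thesis's exact procedure.

```python
import numpy as np
from scipy import ndimage
from sklearn.linear_model import RANSACRegressor

def localize_segments(sim_map, threshold=0.5):
    """Recover aligned (query, reference) frame spans from a similarity map."""
    # Binarize the map, then label 8-connected components of similar cells
    # (cf. the 8-neighbor connectivity illustrated in the outline, Sec. 3.6).
    mask = sim_map > threshold
    labels, n = ndimage.label(mask, structure=np.ones((3, 3), dtype=int))
    segments = []
    for k in range(1, n + 1):
        ys, xs = np.nonzero(labels == k)  # ys: query frames, xs: reference frames
        if len(xs) < 4:
            continue  # too few cells to fit a robust alignment line
        # A RANSAC-fitted line through the component models the temporal
        # alignment; its slope reflects the relative playback speed.
        model = RANSACRegressor(min_samples=2).fit(xs.reshape(-1, 1), ys)
        segments.append({
            "query_span": (int(ys.min()), int(ys.max())),
            "ref_span": (int(xs.min()), int(xs.max())),
            "slope": float(model.estimator_.coef_[0]),
        })
    return segments
```

A clean diagonal band in the similarity map thus yields one segment whose spans give the matching query and reference intervals, robust to scattered false-positive cells thanks to the RANSAC inlier selection.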
| Full-Text Access Rights | |