§ Browse Thesis Bibliographic Record
  
System ID U0002-2909202007584900
DOI 10.6846/TKU.2020.00879
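Title (Chinese) 基於深度學習之食品廣告違規識別
Title (English) Identification of Food Advertising Violation Based on Deep Learning
Title (third language)
University Tamkang University (淡江大學)
Department (Chinese) 資訊工程學系碩士班
Department (English) Department of Computer Science and Information Engineering
Foreign degree university
Foreign degree college
Foreign degree institute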
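Academic year 108 (ROC calendar)
Semester 2
Year of publication 109 (ROC calendar)
Student (Chinese) 曾浚宥
Student (English) Jiun-You Tzeng
Student ID 607410536
Degree Master's
Language Traditional Chinese
Second language English
Oral defense date 2020-07-10
Pages 40
Committee Advisor - 張志勇 (cychang@mail.tku.edu.tw)
Advisor - 郭經華 (chkuo@mail.tku.edu.tw)
Member - 廖文華
Member - 游國忠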
Keywords (Chinese) 敏感字識別
違規字識別
實體辨識
BERT
Keywords (English) Sensitive Word recognition
Illegal Word recognition
Entity recognition
BERT
Keywords (third language)
Subject classification
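Abstract (Chinese, translated)
As the Internet has become widespread, most people's attention has shifted from traditional media such as newspapers, flyers, and television to social media on the Internet. Advertisements of all kinds have followed, moving from traditional media to platforms that bring large numbers of views, such as Google, Facebook, and Yahoo. Because social media is immediate, these advertisements also bring problems. An advertisement that contains improper content, that is, advertising language prohibited by law, is called an illegal advertisement, and once spread over the Internet it can affect people. Food advertising is one example: having moved from traditional media to social media, it reaches the public more easily, and many food advertisements claim that eating the advertised product achieves medical effects such as treating asthma, treating cancer, or whitening the skin, in order to induce consumers to buy. Not only does the food have no medical effect, it may also contain unknown ingredients that harm consumers' health, and by law the claimed medical effects should not appear in food advertisements.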
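Study [1] summarizes two commonly used methods for identifying violations. The first collects a large vocabulary into a dictionary and uses a matching algorithm to compare the text against the dictionary. The second uses word-vector techniques to convert the dictionary into word vectors, builds training data from them, and trains a classification model with machine learning or deep learning, such as a Bayesian classifier or a convolutional neural network, to identify the illegal words in the content. The first method is usually used in specialized domains: the matching algorithm can be tailored to the domain and can draw on domain knowledge, so its accuracy within that domain exceeds that of the second method. Its drawback is that designing the matching algorithm has a barrier to entry, because it requires knowledge of the domain. The second method uses machine learning or deep learning to learn the features of illegal words from large amounts of text and labels. Because it identifies violations from word-vector features, it is not limited to specialized domains and is more general in use.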
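    Although the above methods can identify food advertising violations effectively, they still face the following challenges. First, illegal food advertisements are far fewer than regular ones. Deep learning relies on large amounts of data, so applying it to this problem inevitably runs into imbalanced data, with many regular food advertisements and few illegal ones. As a result, the model may learn too few illegal words and some illegal words may go unidentified. Second, illegal words in food advertisements depend on context. For example, the advertising phrases "increase physical strength" (增加體力) and "increase immunity" (增加免疫力) both contain the word "increase", yet the former is legal and the latter is illegal. "Physical strength" refers generally to a person's source of energy, like gasoline for a car: the more gasoline, the farther the car can travel, and likewise the more physical strength a person has, the more that person can do. Food advertisements, however, may not use any terms related to physiology or medical treatment, and "immunity" is judged illegal because it involves a physiological function: immunity implies that disease can be prevented. The example shows that the surrounding words can change whether a phrase is judged a violation.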
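To address these challenges, a solution is proposed for each one below.
Challenge 1: illegal food advertisements are far fewer than regular food advertisements
To address the fact that illegal advertisements are far fewer than regular ones, the advertising terms in illegal advertisements are replaced using a thesaurus of synonyms. This produces a large number of additional illegal advertisements and balances the numbers of illegal and regular advertisements.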
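Challenge 2: illegal words in food advertisements depend on context
Illegal words in food advertisements depend on context. For example, the advertising phrases "increase physical strength" (增加體力) and "increase immunity" (增加免疫力) both contain the word "increase", yet the former is legal and the latter is illegal. "Physical strength" refers generally to a person's source of energy, like gasoline for a car: the more gasoline, the farther the car can travel, and likewise the more physical strength a person has, the more that person can do. Food advertisements may not use any terms related to physiology or medical treatment, and because immunity implies that disease can be prevented, "immunity" is judged a violation for involving a physiological function. To handle this dependence on context, the method proposed in this thesis uses the BERT model, because BERT can learn through tagging whether the contextual relationship between words constitutes a violation.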
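    This thesis uses a web crawler to collect cases of food advertising violations, builds a lexicon of food-related illegal words and a thesaurus of synonyms for those words, and looks up the thesaurus to apply BIO tagging to the illegal words and compound illegal words in the violation cases, producing the training data for food advertising violation identification. The training data are then fed into BERT for entity recognition training. The BERT model trained in this way can effectively identify illegal words in food advertisements.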
Abstract (English)
With the popularity of the Internet, most people's attention has shifted from traditional media, such as newspapers, flyers, and television, to social media on the Internet. Advertisements of all kinds have followed the trend, moving from traditional media to social media that can bring a large number of views, such as Google, Facebook, and Yahoo. Because social media is immediate, these advertisements also bring problems. An advertisement that contains improper content, that is, advertising language not allowed by law, is called an illegal advertisement, and illegal advertisements spread over the Internet may affect people. For example, food advertising has moved from traditional media to social media, making it easier to reach the general public, and many food advertisements claim that eating the advertised food achieves medical effects such as treating asthma, treating cancer, or whitening the skin, in order to induce consumers to buy. Not only does the food have no medical effect, it may also contain unknown ingredients that harm consumers' health. By law, the medical effects claimed in such advertisements should not appear in food advertisements.
    The research in [1] summarizes two common methods of violation identification. One collects a large number of words to form a dictionary and then uses a matching algorithm to compare the text content with the dictionary. The other uses word-vector techniques to convert the dictionary into word vectors, builds training data from them, and trains a classification model with machine learning or deep learning, such as a Bayesian classifier or a convolutional neural network, to identify the illegal words in the content. The former method is usually used in specialized fields: the matching algorithm can be tailored to a specific field and can draw on professional knowledge, so its accuracy in that field is better than that of the latter method. However, designing the matching algorithm has a certain threshold, because it requires knowledge of the specific field. The latter method uses machine learning or deep learning to learn the features of illegal words from a large amount of text and labels. Because it identifies violations from the features of word vectors, it is no longer limited to specialized fields and is more general in use.
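    The dictionary-based approach summarized above can be illustrated with a small sketch. It is not taken from [1] or from the thesis: the function name `find_dictionary_matches`, the greedy longest-match rule, and the toy lexicon are assumptions chosen only to show the idea of comparing text against a word list.

```python
def find_dictionary_matches(text, lexicon):
    """Return (start, end, term) for each lexicon term found in text.

    Greedy scan: at every position the longest matching term wins.
    The matching rule is an assumption for illustration only.
    """
    # Ignore empty terms so the scan always advances.
    terms = sorted((t for t in lexicon if t), key=len, reverse=True)
    matches = []
    i = 0
    while i < len(text):
        for term in terms:
            if text.startswith(term, i):
                matches.append((i, i + len(term), term))
                i += len(term)
                break
        else:
            i += 1  # no term starts here, move one character forward
    return matches


# Hypothetical lexicon and advertisement text.
lexicon = {"cures cancer", "treats asthma", "cancer"}
ad = "This tea cures cancer and treats asthma."
hits = find_dictionary_matches(ad, lexicon)
```

    A matcher like this flags only the exact strings in its lexicon, which is why the dictionary approach depends on domain knowledge to decide what goes into the word list.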
    Although the above methods can effectively identify illegal food advertisements, they still face the following challenges. First, the number of illegal food advertisements is far smaller than the number of regular food advertisements. Deep learning relies on a large amount of data, so using it to identify food advertising violations inevitably runs into imbalanced data, with many regular food advertisements and few illegal ones. This may lead to too few illegal words being learned, so that some illegal words cannot be identified. Second, the illegal words in food advertisements depend on context. For example, the two advertising terms "increase physical strength" and "increase immunity" both contain the word "increase", yet the former is legal while the latter is illegal. The reason is that "physical strength" generally refers to the source of human power, like gasoline for a car: the more gasoline a car has, the farther it can travel, and similarly, the more physical strength a person has, the more the person can do. Food advertisements, however, are not allowed to use any physiological or medical terms, so "immunity" is judged a violation because it involves a physiological function: immunity implies the prevention of disease. From this example we can see that the surrounding context can affect the determination of a violation.
    To address the above challenges, a solution is proposed for each one below.
Challenge 1: the number of illegal food advertisements is far smaller than that of regular food advertisements
    To address the fact that illegal advertisements are far fewer than regular advertisements, a thesaurus of synonyms is used to replace the advertising terms in illegal advertisements. In this way, a large number of illegal advertisements can be produced to balance the numbers of illegal and regular advertisements.
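    The synonym replacement described above can be sketched as follows. This is a minimal illustration rather than the thesis's implementation: the function name `augment_with_synonyms`, the toy thesaurus, and the English phrases are all assumptions made for the example.

```python
from itertools import product


def augment_with_synonyms(ad_text, synonyms):
    """Generate variants of ad_text by swapping known terms for synonyms.

    `synonyms` maps a term to a list of replacements (a hypothetical
    thesaurus). Every combination of original term or synonym is tried,
    and the unchanged text is left out of the result.
    """
    present = [term for term in synonyms if term in ad_text]
    # Each slot may keep the original term or take one of its synonyms.
    options = [[term] + list(synonyms[term]) for term in present]
    variants = []
    for combo in product(*options):
        variant = ad_text
        # Simple sequential replacement; assumes replacements do not
        # themselves contain other thesaurus terms.
        for term, replacement in zip(present, combo):
            variant = variant.replace(term, replacement)
        if variant != ad_text:
            variants.append(variant)
    return variants


# Hypothetical thesaurus and illegal advertising phrase.
synonyms = {"increase": ["boost", "enhance"], "immunity": ["resistance"]}
ad = "increase immunity"
variants = augment_with_synonyms(ad, synonyms)
```

    With two synonyms for one term and one for the other, a single illegal phrase yields five new variants, which is how a small set of violation cases can be expanded toward the size of the regular class.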
Challenge 2: the illegal words in food advertisements depend on context
    The illegal words in food advertisements depend on context. For example, the two advertising terms "increase physical strength" and "increase immunity" both contain the word "increase", yet the former is legal while the latter is illegal. The reason is that "physical strength" generally refers to the source of human power, like gasoline for a car: the more gasoline, the farther the car can travel, and similarly, the more physical strength a person has, the more the person can do. Food advertisements are not allowed to use any terms related to physiology or medical treatment, and because immunity implies that diseases can be prevented, "immunity" is considered a violation for involving a physiological function. To handle this dependence on context, the method proposed in this thesis uses the BERT model, because BERT can learn through tagging whether the contextual relationship between words constitutes a violation.
    This thesis uses a web crawler to collect cases of food advertising violations, builds a lexicon of food-related illegal words and a thesaurus of synonyms for those words, and queries the thesaurus to apply BIO tagging to the illegal words and compound illegal words in the violation cases, forming the training data for food advertising violation identification. The training data are then fed into the BERT model for entity recognition training. The BERT model trained by the above method can effectively identify illegal words in food advertisements.
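    The BIO tagging step mentioned above can be sketched on a toy example. The label name `VIO`, the function `bio_tag`, whitespace tokenization, and the sample phrase list are assumptions for illustration; the thesis's own tokenization and label set are not given in this record.

```python
def bio_tag(tokens, violation_phrases):
    """Label tokens with B-VIO / I-VIO / O by matching known phrases.

    `violation_phrases` is a list of token tuples (a hypothetical
    lexicon). The first token of a match gets B-VIO, the rest I-VIO,
    and every other token gets O.
    """
    tags = ["O"] * len(tokens)
    # Try longer phrases first so compound phrases take priority.
    phrases = sorted((p for p in violation_phrases if p), key=len, reverse=True)
    i = 0
    while i < len(tokens):
        for phrase in phrases:
            n = len(phrase)
            if tokens[i:i + n] == list(phrase):
                tags[i] = "B-VIO"
                for j in range(i + 1, i + n):
                    tags[j] = "I-VIO"
                i += n
                break
        else:
            i += 1  # no phrase starts at this token
    return tags


# Hypothetical advertisement and violation lexicon.
tokens = "this drink can increase immunity and increase physical strength".split()
phrases = [("increase", "immunity")]
tags = bio_tag(tokens, phrases)
```

    In this sketch the second "increase" stays tagged O because it is followed by "physical strength", so the labels carry the context distinction that a token classification model is then trained to reproduce.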
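Abstract (third language)
Table of contents
Contents
Contents X
List of figures XII
List of tables XIII
Chapter 1. Introduction 1
Chapter 2. Related Work 5
Chapter 3. Background 8
3-1 Modules and techniques 8
3-1-1 Data mining 9
3-1-2 Image processing 10
3-1-3 Natural language processing 10
Chapter 4. System Architecture 14
4-1 Problem description and goals 14
4-1-1 Problems addressed 14
4-1-2 Goals 14
4-2 System architecture 15
Chapter 5. Experimental Analysis 21
Chapter 6. Conclusion 24
References 25
Appendix: English paper 27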

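List of figures
Figure 1 Architecture of the food advertising violation identification system 15
Figure 2 Data collection stage 16
Figure 3 Data processing stage 17
Figure 4 Text processing flow 18
Figure 5 Matching and tagging flow 19
Figure 6 Model training stage 19
Figure 7 Trained model with its inputs and outputs 20
Figure 8 Violation identification performance on three test datasets 21
Figure 9 Food advertising violation identification performance 23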

List of tables
Table 1 Comparison of this thesis with related work 7
Table 2 Comparison of deep models 12
Table 3 Confusion matrix 22
References
[1]	G. Xu, C. Qi, H. Yu, S. Xu, C. Zhao and J. Yuan, "Detecting Sensitive Information of Unstructured Text Using Convolutional Neural Network," 2019 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), Guilin, China, 2019, pp. 474-479.
[2]	Y. Fu, Y. Yu and X. Wu, "A Sensitive Word Detection Method Based on Variants Recognition," 2019 International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI), Taiyuan, China, 2019, pp. 47-52.
[3]	G. Xu, X. Wu, H. Yao, F. Li and Z. Yu, "Research on Topic Recognition of Network Sensitive Information Based on SW-LDA Model," in IEEE Access, vol. 7, pp. 21527-21538, 2019.
[4]	Y. Wang, H. Guo and Q. Song, "An Intelligent System for Detecting Illegal Words in Online Advertisement," 2017 10th International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, 2017, pp. 391-395.
[5]	Web Crawler, https://wiki.mbalib.com/zh-tw/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB
[6]	Optical Character Recognition, https://zh.wikipedia.org/wiki/%E5%85%89%E5%AD%A6%E5%AD%97%E7%AC%A6%E8%AF%86%E5%88%AB
[7]	J. Devlin, M. Chang, K. Lee and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL-HLT, 2019.
[8]	Text segmentation, https://en.wikipedia.org/wiki/Text_segmentation
[9]	CkipTagger, https://iptt.sinica.edu.tw/shares/928
[10]	Thesaurus, https://zh.wikipedia.org/wiki/%E7%B4%A2%E5%BC%95%E5%85%B8
[11]	Department of Health, Taipei City Government, https://health.gov.taipei/News.aspx?n=13A23138C06A3532&sms=8E7386D329C4B210
[12]	Medical terms, http://163.16.60.27/~atlantia/learn-more/hospital.htm
[13]	Traditional Chinese medicine, http://cht.a-hospital.com/w/%E4%B8%AD%E8%8D%AF%E5%9B%BE%E5%85%B8
[14]	Mandarin Dictionary, http://dict.revised.moe.edu.tw/cbdic/
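Full-text access rights
On campus
Print copy available on campus immediately
Consent given to release the electronic full text on campus
Electronic full text available on campus immediately
Off campus
Authorization granted
Electronic full text available off campus immediately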