Tamkang University Chueh Sheng Memorial Library (TKU Library)
(Electronic full-text download is restricted to Tamkang University IP addresses)
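System ID: U0002-0309202012005200
Title (Chinese): 設計及實作基於自然語言轉換SQL之問答機器人
Title (English): Design and Implementation of Question and Answer Robot Based on Natural Language Using SQL Translation
Institution: Tamkang University
Department (Chinese): 資訊工程學系碩士班
Department (English): Department of Computer Science and Information Engineering
Academic year: 108 (ROC calendar)
Semester: 2
Publication year: 109 (ROC calendar; 2020)
Author (Chinese name): 駱知昀
Author (English name): Zhi-Yun Luo
E-mail: jame1940517@gmail.com
Student ID: 607410585
Degree: Master's
Language: Chinese
Oral defense date: 2020-06-12
Pages: 81
Oral defense committee: Advisor - 張志勇
Advisor - 郭經華
Member - 陳宗禧
Member - 陳裕賢
Keywords (Chinese): 人工智慧, 結構化查詢機器人, 資料庫問答機器人, 模糊比對, 同音異字比對
Keywords (English): artificial intelligence, structured query robot, database question answering robot, fuzzy comparison, homophone comparison
Subject classification: Applied Science > Information Engineering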
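Chinese abstract (translated): Many data query, report design, and application functions today present the user with menus on the front end and rely on a relational database system on the back end. Users making form-based queries on the front end often find the menus complicated to operate, and the data they want may not appear among the form's options. To make querying more convenient, many researchers have begun studying database question-answering (QA) robots as a replacement for form-based operation.
A traditional QA robot gives an answer based on the user's question. Building such a QA system requires a complete question-answer set containing a large number of questions and answers, so that a neural network can effectively learn the relationship between them. A database QA robot, however, faces the problem that the database itself contains only tables and no question-answer set, so the robot cannot learn the relationship between questions about the database and the query results. A general QA robot always returns an answer when the user asks a question, but the correctness of that answer is judged subjectively by the user: because users sometimes ask incomplete questions, the robot's natural language processing (NLP) mostly relies on fuzzy matching and replies with the closest result it finds. For a database QA robot, the user's question must be described completely before the robot can query the database precisely with SQL statements and display the result. The correctness of that result is judged objectively, as the query is either right or wrong. The question must therefore not only be complete; its wording must also match the data in the database, or the robot cannot run the query. These are the challenges encountered in developing a database QA robot.
In light of these problems, this thesis designs a database QA robot that accepts voice input, automatically converts natural language into the correct database language, and quickly returns the query results to the user, thereby improving the efficiency of database queries. Designing such a robot raises four challenges. First, a database QA robot has no QA data set of its own. Second, there is the homophone problem: when speech is converted to text, a homophone that differs from the database column name prevents the SQL statement from finding the data the user wants. Third, the user's question may be incomplete. Fourth, natural language must be converted into SQL query statements.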
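To address these challenges, this thesis obtains a variety of question sentences with a web crawler, marks the important terms through word and sentence segmentation, and replaces those terms with the field names in the database to build a new database question set. These questions are then used with BERT to train a model that converts natural language into an intermediate language, which identifies the database fields the user wants to query. Finally, the intermediate language produced from each question is translated into SQL statements. This solves the problem that the database has no question-answer set.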
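The question-set construction described above can be illustrated with a small sketch. It assumes the crawled questions have already been reduced to templates with a slot where the tagged term was; the templates, the `course` slot, the column names, and the label layout are all hypothetical and are not taken from the thesis.

```python
from itertools import product

# Hypothetical question templates; in the thesis these come from crawled
# questions whose key terms were tagged after word segmentation.
TEMPLATES = [
    "Who teaches {course}?",
    "Which classroom is {course} held in?",
]

# Hypothetical values taken from a `course` column of a course table.
COURSE_NAMES = ["Database Systems", "Operating Systems"]

# Column each template asks about (the label a classifier would learn).
TARGET_COLUMN = {
    "Who teaches {course}?": "teacher",
    "Which classroom is {course} held in?": "classroom",
}


def build_question_set(templates, values):
    """Fill every template with every column value and attach its label."""
    samples = []
    for template, value in product(templates, values):
        samples.append(
            {
                "question": template.format(course=value),
                "target_column": TARGET_COLUMN[template],
                "condition": {"course": value},
            }
        )
    return samples


question_set = build_question_set(TEMPLATES, COURSE_NAMES)
```

Each generated sample pairs a natural-language question with the column it asks about, which is the kind of labeled data a classifier such as BERT would need when the database itself ships with no question-answer pairs.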
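The thesis also uses a crawler to collect every character with a Zhuyin/Pinyin reading from a Mandarin dictionary and builds a homophone library, then trains a CNN that converts homophones into the correct data field names, which solves the homophone problem in speech-to-text conversion. Next, it builds a fuzzy matching algorithm with FuzzyWuzzy so that, even when the user's question is incomplete, the database field name the question refers to can still be identified. Finally, it uses an LSTM for the downstream task of the BERT model: the LSTM converts the BERT model's predictions into SQL query language.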
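To make the homophone problem concrete, the sketch below resolves a misrecognized term by comparing pronunciations instead of characters. It is a plain dictionary lookup, not the CNN the thesis trains, and the tiny pronunciation table, the function names, and the sample course names are illustrative assumptions.

```python
# Tiny hand-written pronunciation table (pinyin with tone number).
# The thesis builds its table by crawling a Mandarin dictionary; these
# entries are illustrative assumptions.
PRONUNCIATION = {
    "程": "cheng2", "城": "cheng2",
    "式": "shi4", "市": "shi4",
    "設": "she4", "計": "ji4",
}


def sound_key(text):
    """Map each character to its pronunciation; unknown characters map to themselves."""
    return tuple(PRONUNCIATION.get(ch, ch) for ch in text)


def resolve_homophone(heard, column_values):
    """Return the database value that sounds identical to `heard`, or None."""
    heard_key = sound_key(heard)
    for value in column_values:
        if sound_key(value) == heard_key:
            return value
    return None


# Speech-to-text produced 城市設計 ("urban design"), but the table stores
# 程式設計 ("programming"); both are pronounced the same.
courses = ["資料結構", "程式設計"]
match = resolve_homophone("城市設計", courses)
```

Without such a step, the recognized text would be placed in the `WHERE` clause verbatim and the query would return no rows even though the spoken question was correct.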
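The experimental data show that, with the functions addressing these four challenges, the proposed robot queries the database more effectively and accurately than other database QA robots when the user's question is incomplete. In an automated and intelligent way, it helps users obtain data from the database simply by asking in natural language, and it improves the efficiency of querying and analyzing data.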
English abstract: Many data query, report design, and program operation functions today provide a user menu on the front end and depend on a relational database system on the back end. Users often find the menus complicated when making form-based queries on the front end, and the data they want to look up may not appear among the form options. To let users query data more conveniently, many scholars have begun to study database QA robots as a replacement for form-based operation.
A traditional QA robot gives an answer based on the user's question. Building a QA system requires a complete question-and-answer set containing a large number of questions and answers, so that a neural network can effectively learn the relationship between them. When training a neural network model for a database QA robot, however, the database itself has only data tables and no question-and-answer set, so the robot cannot learn the relationship between questions about the database and the query results. A general QA robot always gives an answer when a user asks a question, but the correctness of that answer rests on the user's subjective evaluation. Because the questions users raise are sometimes incomplete, most QA robots use fuzzy matching in natural language processing (NLP) to search and reply with the closest result. For a database QA robot, the user's question must be described completely so that the robot can query the database accurately with SQL statements and display the result. The correctness of that result rests on objective evaluation: the query is either correct or wrong. The user's description must therefore be complete, and its wording must also match the data in the database; otherwise the QA robot cannot run the query. These are the challenges that the development of database QA robots will encounter.
Based on these problems, this thesis designs a database QA robot that not only accepts voice input but also automatically converts natural language into the correct database language and quickly returns the query results to the user, improving the efficiency of database queries. Designing such a robot raises four challenges. The first is that a database QA robot has no QA data set of its own. The second is the homophone problem: once speech is converted to text, a homophone that differs from the database column name prevents the SQL statement from finding the data the user intends to query. The third is that the user's question may be incomplete. The fourth is converting natural language into SQL query statements.
To meet these challenges, this thesis uses a crawler to collect a variety of questions, applies word and sentence segmentation to mark the important words, and replaces those words with field names from the database to build a new database question set. BERT is then used to train a model that converts natural language into an intermediate language, which shows which database fields the user wants to query. Finally, the intermediate language for each question is translated into SQL statements, which solves the problem that the database has no question-and-answer set.
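The last step, translating an intermediate form into SQL, might look like the following sketch. The abstract does not specify the intermediate language, and the thesis performs this step with a trained model, so the dictionary layout (`select`, `aggregate`, `where`), the `courses` table, and its columns are assumptions made only for illustration; a rule-based translator stands in for the learned one.

```python
import sqlite3

# Hypothetical schema: only these names may appear in generated SQL.
ALLOWED_COLUMNS = {"course", "teacher", "classroom"}
ALLOWED_AGGREGATES = {None, "COUNT"}


def to_sql(intermediate):
    """Turn a hypothetical intermediate form into (sql, parameters)."""
    select = intermediate["select"]
    aggregate = intermediate.get("aggregate")
    where = intermediate.get("where", {})
    # Identifiers cannot be bound as parameters, so validate them against a whitelist.
    if select not in ALLOWED_COLUMNS or not set(where) <= ALLOWED_COLUMNS:
        raise ValueError("unknown column")
    if aggregate not in ALLOWED_AGGREGATES:
        raise ValueError("unknown aggregate")
    target = f"{aggregate}({select})" if aggregate else select
    sql = f"SELECT {target} FROM courses"
    if where:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in where)
    # Values are passed separately so the driver handles quoting.
    return sql, list(where.values())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (course TEXT, teacher TEXT, classroom TEXT)")
conn.executemany(
    "INSERT INTO courses VALUES (?, ?, ?)",
    [("Database Systems", "Chen", "E201"), ("Operating Systems", "Lin", "E305")],
)

# "Who teaches Database Systems?" expressed in the intermediate form.
sql, params = to_sql({"select": "teacher", "where": {"course": "Database Systems"}})
rows = conn.execute(sql, params).fetchall()
```

Keeping user-supplied values out of the SQL string and restricting identifiers to a known set is one way to make generated queries safe to execute, whichever model produces the intermediate form.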
The thesis also uses a crawler to capture all characters with Zhuyin/Pinyin readings in a Mandarin dictionary and builds a homophone database. A CNN is then trained to convert homophones into the correct data field names, which solves the homophone problem in speech-to-text conversion. Next, FuzzyWuzzy is used to build a fuzzy comparison algorithm that identifies the database field name a question refers to even when the question is incomplete. Finally, an LSTM performs the downstream task of the BERT model: it transforms the BERT model's predictions into SQL query language.
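The fuzzy comparison step can be sketched with a plain edit-distance score. This is a standard-library stand-in, not the FuzzyWuzzy-based algorithm the thesis builds; the function names, the 0.5 threshold, and the sample column names are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """Score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def best_match(term, candidates, threshold=0.5):
    """Return the candidate most similar to `term`, or None below the threshold."""
    score, candidate = max((similarity(term, c), c) for c in candidates)
    return candidate if score >= threshold else None


# Hypothetical column names; the user's term is misspelled or incomplete.
columns = ["teacher", "classroom", "course_name"]
guess = best_match("techer", columns)
```

The threshold matters: returning `None` for a poor match lets the robot ask the user to rephrase instead of silently querying the wrong column.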
According to the experimental data, with the functions that address these four challenges, the proposed robot queries the database more effectively and accurately than other database QA robots when the user's question is incomplete. It helps users obtain database data easily through natural-language questions in an automated and intelligent way, and it improves the efficiency of data query and analysis.
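Table of contents:
Table of Contents
List of Figures
List of Tables
Chapter 1. Introduction
Chapter 2. Related Work
Chapter 3. Background
3.1 FuzzyWuzzy and homophone techniques
3.2 LSTM
3.3 BERT
Chapter 4. System Architecture
4.1 Environment and problem description
4.1.1 Problems to be solved
4.1.2 Goals
4.2 System architecture
4.2.1 Training data
4.2.2 Question recognition
4.2.3 SQL conversion
Chapter 5. System Implementation
Chapter 6. Experimental Analysis
Chapter 7. Conclusion
References
Appendix: English paper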


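List of figures
Figure 1: The system is divided into data processing, question recognition, and natural-language-to-SQL conversion
Figure 2: Long short-term memory (LSTM) model
Figure 3: Masked language model (Masked LM)
Figure 4: Next Sentence Prediction
Figure 5: BERT downstream task models
Figure 6: Complicated menu operation in the course query system
Figure 7: Complicated menu operation on the Ministry of the Interior statistics query site
Figure 8: The homophone problem
Figure 9: The statistical query problem
Figure 10: Design and implementation of a natural-language-to-SQL question answering robot: course query example
Figure 11: Design and implementation of a natural-language-to-SQL question answering robot: statistics query example
Figure 12: System architecture diagram
Figure 13: Training data architecture diagram
Figure 14: Web crawler collecting a variety of questions
Figure 15: Web crawler collecting a variety of questions
Figure 16: Question recognition module
Figure 17: BERT model classification
Figure 18: Fuzzy matching algorithm
Figure 19: Building the homophone library
Figure 20: Training the CNN model and matching against the homophone library
Figure 21: Custom date rules
Figure 22: Custom statistical function rules
Figure 23: SQL converter
Figure 24: LSTM model converting natural language into SQL statements
Figure 25: Campus course query database
Figure 26: Ministry of the Interior statistics query database
Figure 27: Algorithm flow for automatically generating a large volume of query training data
Figure 28: Training data for course information queries
Figure 29: Training data for Ministry of the Interior statistics queries
Figure 30: Conversion format rules for custom dates and statistical functions
Figure 31: BERT model classification results
Figure 32: Fuzzy matching algorithm results
Figure 33: Results of SQL conversion and query
Figure 34: Flow for converting the user's question into SQL results
Figure 35: Results of converting a course information question into an SQL query
Figure 36: Homophone matching results
Figure 37: Question test results: course information query example
Figure 38: Question test results: Ministry of the Interior statistics query example
Figure 39: Precision and Recall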

List of tables
Table 1: Feature comparison of related work


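Thesis usage rights
  • The author grants a royalty-free license allowing library patrons to reproduce the print copy for academic purposes; available from 2020-09-04.
  • The author authorizes the electronic full-text browsing and printing service; available from 2020-09-04.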

