§ Browse thesis bibliographic record
  
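System ID U0002-0408202315041900
Thesis title (Chinese) 基於機器學習之早期預測學術文獻影響力研究
Thesis title (English) Using Machine Learning for Early Academic Influence Prediction
Thesis title (third language)
Institution Tamkang University
Department (Chinese) 資訊與圖書館學系碩士班
Department (English) Department of Information and Library Science
Foreign degree university
Foreign degree college
Foreign degree graduate institute
Academic year 111 (ROC calendar, 2022-2023)
Semester 2
Publication year 112 (ROC calendar, 2023)
Graduate student (Chinese) 陳倩琪
Graduate student (English) Sin Kei Chan
Student ID 610000092
Degree Master's
Language Traditional Chinese
Second language
Oral defense date 2023-06-19
Pages 73
Oral defense committee Advisor - 張嘉玲 (lindam@mail.tku.edu.tw)
Committee member - 劉譯閎
Committee member - 林逸農
Keywords (Chinese) 學術影響力
早期預測
冷啟動問題
資料不平衡
Keywords (English) Academic Influence
Early Prediction
Cold Start Problem
Data Imbalance
Keywords (third language)
Subject classification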
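Abstract (Chinese)
Academic influence is commonly used to evaluate academic fields and institutions, and thereby reflects the current state of development of a field. The strengths it reveals can serve as models and the weaknesses can be improved, which advances scholarship, so it has long attracted wide attention. Earlier work on predicting academic influence could not make predictions for newly published papers because of the cold start problem. This study focuses on early prediction of the influence of academic papers: at the moment a paper is published, it predicts whether the paper will go on to receive a high citation count. This makes it possible to observe the current state and development trends of a field, and gives researchers, research institutions, journal editors, and reviewers a basis for reference.
Accordingly, this study uses machine learning to build an early prediction model for the influence of academic papers, predicting from three dimensions: paper content, authors, and journal. Its distinguishing feature is the addition of journal evaluation indicators from Scopus and JCR, with the aim of describing the makeup of a paper's influence more completely and so improving the model's predictive performance. Because disciplines differ, this study takes library and information science as its research subject.
In the course of the study, we identified and addressed the data imbalance problem in academic paper influence. We applied SMOTE and adopted an ensemble stacking architecture: the first stage predicts with Long Short-Term Memory, Multilayer Perceptron, Support Vector Machine, Logistic Regression, Random Forest, XGBoost, and Naive Bayes, and the second stage predicts with Logistic Regression. The experimental results show that the proposed method outperforms the other algorithm models and is able to predict, at the moment of publication, whether a paper will receive a high citation count in the future.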
Abstract (English)
Academic influence is frequently used to assess academic fields and institutions, and it reflects the current state of development of a field. As a result, it has drawn continuous attention from many sectors. Previous approaches to predicting academic influence cannot handle newly published papers because of the cold start problem. This study focuses on early academic influence prediction: predicting, at the time of publication, whether a paper will receive a high citation count in the future. This allows the current state and development trends of a field to be observed, and provides a reference for researchers, research institutions, journal editors, and reviewers.
To that end, a machine learning model is developed for early prediction of the academic influence of research papers. The model draws on three dimensions: the content of the paper, the authors, and the journal. Its notable feature is the inclusion of Scopus and JCR (Journal Citation Reports) journal evaluation metrics, which are intended to describe the makeup of a paper's influence more fully and so improve predictive performance. Because disciplines differ, this study takes Library and Information Science as its research field.
In the study, we identified the data imbalance problem in academic influence data and addressed it with SMOTE, combined with an ensemble stacking architecture. The first stage of the model predicts with Long Short-Term Memory, Multilayer Perceptron, Support Vector Machine, Logistic Regression, Random Forest, XGBoost, and Naive Bayes. The second stage predicts with Logistic Regression. Experimental results show that the proposed method outperforms the other algorithms and that it can predict, at the time of publication, whether a paper will receive a high citation count in the future.
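The two-stage setup described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the data are synthetic (`make_classification` stands in for the paper, author, and journal features), `smote_oversample` is a simplified hand-written version of SMOTE-style interpolation, the Long Short-Term Memory and XGBoost base learners are left out so that the sketch needs only NumPy and scikit-learn, and all hyperparameters are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import NearestNeighbors
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def smote_oversample(X, y, minority_label=1, k=5, seed=0):
    """Simplified SMOTE: add points interpolated between minority neighbours
    until both classes of a binary dataset have the same size."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_new = int((y != minority_label).sum() - len(X_min))
    if n_new <= 0 or len(X_min) < 2:
        return X, y
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.full(n_new, minority_label, dtype=y.dtype)])
    return X_bal, y_bal


# Synthetic stand-in data: roughly 10% of samples are "highly cited" (label 1).
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Oversample the training split only, so the test split stays untouched.
X_bal, y_bal = smote_oversample(X_train, y_train)

# First stage: a subset of the base learners named in the abstract.
first_stage = [
    ("mlp", make_pipeline(StandardScaler(),
                          MLPClassifier(max_iter=1000, random_state=0))),
    ("svm", make_pipeline(StandardScaler(),
                          SVC(probability=True, random_state=0))),
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("nb", GaussianNB()),
]

# Second stage: Logistic Regression over the first-stage predictions.
stack = StackingClassifier(estimators=first_stage,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X_bal, y_bal)

auc = roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1])
print(f"AUC on held-out split: {auc:.3f}")
```

One simplification to keep in mind: the oversampled points are generated before `StackingClassifier` runs its internal cross-validation, so synthetic samples can appear in both the fitting and validation folds of the first stage. A stricter setup would oversample inside each fold.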
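Abstract (third language)
Table of contents
Contents
List of figures
List of tables
Chapter 1 Introduction
Section 1 Research background and motivation
Section 2 Research objectives and questions
Section 3 Research scope and limitations
Chapter 2 Literature review
1) Research on key indicators of academic influence
2) Research on predicting academic influence
3) Research on early prediction of academic influence
4) Research on citation time windows
5) Summary
Chapter 3 Research methods and design
Section 1 Early prediction model for academic paper influence
Section 2 Inclusion criteria and scope
Section 3 Research process
1. Dimensions and attributes
2. Data collection
3. Data preprocessing
4. Modeling
5. Evaluation
6. Experiments
Chapter 4 Results and discussion
1) Citation patterns in library and information science
2) Prediction results by algorithm
3) Data imbalance algorithms
4) Ensemble model
5) Discussion of building the early prediction model for academic paper influence in library and information science
Chapter 5 Conclusion
References
List of figures
Figure 1 Ensemble model for early prediction of academic paper influence
Figure 2 Research process
Figure 3 Distribution of academic paper influence, 2017
Figure 4 Distribution of academic paper influence, 2018
Figure 5 Distribution of academic paper influence, 2019
Figure 6 Distribution of academic paper influence, 2020
Figure 7 Distribution of academic paper influence, 2021
Figure 8 Distribution of academic paper influence, 2022
Figure 9 Citations per year of library and information science papers, 2017-2022
Figure 10 Results by year of the proposed model without oversampling
Figure 11 AUC by year of single algorithms with SMOTE
Figure 12 Total citations per year of papers, 2017-2022
List of tables
Table 1 Journals included in the study
Table 2 Paper content dimension indicators for early prediction of academic paper influence
Table 3 Author dimension indicators for early prediction of academic paper influence
Table 4 Explanation of JCR and Scopus journal indicators
Table 5 Journal indicators for early prediction of academic paper influence
Table 6 EM classification results for academic paper influence, 2017
Table 7 EM classification results for academic paper influence, 2018
Table 8 EM classification results for academic paper influence, 2019
Table 9 EM classification results for academic paper influence, 2020
Table 10 EM classification results for academic paper influence, 2021
Table 11 EM classification results for academic paper influence, 2022
Table 12 Prediction scheme for precision and recall
Table 13 Comparison of SMOTE and ADASYN in single-algorithm modeling
Table 14 Accuracy comparison of single-algorithm models with SMOTE
Table 15 Accuracy comparison of single-algorithm models with ADASYN
Table 16 Comparison of single-algorithm models and the ensemble model, 2017-2022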
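Full-text access rights
National Central Library
Royalty-free license granted to the National Central Library; the bibliographic record and full-text electronic file will be made public on the Internet on 2028-08-14 (electronic full text delayed)
On campus
Print thesis available on campus immediately
Electronic full text authorized for on-campus access
On-campus electronic thesis available immediately
Off campus
License granted to database vendors
Off-campus electronic thesis delayed until 2028-08-14 (electronic full text delayed)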
