Detection of plagiarism in scientific texts based on text blocking and the cosine similarity measure

Article type: Research article

Authors

Department of Computer Engineering, Naqsh Jahan Institute of Higher Education, Isfahan, Iran

Abstract

Over the past decade, as access to the Internet has expanded, it has become faster and easier to reach ideas, documents, articles, manuscripts, and data collected by others. This has made it easier for researchers and producers of knowledge to exchange information and ideas, but it has also made unauthorized copying, summarizing without citing the source, and plagiarism in general easier. Because universities and educational centers make scientific and research resources readily available to most users, determining the originality of scientific texts is more important, and consequently more sensitive, in these centers. This study presents a method that divides documents into blocks so that related blocks can be compared. In the proposed method, documents are first divided into two groups, original documents and suspicious documents, and are then preprocessed to remove stop words and re-segment sentences. The documents are then split into blocks, and cosine similarity is used to determine how similar the texts are to one another. In a test on 50 documents from the dataset, the proposed method achieved an accuracy of 94 percent, a 2 percent improvement over one comparable method.

Keywords


Article title [English]

Detection of plagiarism in scientific texts based on text blocking and cosine similarity criteria

Authors [English]

  • Negar Majma
  • Sara Bashtin
Department of Computer Engineering, Naqsh Jahan Institute of Higher Education, Isfahan, Iran
Abstract [English]

In the last decade, the expansion of Internet access has made it faster and easier to reach ideas, documents, articles, manuscripts, and data collected by others. This has simplified the exchange of information and ideas among researchers and producers of knowledge, but it has also made unauthorized copying, summarizing without citing the source, and plagiarism in general easier. Since universities and educational centers make scientific and research resources easily available to most users, verifying the originality of scientific texts is more important, and therefore more sensitive, in these centers. This research presents a method that divides documents into blocks and compares related blocks with each other. In the proposed method, the documents are first classified into two categories, original documents and suspicious documents, and are then preprocessed to remove stop words and re-segment sentences. The documents are then split into blocks, and cosine similarity is used to measure how similar the texts are to each other. In a test on 50 documents from the dataset, the proposed method reached an accuracy of 94%, an improvement of 2% over one similar method.
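The pipeline described in the abstract (stop-word removal, sentence segmentation, blocking, and block-wise cosine similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the block size of two sentences, the raw term-count weighting, the 0.8 threshold, and all function names are assumptions chosen only for illustration, since the abstract does not specify them.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; the paper's list is not given.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "are", "to", "with"}


def split_sentences(text):
    """Split text into sentences on '.', '!' and '?' (a deliberately simple rule)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence, extract word tokens, and drop stop words."""
    return [w for w in re.findall(r"\w+", sentence.lower()) if w not in STOP_WORDS]


def make_blocks(text, sentences_per_block=2):
    """Group consecutive sentences into blocks; return one term-count vector per block."""
    sentences = split_sentences(text)
    blocks = []
    for i in range(0, len(sentences), sentences_per_block):
        tokens = []
        for sentence in sentences[i:i + sentences_per_block]:
            tokens.extend(tokenize(sentence))
        if tokens:
            blocks.append(Counter(tokens))
    return blocks


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_block_matches(suspicious_text, original_text, sentences_per_block=2):
    """For each suspicious block, return its highest similarity to any original block."""
    suspicious = make_blocks(suspicious_text, sentences_per_block)
    original = make_blocks(original_text, sentences_per_block)
    return [max((cosine_similarity(s, o) for o in original), default=0.0)
            for s in suspicious]


def is_suspicious(suspicious_text, original_text, threshold=0.8):
    """Flag the document if any block reaches the (assumed) similarity threshold."""
    return any(score >= threshold
               for score in best_block_matches(suspicious_text, original_text))
```

Comparing block by block rather than document by document means that a copied passage embedded in otherwise unrelated text still produces one high-scoring block, instead of being diluted in a single whole-document score.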

Keywords [English]

  • Plagiarism
  • Recognizing the authenticity of scientific texts
  • Cosine distance
  • Block text