Detection of plagiarism in scientific texts based on text blocking and cosine similarity criteria

Document Type: Original Article

Authors

Department of Computer Engineering, Naqsh Jahan Institute of Higher Education, Isfahan, Iran

Abstract

Over the last decade, the expansion of the World Wide Web has made ideas, documents, articles, manuscripts, and data collected by others faster and easier to access. This has simplified the exchange of information and ideas among researchers and producers of science, but it has also made it easier to copy without authorization, to write summaries without citing the source, and, more generally, to plagiarize texts. Because universities and educational centers make scientific and research resources readily available to most users, verifying the authenticity of scientific texts in these institutions is both more important and more sensitive. This research presents a method that compares related parts of documents by dividing the documents into blocks. In the proposed method, the documents are first classified into two categories, main documents and suspicious documents, and are then preprocessed to remove stop words and the effects of rewording. Next, the documents are segmented, and cosine similarity is used to determine the degree of similarity between the texts. In a test on 50 documents from the data set, the proposed method achieves an accuracy of 94%, an improvement of 2% over one of the similar methods.
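
The pipeline outlined in the abstract (preprocessing, blocking, cosine similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stop-word list, the fixed block size of five tokens, the 0.8 threshold, and all function names are assumptions chosen for illustration, and term-frequency vectors stand in for whatever weighting the paper actually uses.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; the paper's list is not given).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "are", "to", "with"}


def preprocess(text):
    """Lowercase the text, tokenize on letters/digits, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def make_blocks(tokens, block_size=5):
    """Split a token list into consecutive, non-overlapping blocks."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def cosine_similarity(block_a, block_b):
    """Cosine similarity between the term-frequency vectors of two blocks."""
    a, b = Counter(block_a), Counter(block_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty block is similar to nothing
    return dot / (norm_a * norm_b)


def suspicious_blocks(main_text, suspicious_text, block_size=5, threshold=0.8):
    """Return (block index, best score) for suspicious blocks at or above the threshold."""
    main = make_blocks(preprocess(main_text), block_size)
    suspect = make_blocks(preprocess(suspicious_text), block_size)
    hits = []
    for j, block in enumerate(suspect):
        # Compare each suspicious block with every block of the main document.
        best = max((cosine_similarity(block, m) for m in main), default=0.0)
        if best >= threshold:
            hits.append((j, best))
    return hits
```

Calling `suspicious_blocks(main_text, suspicious_text)` returns the indices of the suspicious document's blocks whose best match in the main document reaches the threshold. Comparing block against block, rather than whole documents, is what lets a short copied passage stand out inside an otherwise original text.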

Keywords

