Academic Paper

SimilaCode: Programming Source Code Similarity Detection System Based on NLP
Document Type
Conference
Source
2023 15th International Congress on Advanced Applied Informatics Winter (IIAI-AAI-Winter), pp. 171-178, Dec. 2023
Subject
Computing and Processing
Source coding
Plagiarism
Prototypes
Programming
Linguistics
Licenses
Natural language processing
Code Plagiarism
Programming Languages
Python
Code Clone
Vector Cosine Model
Abstract
Several tools have been developed in the scientific field to detect similarities between texts; however, some of this software is not very effective at detecting plagiarism in programming source code. Plagiarism of source code is to be expected in computing, and tools that measure the degree of similarity currently exist, but they require paid licenses. This article proposes a system that uses Natural Language Processing (NLP), vector space models, and similarity metrics to identify the degree of divergence between pairs of source codes written in the Python programming language, with the possibility of extending its applicability to other programming languages. The proposed system is organized into several modules, each with a specific function in the back end or the front end of the prototype, which is deployed on the web. The experiments used pairs of source codes that had been modified at the linguistic and structural levels. The results show that our system, SimilaCode, reports 100% similarity for source code pairs that differ only in their comments. The system also identified similarities when the names of variables and functions had been modified, reaching similarity levels above 88%. In addition, the similarity scores were compared with those of two other plagiarism detection tools, and the results differed by less than 1% between the tools. The experiments with SimilaCode yielded satisfactory results, demonstrating the system's effectiveness in detecting similarities in the analyzed source codes.
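The general idea named in the abstract (a vector space model combined with a cosine similarity metric over Python source code) can be illustrated with a minimal sketch. This is not the SimilaCode implementation, whose modules and preprocessing are not described in this record. The function names `token_counts`, `cosine_similarity`, and `code_similarity` are hypothetical, and the choices to tokenize with the standard `tokenize` module, to weight tokens by raw counts, and to discard comments and layout tokens are assumptions made only for illustration.

```python
import io
import math
import tokenize
from collections import Counter

# Assumption for this sketch: comments and layout tokens carry no program
# meaning, so they are excluded from the vector.
_SKIPPED_TYPES = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def token_counts(source: str) -> Counter:
    """Map Python source text to a bag-of-tokens vector (token -> count)."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return Counter(tok.string for tok in tokens if tok.type not in _SKIPPED_TYPES)


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty program has no direction to compare
    return dot / (norm_a * norm_b)


def code_similarity(source_a: str, source_b: str) -> float:
    """Similarity between two Python sources under the assumptions above."""
    return cosine_similarity(token_counts(source_a), token_counts(source_b))
```

Under these assumptions, two programs that differ only in their comments map to the same vector and score approximately 1.0, while renaming variables and functions lowers the score because identifier tokens no longer match. A raw token-count model like this one is more sensitive to renaming than the results reported in the abstract suggest, so the actual system presumably applies additional normalization that this sketch does not attempt to reproduce.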