Academic Article

DP-CCL: A Supervised Contrastive Learning Approach Using CodeBERT Model in Software Defect Prediction
Document Type
Periodical
Source
IEEE Access, vol. 12, pp. 22582-22594, 2024
Subject
Aerospace
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Engineering Profession
Fields, Waves and Electromagnetics
General Topics for Engineers
Geoscience
Nuclear Engineering
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Codes
Semantics
Self-supervised learning
Predictive models
Object oriented modeling
Source coding
Computer bugs
Fault detection
Deep learning
Learning systems
Software bug prediction
software fault prediction
software defect prediction
BERT
CodeBERT
language model
pre-trained model
deep learning
contrastive learning
contrastive loss
Language
English
ISSN
2169-3536
Abstract
Software Defect Prediction (SDP) reduces the overall cost of software development by identifying code at a higher risk of defects in the initial phases of development, helping test engineers allocate testing resources more effectively. Traditional SDP models are built on handcrafted software metrics that ignore the structural, semantic, and contextual information in the code. Consequently, many researchers have employed deep learning models to capture contextual, semantic, and structural information from the code. In this article, we propose the DP-CCL (Defect Prediction using CodeBERT with Contrastive Learning) model to predict defective code. The proposed model employs supervised contrastive learning on top of the CodeBERT language model to capture semantic features from the source code. Contrastive learning extracts valuable information from the data by maximizing the similarity between similar data pairs (positive pairs) while minimizing the similarity between dissimilar data pairs (negative pairs). Moreover, the model combines the semantic features with software metrics to obtain the benefits of both semantic and handcrafted features. The combined features are fed to a logistic regression model that classifies code as either buggy or clean. In this study, ten PROMISE projects were used to conduct the experiments. Results show that the DP-CCL model achieved a significant improvement, i.e., a 4.9% to 14.9% increase in F-score, compared with existing approaches.
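The pipeline the abstract describes — a supervised contrastive objective over code embeddings, followed by concatenation with handcrafted metrics for classification — can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the function name `sup_con_loss`, the temperature value, and the random arrays standing in for CodeBERT embeddings and software metrics are all assumptions for demonstration.

```python
import numpy as np

def sup_con_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss: for each sample, pull same-label
    (positive) pairs together and push different-label (negative)
    pairs apart in the embedding space."""
    # L2-normalize so the dot product is cosine similarity
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # no positive pair for this anchor
        denom = sum(np.exp(sim[i, a]) for a in range(n) if a != i)
        total += -sum(np.log(np.exp(sim[i, p]) / denom)
                      for p in positives) / len(positives)
        anchors += 1
    return total / anchors

# Stand-ins for learned CodeBERT embeddings and handcrafted metrics
# (in the paper these come from the fine-tuned encoder and PROMISE data).
semantic_features = np.random.rand(6, 8)   # 6 files, 8-dim embeddings
software_metrics = np.random.rand(6, 4)    # 6 files, 4 handcrafted metrics

# Combine both feature types; the result would feed a logistic
# regression classifier (buggy vs. clean) as described in the abstract.
combined = np.concatenate([semantic_features, software_metrics], axis=1)
```

Note how the loss behaves as intended: embeddings that are already separated by label yield a smaller loss than embeddings where buggy and clean samples coincide, which is the signal that drives the contrastive fine-tuning.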