Academic Paper

BugHub: A Large Scale Issue Report Dataset
Document Type
Conference
Source
2024 19th European Dependable Computing Conference (EDCC), pp. 9-16, Apr. 2024
Subject
Computing and Processing
Computer languages
Computer bugs
Europe
Machine learning
Software
datasets
issue reports
machine learning
Language
ISSN
2642-5610
Abstract
Data on issue reports have been used extensively in the literature for diverse applications. For example, in the last few years, a series of Machine Learning (ML) approaches and models have been proposed to automate software defect management processes, e.g., classification, prioritization and triage of bug-fixing and implementation requests. Such works depend entirely on issue report data and show a growing need for high-quality and heterogeneous datasets, which are not readily available in the field. This paper presents a dataset containing over 2.4 million issue reports collected from 93 projects of varied nature, hosted by three tracking systems and written in 16 widely used programming languages. To demonstrate the potential of the dataset, three case studies are discussed, in which more than 660,000 labelled samples are used to investigate critical aspects of the automatic classification of issue reports using ML. Results show that our dataset has great potential and meets the quality requirements for studies that rely on issue report data.