Academic Paper

DVC in Open Source ML-development: The Action and the Reaction
Document Type
Conference
Source
2024 IEEE/ACM 3rd International Conference on AI Engineering – Software Engineering for AI (CAIN), pp. 75-80, Apr. 2024
Subject
Computing and Processing
Customer services
Training data
Organizations
Machine learning
Control systems
Data models
Distance measurement
Empirical Software Engineering
Data Version Control
Software Evolution
SE4AI
Language
Abstract
Machine Learning (ML) systems are gaining popularity, reshaping various domains ranging from customer services to software engineering. The effectiveness of ML systems is dependent on the quality of their training data. Therefore, practitioners invest substantial time experimenting with different data, parameters, and models to guarantee the quality of the end system. Prior work highlighted unique challenges of developing ML systems, particularly concerning versioning data and models. Recently, various tools such as DVC and MLFlow have emerged to aid developers in the storage and tracking of data. Despite their growing popularity, very little is known about their usage patterns and impact on open-source software (OSS) systems. To address this gap, we conducted an empirical study on 56 GitHub OSS projects that use DVC to understand the DVC usage pattern and the impact of using DVC on the software development process. We found that Versioning and tracking is the most adopted DVC feature, being utilized by all 56 projects and being the only adopted feature in 85.7% of them. Furthermore, we found that DVC has a significant impact on the software development process indicators such as the number of created pull requests (PRs), and the number of bug-fix commits. For instance, our findings showed that DVC causes a peak in the number of commits and PRs at the moment of the adoption, followed by a long-term decrease. We believe that our findings can assist practitioners in tailoring tools to better meet user requirements and help organizations realize potential outcomes of adopting such tools.
CCS Concepts
• Software and its engineering → Software configuration management and version control systems.