Academic Paper

MSR4ML: Reconstructing Artifact Traceability in Machine Learning Repositories
Document Type
Conference
Source
2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 536-540, Mar. 2021
Subject
Computing and Processing
Productivity
Machine learning algorithms
Software algorithms
Machine learning
Manuals
Tools
Software
Model Traceability
Machine Learning Operations
Mining Software Repositories
Model Mining
Metadata Extraction
Developer Productivity
Language
Abstract
The increasing popularity of Machine Learning (ML) also creates challenges for developers. The multitude of programming languages, libraries, and available resources allows them to easily build their own models or algorithms. However, ML models are tightly coupled to their data, which implies a development process different from that of other types of software. Software projects often rely on version control platforms such as GitHub, but these platforms have not yet been extended to support ML projects: there is poor support for data versioning and no link between ML and software artifacts. As a result, tracing artifacts and following model evolution can become challenging for developers. While some ML-specific platforms exist, they still require considerable manual specification of ML artifacts and the links between them. In this work, we propose a framework for automatically identifying and tracing links between data, code, and ML models through Mining Software Repositories (MSR) techniques. Our tool combines static code analysis with mining of commit data to identify ML, code, and data artifacts, reconstruct the links between them, and retrieve the commits that affect each end of a link. The objective is to increase productivity and developers’ awareness of their project through the recovered traceability.
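
To make the pipeline described in the abstract more concrete, the following is a minimal sketch of the general idea only, not the authors' tool. It assumes Python source files, uses the standard `ast` module for the static-analysis step, and stands in for commit mining with a plain list of commit records. The library and extension lists, the function names (`extract_artifacts`, `build_links`, `commits_touching`), and the example data are all hypothetical and chosen for illustration.

```python
import ast

# Assumed, illustrative lists; the abstract does not specify any of these.
ML_LIBRARIES = {"sklearn", "torch", "tensorflow", "keras", "xgboost"}
DATA_EXTENSIONS = (".csv", ".json", ".parquet", ".npy")
MODEL_EXTENSIONS = (".pkl", ".joblib", ".h5", ".pt")


def extract_artifacts(source: str) -> dict:
    """Statically scan Python source for ML imports and file-path literals."""
    found = {"ml_imports": set(), "data": set(), "models": set()}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                root = alias.name.split(".")[0]
                if root in ML_LIBRARIES:
                    found["ml_imports"].add(root)
        elif isinstance(node, ast.ImportFrom) and node.module:
            root = node.module.split(".")[0]
            if root in ML_LIBRARIES:
                found["ml_imports"].add(root)
        elif isinstance(node, ast.Constant) and isinstance(node.value, str):
            # Classify string literals by file extension (a crude heuristic).
            if node.value.endswith(DATA_EXTENSIONS):
                found["data"].add(node.value)
            elif node.value.endswith(MODEL_EXTENSIONS):
                found["models"].add(node.value)
    return found


def build_links(code_path: str, artifacts: dict) -> list:
    """Link a code file to each data and model file it references."""
    links = [(code_path, "data", p) for p in sorted(artifacts["data"])]
    links += [(code_path, "model", p) for p in sorted(artifacts["models"])]
    return links


def commits_touching(path: str, commits: list) -> list:
    """Return the ids of commits whose changed-file list contains `path`."""
    return [c["sha"] for c in commits if path in c["files"]]


# Hypothetical training script; it is only parsed, never executed.
EXAMPLE_SOURCE = '''
import pandas as pd
from sklearn.linear_model import LogisticRegression
import joblib

df = pd.read_csv("data/train.csv")
model = LogisticRegression().fit(df[["x"]], df["y"])
joblib.dump(model, "models/clf.joblib")
'''

# Hypothetical commit history, standing in for data mined from a repository.
EXAMPLE_COMMITS = [
    {"sha": "a1", "files": ["train.py", "data/train.csv"]},
    {"sha": "b2", "files": ["README.md"]},
    {"sha": "c3", "files": ["data/train.csv"]},
]

if __name__ == "__main__":
    artifacts = extract_artifacts(EXAMPLE_SOURCE)
    for code, kind, target in build_links("train.py", artifacts):
        # Report the commits affecting each end of the recovered link.
        print(kind, code, commits_touching(code, EXAMPLE_COMMITS),
              target, commits_touching(target, EXAMPLE_COMMITS))
```

In a real repository the commit records would come from the version control history rather than a hard-coded list, and extension matching would miss paths that are built dynamically at run time. Both are simplifications made for this sketch.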