Academic Paper

Gem5-MARVEL: Microarchitecture-Level Resilience Analysis of Heterogeneous SoC Architectures
Document Type
Conference
Source
2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 543-559, Mar. 2024
Subject
Computing and Processing
Microarchitecture
Systems architecture
Computer architecture
Heterogeneous networks
Hardware
Libraries
System-on-chip
Reliability
CPUs
accelerators
heterogeneous architectures
transient faults
permanent faults
silent data corruptions
microarchitecture-level fault injection
ISSN
2378-203X
Abstract
In this paper, we present gem5-MARVEL, the first consolidated microarchitecture-level fault injection infrastructure for heterogeneous System-on-Chip architectures comprising CPUs of all major Instruction Set Architectures (ISAs) and different types of domain-specific accelerators. The proposed framework is based on a modular design that facilitates flexible fault injection scenarios that correspond to different fault models and system configurations. gem5-MARVEL includes a set of libraries for the automation of fault injection and the analysis of the effects of hardware faults at full system execution. We evaluate the proposed framework on several 64-bit CPU ISAs: x86, Arm, and RISC-V, as well as on different designs of domain-specific accelerators. The case studies we present unveil important insights and demonstrate the effectiveness of the proposed infrastructure in the analysis of the impact of faults on different types of heterogeneous computing systems. gem5-MARVEL facilitates broad design space exploration for entire heterogeneous computing systems at the microarchitecture level, where resilience under realistic fault scenarios can be simultaneously analyzed with performance (the typical use of microarchitectural simulators).
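The abstract names two fault models (transient and permanent) and one outcome class (silent data corruption). The Python sketch below is only a toy illustration of those ideas and is not gem5-MARVEL's API or the authors' method. Every name in it (`flip_bit`, `stuck_at`, `workload`, `classify`, `campaign`) is hypothetical, the "register file" is a plain list of integers, and the workload is invented so that only the low byte of each word affects the result. A real microarchitecture-level campaign would also track crashes and timeouts, which this sketch omits.

```python
import random


def flip_bit(value: int, bit: int, width: int = 64) -> int:
    """Transient fault model: invert a single bit of a width-bit word."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def stuck_at(value: int, bit: int, stuck_value: int, width: int = 64) -> int:
    """Permanent fault model: force a single bit to 0 or 1."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = 1 << bit
    forced = (value | mask) if stuck_value else (value & ~mask)
    return forced & ((1 << width) - 1)


def workload(regs: list) -> int:
    """Hypothetical workload: only the low byte of each word is consumed."""
    return sum(r & 0xFF for r in regs)


def classify(golden: int, faulty: int) -> str:
    """Compare a faulty output against the fault-free (golden) output."""
    return "masked" if golden == faulty else "sdc"


def campaign(regs: list, runs: int, seed: int = 0, width: int = 64) -> dict:
    """Inject one random transient bit flip per run and tally the outcomes."""
    rng = random.Random(seed)
    golden = workload(regs)
    counts = {"masked": 0, "sdc": 0}
    for _ in range(runs):
        faulty = list(regs)                  # fresh copy for every run
        idx = rng.randrange(len(faulty))     # which word is hit
        bit = rng.randrange(width)           # which bit is flipped
        faulty[idx] = flip_bit(faulty[idx], bit, width)
        counts[classify(golden, workload(faulty))] += 1
    return counts


if __name__ == "__main__":
    print(campaign([0x11, 0x22, 0x33], runs=1000))
```

In this toy setup a flip in bits 8 to 63 is always masked, because the workload never reads those bits, while a flip in bits 0 to 7 always changes the output and is counted as a silent data corruption. That split only illustrates why the same hardware fault can be harmless or harmful depending on how the software uses the affected state.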