Academic Paper

Single Event Effects Assessment of UltraScale+ MPSoC Systems Under Atmospheric Radiation
Document Type
Periodical
Source
IEEE Transactions on Reliability (IEEE Trans. Rel.), 73(1):771-783, Mar. 2024
Subject
Computing and Processing
General Topics for Engineers
Neutrons
Life estimation
Computer crashes
Benchmark testing
Radiation effects
Table lookup
Program processors
Multiprocessor system-on-chip (MPSoC) terrestrial applications
neutron radiation testing
single event effects (SEEs)
Language
ISSN
0018-9529
1558-1721
Abstract
The AMD UltraScale+ XCZU9EG, a multiprocessor system-on-chip (MPSoC) with integrated programmable logic (PL), is vulnerable to the effects of atmospheric radiation due to its large SRAM count. This article explores the effectiveness of the MPSoC's embedded soft-error mitigation mechanisms through accelerated atmospheric-like neutron radiation testing and dependability analysis. We test the device on a broad range of workloads, such as multithreaded software for pose estimation and weather prediction and a software/hardware codesign image classification application running on the AMD deep-learning processing unit (DPU). We found that for a one-node MPSoC system in New York City at 40 k feet (e.g., avionics), software applications demonstrate a mean time to failure (MTTF) of over 121 months, evidencing effective upset recovery. However, specific workloads, such as the DPU, displayed an MTTF of 4 months, which is attributed to the high failure rate of its PL accelerator. Yet, we show the DPU's MTTF can be extended to 87 months with no extra overhead by ignoring the failure rate of tolerable errors since these do not affect the DPU results.
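The relationship between the two DPU figures quoted in the abstract (4 months versus 87 months) can be illustrated with a short back-of-the-envelope sketch. It assumes a constant failure rate, so that MTTF is simply the reciprocal of the rate; the abstract does not state the failure model, so this is an assumption made for illustration only. The function and variable names are hypothetical, and the resulting fraction is arithmetic derived from the abstract's rounded numbers, not a result reported by the authors.

```python
# Sketch only: assumes a constant (exponential) failure rate, MTTF = 1 / rate.
# The 4-month and 87-month values come from the abstract; everything else
# here is illustrative and not taken from the paper.

def failure_rate_per_month(mttf_months: float) -> float:
    """Return the constant failure rate implied by an MTTF given in months."""
    if mttf_months <= 0:
        raise ValueError("MTTF must be positive")
    return 1.0 / mttf_months

MTTF_ALL_ERRORS = 4.0      # DPU MTTF counting every observed error (months)
MTTF_CRITICAL_ONLY = 87.0  # DPU MTTF when tolerable errors are ignored (months)

rate_all = failure_rate_per_month(MTTF_ALL_ERRORS)
rate_critical = failure_rate_per_month(MTTF_CRITICAL_ONLY)

# Under the constant-rate assumption, independent error classes add, so the
# tolerable-error rate is the difference between the two rates.
rate_tolerable = rate_all - rate_critical
tolerable_fraction = rate_tolerable / rate_all

print(f"Total failure rate:     {rate_all:.4f} per month")
print(f"Critical failure rate:  {rate_critical:.4f} per month")
print(f"Implied tolerable share: {tolerable_fraction:.1%}")
```

Under these assumptions, roughly 95% of the DPU's observed failure rate would be attributable to tolerable errors, which is consistent with the abstract's claim that ignoring them extends the MTTF without extra overhead.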