학술논문

Design and Implementation of a Scalable HPC Monitoring System
Document Type
Conference
Source
2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International. :1721-1725 May, 2016
Subject
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Monitoring
Measurement
Laboratories
Hardware
Operating systems
Servers
Language
Abstract
Over the past decade, platforms at Los AlamosNational Laboratory (LANL) have experienced large increases in complexity and scale to reach computational targets. The changes to the compute platforms have presented new challenges to the production monitoring systems in which they must not only cope with larger volumes of monitoring data, but also must provide new capabilities for the management, distribution, and analysis of this data. This schema must support both real-time analysis for alerting on urgent issues, as well as analysis of historical data for understanding performance issues and trends in systembehavior. This paper presents the design of our proposed next-generation monitoring system, as well as implementation details for an initial deployment. This design takes the form of a multi-stage data processing pipeline, including a scalable cluster for data aggregation and early analysis, a message broker for distribution of this data to varied consumers, and an initial selection of consumer services for alerting and analysis. We will also present estimates of the capabilities and scale required to monitor two upcoming compute platforms at LANL.