Academic Journal Article

SciNet: Codesign of Resource Management in Cloud Computing Environments
Document Type
Periodical
Source
IEEE Transactions on Computers, 72(12):3590-3602, Dec. 2023
Subject
Computing and Processing
Cloud computing
Optimization
Quality of service
Computational modeling
Artificial intelligence
Resource management
Costs
Concurrent design
provisioning
deployment
scheduling
imitation learning
deep learning
Language
English
ISSN
0018-9340
1557-9956
2326-3814
Abstract
The rise of distributed cloud computing technologies has been pivotal for the large-scale adoption of Artificial Intelligence (AI) based applications for high-fidelity and scalable service delivery. Systematic resource management is central to maintaining optimal Quality of Service (QoS) in cloud platforms and is divided into three fundamental types: resource provisioning, AI model deployment and workload placement. To exploit the synergy among these decision types, it becomes imperative to concurrently design (co-design) the provisioning, deployment and placement decisions for optimal QoS. As users and cloud service providers shift to non-stationary AI-based workloads, frequent decision making imposes severe time constraints on resource management models. Existing AI-based solutions often optimize each decision type independently and tend to ignore the dependencies across system performance aspects such as energy consumption and CPU utilization, causing them to perform poorly in large-scale cloud systems. To address this, we propose a novel method, called SciNet, that leverages a co-simulated digital twin of the infrastructure to capture inter-metric dependencies and accurately estimate QoS scores. To avoid expensive simulation overheads at test time, SciNet trains a neural-network-based imitation learner that mimics an oracle making optimal decisions based on co-simulated QoS estimates. Offline model training and online decision making with the imitation learner enable SciNet to make optimal decisions while remaining time-efficient. Experiments with real-life AI-based benchmark applications on a public cloud testbed show that SciNet achieves up to 48% lower execution cost, 79% higher inference accuracy, 71% lower energy consumption and 56% lower response times compared with current state-of-the-art methods.
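Editor's illustration: the sketch below is not the authors' implementation but a minimal, hedged rendering of the workflow the abstract describes, in which an oracle scores candidate co-design decisions (provisioning, deployment, placement) with a co-simulated digital twin, and a neural network is trained offline to imitate the oracle so that a single forward pass replaces expensive co-simulation at decision time. All names here (the cosim object, estimate_qos, candidate_decisions, ImitationPolicy) are hypothetical assumptions, and PyTorch is assumed as the framework.

    import torch
    import torch.nn as nn

    class ImitationPolicy(nn.Module):
        """Maps a cloud-system state to scores over candidate co-design decisions."""
        def __init__(self, state_dim: int, num_decisions: int, hidden: int = 128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, num_decisions),
            )

        def forward(self, state: torch.Tensor) -> torch.Tensor:
            return self.net(state)

    def oracle_decision(cosim, state, candidate_decisions):
        """Oracle: score each candidate (provisioning, deployment, placement) tuple
        with the co-simulated digital twin and pick the one with the best QoS estimate.
        `cosim` and `estimate_qos` are hypothetical stand-ins for a co-simulator API."""
        qos_scores = [cosim.estimate_qos(state, d) for d in candidate_decisions]
        return max(range(len(candidate_decisions)), key=lambda i: qos_scores[i])

    def train_offline(policy, cosim, states, candidate_decisions, epochs=10):
        """Offline phase: supervised (imitation) training of the policy on oracle labels."""
        opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        labels = torch.tensor([oracle_decision(cosim, s, candidate_decisions) for s in states])
        inputs = torch.stack(states)
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(policy(inputs), labels)
            loss.backward()
            opt.step()

    def decide_online(policy, state):
        """Online phase: one forward pass chooses a decision, with no co-simulation."""
        with torch.no_grad():
            return int(policy(state.unsqueeze(0)).argmax())

Under these assumptions, the online path (decide_online) never invokes the co-simulator, which is the source of the time efficiency the abstract attributes to imitation learning.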