Academic Paper

SQR: In-network Packet Loss Recovery from Link Failures for Highly Reliable Datacenter Networks
Document Type
Conference
Source
2019 IEEE 27th International Conference on Network Protocols (ICNP), pp. 1-12, Oct. 2019
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Packet loss
Delays
Optical switches
Hardware
Web search
Language
ISSN
2643-3303
Abstract
In datacenter networks, flows need to complete as quickly as possible because the flow completion time (FCT) directly impacts user experience, and thus revenue. Link failures can have a significant impact on short latency-sensitive flows because they increase their FCTs by several fold. Existing link failure management techniques cannot keep the FCTs low under link failures because they cannot completely eliminate packet loss during such failures. We observe that to completely mask the effect of packet loss and the resulting long recovery latency, the network has to be responsible for packet loss recovery instead of relying on end-to-end recovery. To this end, we propose Shared Queue Ring (SQR), an on-switch mechanism that completely eliminates packet loss during link failures by diverting the affected flows seamlessly to alternative paths. We implemented SQR on a Barefoot Tofino switch using the P4 programming language. Our evaluation on a hardware testbed shows that SQR can completely mask link failures and reduce tail FCT by up to 4 orders of magnitude for latency-sensitive workloads.