Rice University researchers have created a strategy called ShareBackup, in which shared backup switches in data centers take over network traffic when a switch suffers a software or hardware failure. The changeover happens in less than a second, and the researchers hope it will remedy sluggish networks.
“A data network consists of servers and network switches,” says Eugene Ng, professor of computer science and electrical and computer engineering at Rice. “Switches move data packets to where they need to go. But things fail, especially in large-scale data centers with thousands of pieces of hardware.”
The typical response to a switch failure is to shunt the data flow to a different path. Because servers in a data center network are connected by multiple paths, traffic can be rerouted around the problem area.
However, sometimes the alternate route becomes congested and the network comes to a halt.
“Data centers aren’t the internet; they’re not about people surfing websites,” says Ng. “They’re about supporting data-intensive applications like data mining or machine learning. And a lot of these applications have stringent performance deadlines, so blindly rerouting traffic could be the wrong thing to do in a data center.”
The Rice team decided to place fast software and switches in planned locations, rather than install redundant switches throughout the entire system. This choice allows ShareBackup to handle the extra data load at faster speeds. After factoring in latency from control and hardware systems, ShareBackup records a failure-recovery time of 0.73 milliseconds.
“The reality is the fraction of devices that fail at any given time is very small, and most of these failures can be addressed by things like rebooting the device,” says Ng. “Sometimes the software gets screwed up and a simple power cycle will bring it back. These failures may also not last long.”
“These are the characteristics we’re trying to exploit. Because of that, we can get away with having very few devices back up a large number of devices,” Ng adds.
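Ng's point about a few devices backing up many can be illustrated with a rough back-of-the-envelope calculation. The Python sketch below is not taken from the ShareBackup paper: the function name, the group size of 48 switches, the 0.1 percent failure probability, and the assumption that switches fail independently are all hypothetical choices made for illustration only.

```python
from math import comb


def prob_backups_exhausted(n_switches: int, n_backups: int, p_fail: float) -> float:
    """Probability that more than n_backups of n_switches are down at once.

    Assumes (hypothetically) that each switch is down independently with
    probability p_fail, so the number of failed switches is binomial.
    """
    # Probability that the number of simultaneous failures is at most
    # n_backups, i.e. the shared backup pool can cover all of them.
    covered = sum(
        comb(n_switches, k) * p_fail**k * (1 - p_fail) ** (n_switches - k)
        for k in range(n_backups + 1)
    )
    return 1.0 - covered


if __name__ == "__main__":
    # Illustrative numbers only: 48 switches share a small backup pool and
    # each switch is down 0.1 percent of the time.
    for backups in range(4):
        risk = prob_backups_exhausted(48, backups, 0.001)
        print(f"{backups} shared backups: P(pool exhausted) = {risk:.3e}")
```

Under these assumed numbers, each additional shared backup sharply reduces the chance that more switches are down than the pool can cover, which is the intuition behind sharing a handful of backups rather than duplicating every switch.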
Ng and the rest of the Rice team believe ShareBackup can save both time and money in data centers by allowing problems to be analyzed and worked through while the network maintains full bandwidth.
The research paper, “Masking failures from application performance in data center networks with shareable backup,” will be presented in Budapest, Hungary, at the SIGCOMM ’18 conference.