Earlier this month, a wireless outage swept across the eastern United States, affecting customers of all four major wireless carriers alike. At the heart of the outage was a technical issue at Internet backbone company Level 3 Communications.
While the blackout was resolved fairly quickly, outages like this one carry hefty financial and reputational costs for the carriers.
But according to Continuity Software CEO Gil Hecht, incidents like these can be avoided. The key, Hecht said, is redundancy.
To build a fail-proof system, Hecht said, it is critical to have redundancy in place across every layer of the infrastructure. For example, he said, if a company has a fiber optic cable, there needs to be a second, backup cable that runs along a different route from the original. All network systems must be redundant as well, he said. If redundancy is in place in both of those areas, then companies should be protected from any kind of disruption.
According to Hecht, though, disruptions keep occurring not because companies are unaware of the need for redundancy, but because some part of the network chain is not aware that the redundancy exists.
“It can either happen because there is a router or a switch somewhere that doesn’t know how to direct traffic through the alternate route, or it happens because you have two identical systems that have identical configurations, but in one the configuration drifts,” Hecht said. Drift can occur when a technician makes changes to one system but fails – for whatever reason – to mirror the changes in the other, he said.
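The drift Hecht describes can be illustrated with a minimal sketch that compares the settings of two systems that are supposed to be identical. The function name, the setting names and the values below are hypothetical, chosen only for illustration; real equipment would need its configuration exported and parsed first.

```python
ABSENT = "<absent>"  # marker for a setting that exists on only one system


def find_config_drift(primary: dict, standby: dict) -> dict:
    """Return every setting whose value differs between the two systems.

    The result maps each drifted setting name to a
    (primary_value, standby_value) pair.
    """
    drift = {}
    for key in sorted(primary.keys() | standby.keys()):
        primary_value = primary.get(key, ABSENT)
        standby_value = standby.get(key, ABSENT)
        if primary_value != standby_value:
            drift[key] = (primary_value, standby_value)
    return drift


# Hypothetical example: a technician raised the MTU and added a route
# on the primary system but never mirrored the change on the standby.
primary = {"mtu": "9000", "ospf_area": "0", "static_route": "10.0.0.0/8"}
standby = {"mtu": "1500", "ospf_area": "0"}

for setting, (on_primary, on_standby) in find_config_drift(primary, standby).items():
    print(f"{setting}: primary={on_primary} standby={on_standby}")
```

Run on a schedule, a comparison of this kind would flag the unmirrored change before a failover exposes it.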
Two additional factors are also at play in failures like the one at Level 3, Hecht said. First, redundancy is – by nature – something that is “very difficult” to test.
“To test redundancy you need to be willing to suffer down time,” Hecht said. “To test the alternate, you need to have primary path fail.”
Hecht noted that automation offers an alternative way to test, but said that is where the second aggravating factor, cost, comes in. Unlike failure testing, automation checks for redundancy by verifying elements of the system configuration rather than by forcing an outage. Most companies, though, tend to invest in the problem areas that are freshest in their memory, which means there will inevitably be gaps in coverage elsewhere, Hecht said.
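As a rough sketch of what such a configuration-based check might look like, the example below verifies that a backup fiber route shares no physical segment with the primary, without taking either path down. The segment identifiers and function names are hypothetical, and the sketch assumes each route is already recorded as a list of segment IDs in some inventory.

```python
def shared_segments(primary_route: list, backup_route: list) -> list:
    """Return the segment IDs that appear on both routes, sorted."""
    return sorted(set(primary_route) & set(backup_route))


def is_route_redundant(primary_route: list, backup_route: list) -> bool:
    """A backup counts as redundant only if it exists and shares no
    physical segment with the primary route."""
    return bool(backup_route) and not shared_segments(primary_route, backup_route)


# Hypothetical inventory: the backup was meant to follow a different
# path but reuses one conduit near the data center.
primary_route = ["conduit-A1", "conduit-A2", "conduit-A3"]
backup_route = ["conduit-B1", "conduit-A2", "conduit-B3"]

if not is_route_redundant(primary_route, backup_route):
    print("Shared segments:", shared_segments(primary_route, backup_route))
```

A check like this covers only what the inventory records, which is consistent with Hecht's point that automated coverage tends to have gaps.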
“Down time in data loss incidents will continue to happen for the foreseeable future because when humans are involved in building something they will make mistakes,” Hecht said. “The only way to avoid that is to introduce automation into testing, which you will only do where it makes financial sense. So by definition there will be holes.”
All that said, Hecht noted the country’s most critical infrastructure – especially in the financial, government and communications sectors – is fairly well protected.
“This failure was a relative success – very unpleasant, huge damage but ultimately it all came back online very quickly,” Hecht said. “I think this specific type of infrastructure failure is most likely not going to happen again in the short term because companies will develop systems to prevent it.”
When reached for comment, Level 3 Communications declined to provide details about its continuity plans or its reporting relationship with customers. The company did, however, say the October 4 outage was “unfortunate” and that it has “taken steps and put processes in place to keep these types of events from happening in the future.”
AT&T, Verizon, Sprint and T-Mobile declined to comment for this story.