If Your Data Center Isn’t Testing Redundancy, It Isn’t Redundant
What’s the best way to ensure business continuity in a data center? Redundancy.
Redundancy, at its core, is simple: It’s a design concept in which, for any component, you build more than is technically necessary so that you have backup in case of a failure. Chris Yetman, our chief operating officer, explains it this way: “You’re building out more capacity than you need so that you can suffer any failure and still continue to operate.”
Redundancy can take the form of n + 1, where n is the number of components necessary to run the system and 1 is the number of redundant components. Other common schemes are n + 2 and 2n, and so on. The intensity of your approach depends both on the importance of the component you’re replicating and on the importance of continuity to your business.
For example, a software development business might opt for lower redundancy levels, because downtime doesn’t have a catastrophic effect on the business. But for a SaaS company, an ecommerce company, or a storage provider, even a few minutes of downtime are a disaster—to their service and to their reputation—so redundancy levels will be higher.
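To make the n + 1, n + 2, and 2n notation concrete, here is a minimal Python sketch of how many units each scheme installs. The function name installed_units and the scheme strings are illustrative assumptions, not part of any Vantage tooling.

```python
def installed_units(n: int, scheme: str) -> int:
    """Return the total units to install when n units are required to run."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Extra (redundant) units each scheme adds on top of the n required.
    extras = {"n+1": 1, "n+2": 2, "2n": n}
    key = scheme.replace(" ", "").lower()
    if key not in extras:
        raise ValueError(f"unknown scheme: {scheme}")
    return n + extras[key]


# Example: a system that needs two generators to carry the load.
for scheme in ("n + 1", "n + 2", "2n"):
    print(scheme, "->", installed_units(2, scheme))
```

With n = 2, the schemes call for three, four, and four units respectively. The gap between n + 2 and 2n widens as n grows, which is one reason the stricter schemes cost more.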
Untested redundancy is not redundant.
Here’s the thing about redundancy: Building it into your design is only half the process. The other half is testing—regular, comprehensive testing—not only of your redundant components but of your entire system, to ensure that everything works together as planned.
Unfortunately, data centers don’t test redundancy as often as they should. “Many times in our industry,” Chris says, “people build beautifully well-engineered redundant systems. But once they’re running, they’re afraid to mess with them, because the downtime associated with making a mistake is huge.”
That fear is understandable—after all, the last thing a data center wants to do is tell its customers that downtime was a result of an internal mistake—but it’s also a problem. “If you don’t test redundancy,” Chris says, “when you need it, it’s not going to be there.” That’s because over time, any electrical or mechanical system inevitably wears down.
Chris points to a common situation. “Say the power goes out and everything works. But actually, what’s keeping everything working is that one of your redundant components has kicked in. If you don’t know that, you just went from redundancy to non-redundancy.” In other words, your center might still be functional, but your redundancy isn’t—and the next time you have an incident, you’re going down.
On the other hand, if you’re actively testing redundancy and you have a failure, you benefit in two ways. First, you’re still running, because you’ve got your n components in place; and second, now you understand which redundant component didn’t work, and you can fix it. That’s why we’re always testing redundancy at Vantage—and we do it aggressively.
“Finding a problem is a good thing,” Chris says. “That means, later on, it’s not going to bite us in the butt.”
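The scenario Chris describes can be sketched as a simple check on how many healthy components remain relative to n. This is an illustrative model only; the function names and status strings are assumptions made for the example.

```python
def redundancy_margin(required: int, healthy: int) -> int:
    """Healthy components beyond the n required; negative means an outage."""
    return healthy - required


def status(required: int, healthy: int) -> str:
    margin = redundancy_margin(required, healthy)
    if margin < 0:
        return "down"
    if margin == 0:
        # Still serving load, but the next failure causes an outage.
        return "running, not redundant"
    return "running, redundant"


# n + 1 design with n = 2: three units installed.
print(status(required=2, healthy=3))  # running, redundant
print(status(required=2, healthy=2))  # running, not redundant
print(status(required=2, healthy=1))  # down
```

The middle case is the dangerous one: from the outside it looks identical to the first, and only testing reveals the difference.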
The problem with 5 nines.
Some data centers promise 99.999% uptime, also known as “5 nines.” That translates to 5.26 minutes of downtime per year.
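The 5.26-minute figure follows directly from the percentage. A short Python check, assuming a 365-day year (the helper name is hypothetical):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


print(f"{allowed_downtime_minutes(99.999):.2f}")  # 5.26
```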
“That sounds really cool, right?” Chris says. “But when you’re running a 6-megawatt load with 20,000 servers, or even a 3-megawatt load with 10,000, do you know what happens when you unplug all those servers at once? Your life gets very miserable very quickly.”
That’s because it takes many hours, and many people, to get the servers booted back up and working appropriately. “With a disruption like that, it never comes back cleanly,” Chris says. “So you have dozens of engineers and admins crawling through thousands of servers, verifying whether or not the services are online, figuring out which ones aren’t, trying to understand why, and then correcting all the failures.”
Far better than 5 nines? Never having any downtime at all. And at Vantage, that’s what we offer customers who opt for high levels of redundancy. For example, in a 2n redundant configuration, with an A-side system and a B-side system and power distribution from both sides, there’s a 100% uptime guarantee. “What we’re saying is that you’re not going to go down—ever,” Chris says.
How can we be so confident? Because of all the maintenance and the testing that we do. And if the system were to fail (anything, after all, is possible), the SLA kicks in, and we refund some of your money. “That’s a commitment that shows that we’re willing to accept pain for not having done the job correctly,” Chris says.
Why failures happen—and how to make sure they don’t.
Sometimes our customers call us and tell us that another data center they’re in has had a power blip, or maybe it’s even gone down in a storm. Most of the time, those situations are the result of a failure to test redundancy.
“Maybe they had three or four generators, and they only needed two—but they haven’t tested them in four or five years,” Chris says. “They haven’t done monthly runs, they haven’t checked to see how well they start, and they haven’t transferred load. Maybe all the generators did start, but the transfer switch that moves a load from the street over to the generator failed. In that case, you have beautifully running generators but no way to move the power from the generator to the building—because the ATS, or the automatic transfer switch, failed. And you didn’t know it failed, because you didn’t test it.”
At Vantage, we test whether or not that ATS is working by turning on our generators and moving power from the street to the generators using that ATS. If it fails, we still have the UPS systems for our customers, which hold up the load. And now we know something crucial: That automatic transfer switch is broken. “That is excellent. It’s fantastic,” Chris says. “Because now we know.”
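The logic of that test can be sketched as a small simulation. This is a simplified model under stated assumptions, not a description of Vantage's actual procedure or tooling: each piece of equipment is reduced to a working or not-working flag, and every name here is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TransferReport:
    load_held: bool
    faults: list = field(default_factory=list)


def transfer_test(generators_ok: bool, ats_ok: bool, ups_ok: bool) -> TransferReport:
    """Simulate moving load from utility power to generators through the ATS."""
    faults = []
    if not generators_ok:
        faults.append("generator failed to start")
    if not ats_ok:
        faults.append("ATS failed to transfer load")
    # If the whole transfer path works, the generators carry the load.
    on_generator = generators_ok and ats_ok
    # Otherwise the UPS is the fallback that holds the load during the test.
    load_held = on_generator or ups_ok
    return TransferReport(load_held=load_held, faults=faults)


# Generators start, but the ATS is broken; the UPS keeps customers online.
report = transfer_test(generators_ok=True, ats_ok=False, ups_ok=True)
print(report.load_held, report.faults)
```

In this run the load stays up and the fault list names the ATS, which is the outcome the test is designed to produce: the failure is found while a fallback is still available.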
Customization down to the rack.
And at Vantage, we want our customers to get the level of redundancy that is appropriate for their needs—so we’ve configured our data centers to allow for a granular level of customization.
One way we look at it is through the lens of concurrent maintainability. That’s the idea that you can maintain a piece of equipment without taking down service. By default, our redundancy is built to be concurrently maintainable, but that requires purchasing extra equipment, and that equipment also has to be maintained. Both of those things raise the cost.
If a customer doesn’t need that level of redundancy, we can leave it out of a build. That lowers our costs, and we can in turn offer a lower rate to the customer.
Most of the time, we can reduce redundancy in electrical equipment but not in, for example, cooling, because the cooling load has to be maintained at a certain level for the system to function. A typical scenario is that a customer will opt out of having two UPSs or a redundant power feed.
As COO, Chris has over 18 years of operations, engineering and IT experience in the Internet infrastructure industry. Chris is responsible for leading operations, security, network and IT for Vantage. He most recently served as SVP, Process and Technology at Integra. Previously, Chris was VP of AWS Infrastructure Operations at Amazon, where he had worldwide responsibility for operations and network for Amazon’s data centers. Chris also served as SVP of Operations at Level 3 Communications, SVP of Operations at Elevation Data Centers and VP of Operations Architecture at Genuity.
Chris graduated from Northeastern University with a Bachelor of Science in Computer Engineering.