Few new technologies throughout history have captured the hearts, minds and imaginations of technology companies, pundits and investors as much as artificial intelligence, or AI, has this year. In fact, it’s become a recurring joke or meme in investing and Wall Street circles that any company with an AI solution is a candidate to be the next unicorn.
But what’s not funny is the sheer number of AI solutions and applications springing up across the internet. While ChatGPT might have been the first major disruptor, new generative AI solutions are seemingly introduced every day. What was a limited selection of online AI services has exploded with the introduction of numerous generative AI solutions with hip, clever names, including DALL-E, Bard, Claude and many more.
The AI explosion isn’t just limited to generative AI. Nearly every organization of a certain size is undoubtedly assessing, developing, training or already using an AI solution or two to improve their operations.
This increase in AI adoption has many considering it the next big disruptive technology capable of changing how we live and work. However, the impacts of AI will be felt outside our daily lives and workplace tasks. AI also has the potential to fundamentally change the traditional data center in ways many might not recognize or appreciate.
To understand the impact that AI will have on the data center, we first must understand that there are two kinds of AI workloads or applications: training and inference.
Training vs. inference
While it’s easy to lump all aspects and workloads of artificial intelligence applications into a single bucket called “AI” and think it’s all the same, that’s simply not true. The AI applications we see today are really the combination of two disparate AI workloads: training and inference.
When a new AI solution or algorithm is developed, it needs to be fed data and learn from it so it can later make effective inferences. This training process requires a significant amount of data to be ingested by the AI engine, and the more data it can process, the better the resulting model will perform at its intended use.
Once the AI solution has been trained and tested, it can be deployed. In this part of the AI solution’s life cycle, it’s used for inference: applying what it learned during training to new inputs and generating information, predictions and responses as required.
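To make the distinction concrete, here is a minimal sketch of the two phases using PyTorch. The model, data and file names are illustrative placeholders rather than any specific production pipeline; the point is simply that training loops over large volumes of data and heavy computation, while inference loads the finished weights and answers individual requests.

```python
# Illustrative sketch only: a toy model standing in for a real AI workload.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# --- Training phase: data-hungry, GPU-intensive, runs for a long time ---
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(inputs, labels):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()          # in practice, gradients are computed across many GPUs
    optimizer.step()
    return loss.item()

# One representative step on random stand-in data; real training repeats this
# over enormous datasets.
train_step(torch.randn(32, 128), torch.randint(0, 10, (32,)))

# The finished weights are exported for deployment.
torch.save(model.state_dict(), "trained_model.pt")

# --- Inference phase: the deployed model answers requests, far less demanding ---
model.load_state_dict(torch.load("trained_model.pt"))
model.eval()
with torch.no_grad():
    prediction = model(torch.randn(1, 128)).argmax(dim=1)
```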
It’s important to separate these two processes or aspects of AI and look at them independently because they both have very different requirements.
The learning process required to create a usable AI solution is incredibly data rich. To make it go as quickly as possible, technologists pack in as many GPUs as they can and connect them as closely to the source of the data as possible to ensure quick ingestion. This ramps up rack density, increases the amount of power necessary and drives up the amount of heat generated within the rack, all of which are considerations we’ll explore more closely later in this article.
On the other side, the actual use of the deployed inference engine is much less demanding. In fact, inference AI is something our customers have been using for a long time, and while the rack densities are higher, they’re not unreasonable. While the training and learning aspect of the AI engine might require rack densities of 80kW to 120kW, these inference racks may only require 10kW, though that could grow to 20kW or even slightly higher. That is still significantly lower than the requirements for training and can be accommodated in most current-generation data centers.
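As a rough illustration of how those density figures can arise, here is some simple back-of-envelope arithmetic. The server counts and per-server power draws are assumptions chosen only to show the shape of the math, not measurements from any particular deployment.

```python
# Rough, illustrative arithmetic only. The figures below are assumptions used
# to show how rack densities in the ranges mentioned above can arise.

TRAINING_SERVER_KW = 10.0   # assumed draw of one multi-GPU training server
INFERENCE_SERVER_KW = 1.0   # assumed draw of one lighter inference server

def rack_power_kw(servers: int, kw_per_server: float) -> float:
    """Total rack power as a simple sum of server draws."""
    return servers * kw_per_server

# A training rack packed with 8-12 GPU servers lands in the 80-120kW range.
print(rack_power_kw(8, TRAINING_SERVER_KW))    # 80.0
print(rack_power_kw(12, TRAINING_SERVER_KW))   # 120.0

# An inference rack with 10-20 lighter servers stays in the 10-20kW range.
print(rack_power_kw(10, INFERENCE_SERVER_KW))  # 10.0
print(rack_power_kw(20, INFERENCE_SERVER_KW))  # 20.0
```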
Now that we’ve explained the difference between these two aspects of AI, we can explore how each impacts the data center differently — and what changes might be necessary within the data center to support them.
More power, less redundancy?
When you think about typical data centers, you probably think about buildings designed to be durable with redundant power, cooling and connectivity to ensure the equipment inside never goes down. But AI training is a bit different.
The things truly necessary for training an AI model are power, cooling and connectivity. It’s critical that power be readily available in large quantities and that the connectivity between racks be as rapid as possible. Redundancy, on the other hand, isn’t really a consideration or requirement, and for good reason.
The process of training AI, especially large AI models, is time-intensive. According to news reports, GPT-3 was trained on 1,024 GPUs over 34 days.
To speed this process up, AI training leverages as many GPUs in as tight a space as possible. But it still runs for a long time, so those managing the process add checkpoints: the training state is saved at regular intervals.
Also, these incredibly dense racks aren’t delivering services or applications to anyone outside the data center. They’re close together and connected with the highest-speed connectivity available to enable rapid training.
Since there are frequent checkpoints where the progress and work are saved, and since the racks aren’t powering applications that users are relying on, it’s not the end of the world if the data center loses power or otherwise goes down temporarily. It’s also not the end of the world if the data connections to the outside world are temporarily lost or severed.
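A sketch of the checkpointing pattern described above, again in PyTorch: training state is written out at a regular interval so that a power or connectivity loss only costs the work done since the last save, after which the job can resume from the checkpoint. The interval, file path and helper functions are illustrative assumptions, not a specific framework’s API.

```python
# Illustrative checkpointing pattern for a long-running training job.
import time
import torch

CHECKPOINT_EVERY_SECONDS = 30 * 60   # assumed interval ("X") between saves
CHECKPOINT_PATH = "checkpoint.pt"    # assumed location for saved state

def maybe_checkpoint(model, optimizer, step, last_save_time):
    """Save model and optimizer state if the interval has elapsed."""
    now = time.time()
    if now - last_save_time >= CHECKPOINT_EVERY_SECONDS:
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            CHECKPOINT_PATH,
        )
        return now
    return last_save_time

def resume(model, optimizer):
    """After an outage, reload the last checkpoint and continue from there."""
    state = torch.load(CHECKPOINT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```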
Redundancy in the data centers where AI models are being trained simply is not a priority. What is a priority is the availability of immense power, along with the structural and cooling systems required by these incredibly dense racks.
As these racks get denser — have more GPUs stuffed into them and more cabling run to them — they become incredibly heavy. As the amount of power they require increases, more energy needs to be brought into the data center. And as they generate incredible amounts of heat and are grouped closer together, liquid cooling alternatives to traditional forced-air cooling need to be provided.
So, how do these requirements impact what the data center of the future will look like?
Two in one
As we’ve discussed, the data centers responsible for AI training today are sparsely populated with a few incredibly dense racks that utilize a significant amount of energy and expel an immense amount of heat.
To ensure there is no wasted space, these data centers may be smaller in footprint. They should come ready to offer a liquid cooling solution that can meet the intense requirements of 80kW to 120kW racks. Finally, they can be built with less redundancy in their power and connectivity infrastructure, since it’s simply not needed for the task being done.
On the opposite side of the AI equation are the inference workloads. These need to be accessible to users outside the data center and can run on less dense racks in the 10kW to 20kW range. These data centers still require redundancy in their power, cooling and connectivity, yet they simply do not need the increased power and liquid cooling that AI training data centers require.
This means we can effectively build two different kinds of data centers — smaller data centers designed to meet the unique power and cooling requirements of AI training, and larger, more traditional data centers designed to run the AI model and give users access to it.
But how far away should these two data centers be from each other?
AI models aren’t static. They’re constantly being updated and enhanced. Training is done frequently to ensure the AI solution being offered is as up-to-date as possible. So newly trained models frequently need to be moved to the data centers where they’ll be deployed for use. In this environment, it makes sense to keep the two disparate kinds of data centers close, possibly on the same campus or, preferably, in the same building.
This is why I anticipate that the data centers of the future might actually be a single building with two distinct types of data center environments — a smaller, denser data center with less redundancy for AI training, and a larger, more traditional data center for AI inference.
Regardless of whether this vision of a hybrid AI data center comes to pass, there is one certainty — the explosion of AI and its adoption will have a profound impact on our world and society. It will also have an immense impact on the design, construction and operations of the modern data center.
To learn more about Vantage’s hyperscale data center campuses around the globe, visit www.vantage-dc.com.
Chris Yetman
As chief operating officer at Vantage Data Centers, Chris Yetman leads operations, security, construction, design, engineering and IT.
With more than 30 years of experience, Yetman approaches strategy and decision-making from the customers’ viewpoint. In partnership with Vantage’s leadership team, he develops the company’s strategy and evangelizes it throughout the organization. Yetman has been instrumental in accelerating Vantage’s growth from a regional data center provider in the United States to a fast-growing global operator.
Yetman received his Bachelor of Science in computer engineering from Northeastern University.