Everything you need to know about the NVIDIA L40S GPU
Learn more about the NVIDIA L40S, a versatile GPU that is designed to power a wide variety of applications, and check out NVIDIA L40S vs NVIDIA H100...
A global GPU shortage and rogue compute costs can threaten to sink even the best AI project’s go-to-market plans. How can AI teams navigate around the GPU demand storm, and deploy models for real-time inference at scale?
Demand for GPUs has accelerated drastically due to the AI boom. Despite the frenzied progress in AI capabilities, model training and optimisation, there simply isn’t enough infrastructure to go around. In 2023, AI product deployment globally slowed to a dead calm because the GPUs required to run the most capable AI products are so expensive and inaccessible.
The global chip shortage has even stalled market leader OpenAI, which was recently forced to pause new sign-ups for its paid ChatGPT Plus service. From November 15 to December 13, 2023, new paying users were unable to join because OpenAI had exceeded its GPU infrastructure capacity.
Meanwhile, the coveted NVIDIA H100 GPUs (today’s fastest, most powerful commercial GPUs for AI workloads) are sold out, with pre-orders not shipping until Q1 2024.
So, just how are AI teams supposed to deploy multimodal models and inference at scale, if companies like Google, OpenAI and Microsoft are facing the same bottlenecks? In this post, we’ll talk about the strategies we’ve learned with customers to scale AI workloads atop Kubernetes, go multicloud with GPU resources, and alleviate some pain points when AI teams feel the GPU infrastructure squeeze.
There are a few key techniques we’ve discovered alongside our customers as we’ve built the Ori GPU Cloud offering. We’ll share what we’ve learned, as long as you don’t mind a few sailing analogies.
Working with our AI customers who build and deploy models at scale feels like navigating a storm at sea. Seasoned sailors never want to challenge one head on, but it’s important to know how to navigate extreme weather regardless. Kubernetes and Helm charts become our navigation tools, shipping AI apps to managed Kubernetes environments and solving the complexity of shifting clouds. Meanwhile, providing a GPU cloud infrastructure ensures our clients don’t wait for hours for their engines to spin up.
Much of today’s perceived GPU shortage is the result of improper utilisation and overspending on GPUs that aren’t matched to the specific requirements of large-scale AI applications. While low-latency workloads certainly benefit from an H100, chip manufacturers and the media have created hype suggesting that AI developers must have only the most powerful GPUs. That simply isn’t the case. Even A100s are overboard for most uses. AI teams can often rely on older V100s, for instance, to speed up model inference pipelines while drastically reducing cost and ensuring high availability. It all depends on what infrastructure you need, tailored to the job.
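To make the "tailored to the job" idea concrete, here is a minimal sketch of right-sizing a GPU for inference by memory footprint alone. The catalogue, the hourly prices, the 20% overhead factor and the function names are all illustrative assumptions, not Ori pricing or an official sizing method. It also ignores latency and throughput, which is exactly where a faster card like the H100 can still earn its place.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GpuOption:
    name: str
    memory_gb: int
    hourly_usd: float  # hypothetical price, for illustration only


# Memory sizes correspond to commonly sold variants; prices are made up.
CATALOGUE: List[GpuOption] = [
    GpuOption("V100", 32, 1.00),
    GpuOption("A100", 80, 3.00),
    GpuOption("H100", 80, 5.00),
]


def estimate_memory_gb(params_billions: float,
                       bytes_per_param: int = 2,
                       overhead: float = 1.2) -> float:
    """Rough inference footprint: weights (fp16 by default) plus an
    assumed 20% allowance for activations and caches."""
    return params_billions * bytes_per_param * overhead


def cheapest_fit(required_gb: float,
                 catalogue: List[GpuOption] = CATALOGUE) -> Optional[GpuOption]:
    """Return the cheapest single GPU with enough memory, or None."""
    candidates = [gpu for gpu in catalogue if gpu.memory_gb >= required_gb]
    return min(candidates, key=lambda gpu: gpu.hourly_usd, default=None)


if __name__ == "__main__":
    for size in (7, 30, 70):
        need = estimate_memory_gb(size)
        choice = cheapest_fit(need)
        label = choice.name if choice else "no single GPU fits; shard the model"
        print(f"{size}B params, about {need:.0f} GB: {label}")
```

Under these assumptions a 7B-parameter model in fp16 fits comfortably on a 32 GB V100, so paying for an H100 only makes sense if the latency target demands it.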
While single GPU cloud providers claim to host a wide selection of GPUs, some rentals are incredibly expensive and don’t offer much flexibility. A truly tailored GPU infrastructure starts to take shape when companies can lash together their compute resources.
In some cases, startups with cloud credits from the major public clouds can’t use them to access more powerful GPUs!
This is where multicloud capabilities are invaluable, and one of the reasons why customers have chosen Ori over others: we’ve been doing multicloud longer than most of the specialised GPU clouds.
Enterprises often find themselves traversing multicloud environments. While these strategies offer flexibility and redundancy, they can introduce additional layers of complexity that stall out AI teams who don’t want to spend their time focused on infrastructure. Coordinating AI workloads seamlessly across different cloud providers demands careful consideration of compatibility, data transfer, and service interoperability.
Consider a scenario where an organisation utilises generative AI models for creative design processes. The flexibility of multicloud deployment ensures that these AI workloads can seamlessly leverage resources from different cloud providers, optimising costs and ensuring consistent performance.
Here, Kubernetes shines as a unifying force. Its agnostic nature allows for consistent workload management, irrespective of the underlying infrastructure. By leveraging Kubernetes, enterprises can deploy AI models across diverse cloud environments without compromising operational efficiency (more on that in the next section).
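As a rough illustration of that cloud-agnostic quality, the sketch below builds one Deployment manifest that could be applied unchanged to any conformant cluster with GPU nodes. The helper name, the workload name and the image URL are hypothetical. The `nvidia.com/gpu` resource is only schedulable where the NVIDIA device plugin (or an equivalent) is installed, which is an assumption about the target clusters.

```python
import json
from typing import Any, Dict


def gpu_deployment(name: str, image: str,
                   replicas: int = 1, gpus_per_pod: int = 1) -> Dict[str, Any]:
    """Build a provider-neutral Kubernetes Deployment requesting GPUs."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        # Extended resource exposed by the NVIDIA device plugin.
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = gpu_deployment(
        "image-generator",                              # hypothetical workload
        "registry.example.com/image-generator:latest",  # placeholder image
        replicas=2,
    )
    # kubectl accepts JSON as well as YAML, e.g. `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

Nothing in the manifest names a cloud provider, so the same output can be piped to whichever cluster context is currently selected. Provider-specific details such as node selectors or storage classes would need to be layered on separately.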
Modern multicloud systems typically involve Kubernetes, the robust, open-source container orchestration platform. Kubernetes provides a standardised way to deploy, manage, and scale containerised applications, acting like a maritime logistics platform that moves AI workloads efficiently and without incident.
Its declarative configuration and automated scaling capabilities streamline the deployment of AI models. For instance, consider a case where a generative AI model is deployed for content creation. Kubernetes ensures that as the demand for content generation increases, the model scales horizontally to meet the rising workload.
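For a sense of how that horizontal scaling decision is made, the sketch below reproduces the core formula documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to the target, rounded up. The function name and the example numbers are our own, and the real controller also handles details omitted here, such as unready pods and stabilisation windows.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule: ceil(current_replicas * current / target),
    skipped when the ratio is within tolerance, then clamped to bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Four pods averaging 90 requests/s each against a target of 60.
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

With four pods running at 1.5 times the target load, the rule asks for six pods. A small deviation inside the tolerance band leaves the count unchanged, which avoids constant churn.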
However, like any ship navigating stormy waters, Kubernetes isn't without its challenges. Orchestrating AI workloads efficiently requires a deep understanding of resource utilisation, networking, and storage configurations. Ensuring optimal performance and resource allocation becomes a puzzle that demands continuous refinement.
There is a lot of great content out there about opting for Kubernetes orchestrators to handle inference at scale. Essentially, real-time inference is where AI companies can make the biggest bang and disrupt their industry. Think of real-time fraud detection, real-time video creation from text input, or customer service bots that need to reply in seconds before losing engagement.
Yaron Sternbach has a great writeup on developing model inference at scale, and how his team developed a high-level architecture of serving models across Kubernetes clusters. Here are some of our learnings around designing architecture for inference use cases:
Crewing an AI team is an exercise in finding specialists who can help navigate changing weather on a day-to-day basis. Relying solely on a GPU cloud provider that serves as a mere GPU shelf can hinder progress and limit the potential of AI initiatives. Instead, AI teams should engage with professional services offered by their GPU cloud providers and Kubernetes experts to gain specialised guidance. Your crew manifest should include an infrastructure provider who knows how to work with GPU-accelerated workloads at enterprise scale. We’ve picked up a few industry metrics along the way, around specialised GPU cloud provisions:
As we navigate the stormy seas of scaling AI workloads with our customers, Kubernetes, multicloud strategies, and GPU availability have emerged as crucial points of sail.
In our experience, the best setup for scaling AI workloads on Ori looks like this in a nutshell: