
How to Navigate a Global GPU Shortage & Scale AI Workloads

A global GPU shortage and runaway compute costs threaten to sink even the best AI project’s go-to-market plans. How can AI teams navigate the GPU demand storm and deploy models for real-time inference at scale?

GPU demand has accelerated drastically due to the AI boom. Despite the frenzied progress in AI capabilities, model training, and optimisation, there simply isn’t enough infrastructure to go around. In 2023, AI product deployment slowed to a dead calm worldwide because the GPUs required to run the most capable AI products are expensive and hard to access.

“According to Gartner, a staggering 85% of AI initiatives fail to reach their full potential. Only 53% of AI projects ever make it from the pilot phase to full-scale deployment”


The global chip shortage even stalled industry leader OpenAI recently, forcing it to pause new sign-ups for the paid ChatGPT Plus service. From November 15, 2023 to December 13, paying users were unable to join, as OpenAI had exceeded its GPU infrastructure capacity.



Meanwhile, the coveted Nvidia H100s, today’s fastest and most powerful commercial GPUs for AI workloads, are sold out, with pre-orders not shipping until Q1 2024.

So, just how are AI teams supposed to deploy multimodal models and run inference at scale if companies like Google, OpenAI and Microsoft are facing the same bottlenecks? In this post, we’ll talk about the strategies we’ve learned with customers to scale AI workloads atop Kubernetes, go multicloud with GPU resources, and alleviate some of the pain points AI teams face when they feel the GPU infrastructure squeeze.

Broad strategies to overcome the global GPU shortage

There are a few key techniques we’ve discovered alongside our customers as we’ve built the Ori GPU Cloud offering. We’ll share what we’ve learned, as long as you don’t mind indulging a few sailing analogies.

Working with our AI customers who build and deploy models at scale feels like navigating a storm at sea. Seasoned sailors never want to challenge one head-on, but it’s important to know how to handle extreme weather regardless. Kubernetes and Helm charts become our navigation tools, shipping AI apps to managed Kubernetes environments and taming the complexity of shifting between clouds. Meanwhile, providing GPU cloud infrastructure ensures our clients don’t wait hours for their engines to spin up.

Ori learnings that help AI companies beat the GPU shortage

  • Renting GPU cloud compute is one of the best ways for AI projects to scale workloads up and down, spin up fast, and stay cost-conscious.

  • Selecting the right GPUs is just as important as access.

  • Widening your selection of GPU cloud providers eases scaling challenges by preventing vendor lock-in and access issues.

  • Multicloud networking enables deploying AI models across diverse cloud environments without compromising operational efficiency.

  • Taking advantage of Kubernetes-native solutions enables highly secure networking and storage across multi-node instances.

  • Designing inference architecture for scale is something many AI projects don’t think about early on, but should start now.

  • Tight partnerships with your GPU providers can provide benefits that the biggest cloud providers simply can’t. 

  • A specialised, auto-scaling GPU cloud provider can reduce costs related to AI inference tasks from £75k to £15k in just one month.

Select the right GPU fleet for the job

Much of today’s perceived GPU shortage is a result of improper utilisation and overspending on GPUs that fail to meet the specific requirements of large-scale AI applications. While low-latency workloads certainly benefit from an H100, chip manufacturers and the media have created the impression that AI developers must possess only the most powerful GPUs. That simply isn’t the case. Even A100s are overkill for many uses. AI teams can often rely on older V100s, for instance, to run model inference pipelines while drastically reducing cost and ensuring high availability. It all depends on the infrastructure you need, tailored to the job.

Consider these tips when shopping for GPUs:

  • Estimate GPU requirements to “foresee” compute needs on the horizon (a rough memory-sizing sketch follows this list).

  • Segment tasks into those that require low latency and those that don’t.

  • Review the memory and bandwidth capabilities of each GPU on offer.

  • Understand your options for downscaling (e.g. H100 GPUs for LLM workloads come with higher lead times and price points, while A100s can often run the same workloads for less).

  • Work with your cloud provider to understand how their GPU offers match up with your specific workload demands and budgets to make decisions—there are dozens of angles to cut costs and it’s never one-size-fits-all.
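
To make the first two tips concrete, here’s a minimal back-of-the-envelope sketch in Python. The GPU memory figures reflect common SKUs, but the 20% headroom factor and the example model sizes are our own assumptions, so treat the output as directional rather than definitive:

```python
# Back-of-the-envelope GPU memory sizing for inference: a sketch, not a benchmark.
# Assumes model weights dominate memory and adds flat headroom for activations,
# KV cache and runtime buffers; real numbers vary with batch size and context.

GPU_MEMORY_GB = {"V100": 32, "A100": 80, "H100": 80}  # common SKUs (V100 also ships with 16 GB)
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_memory_gb(params_billion: float, precision: str = "fp16",
                       headroom: float = 1.2) -> float:
    """Weights plus ~20% headroom (an assumption) for activations and KV cache."""
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return weights_gb * headroom

def gpus_that_fit(params_billion: float, precision: str = "fp16") -> list[str]:
    """Which single GPUs could host the model without sharding it."""
    need = estimate_memory_gb(params_billion, precision)
    return [name for name, mem in GPU_MEMORY_GB.items() if mem >= need]

if __name__ == "__main__":
    for size_b in (7, 13, 70):  # hypothetical 7B / 13B / 70B models
        need = estimate_memory_gb(size_b)
        fits = gpus_that_fit(size_b) or ["multi-GPU or quantisation needed"]
        print(f"{size_b}B params (fp16): ~{need:.0f} GB -> {', '.join(fits)}")
```

Numbers like these are only rough, but they make the H100-versus-A100-versus-V100 conversation with your provider far more concrete.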

Broaden GPU selection by using multiple providers

While individual GPU cloud providers claim to host a wide selection of GPUs, some rentals are incredibly expensive and don’t offer much flexibility. A truly tailored GPU infrastructure only starts to take shape when companies can lash together compute resources from multiple sources.

In some cases, startups with cloud credits from the major public clouds can’t even use them to access the more powerful GPUs!

This is where multicloud capabilities are invaluable, and one of the reasons why customers have chosen Ori over others: we’ve been doing multicloud longer than most of the specialised GPU clouds.

While multicloud strategies offer flexibility and redundancy, they introduce additional layers of complexity. Coordinating AI workloads seamlessly across different cloud providers demands careful consideration of compatibility, data transfer, and service interoperability.

Things we consider fundamental to next-gen GPU cloud infrastructure:

  • Pooling GPUs from multiple cloud providers with the Ori Global Cloud platform and running them on top of a Kubernetes-native infrastructure.

  • Using free credits strategically. Startups can diversify usage by hopping to another cloud when credits run out or become too limiting due to quotas or cost inefficiencies.

  • Kubernetes-powered autoscaling flexibly adjusts your infrastructure to match your usage patterns, scaling in real time on CPU, GPU, memory, or request metrics (see the sketch after this list).

  • On-demand access, with high availability. Assemble the infrastructure you need, when you need it, and provision accordingly (not just what's available from your single cloud provider at a distant point in the future).
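
To illustrate the autoscaling bullet above, here is a minimal sketch that renders a HorizontalPodAutoscaler (autoscaling/v2) from Python. The deployment name, thresholds, and replica bounds are hypothetical, and scaling on GPU utilisation assumes a custom-metrics pipeline (for example, NVIDIA’s DCGM exporter plus a Prometheus adapter) is already wired into the cluster:

```python
# Render a HorizontalPodAutoscaler (autoscaling/v2) as YAML.
# Hypothetical names; the GPU metric assumes DCGM exporter + Prometheus adapter.
import yaml  # pip install pyyaml

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "llm-inference-hpa"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "llm-inference",  # hypothetical deployment name
        },
        "minReplicas": 1,   # shrink to a single replica when traffic is quiet
        "maxReplicas": 8,   # cap spend during bursts
        "metrics": [
            {   # built-in CPU utilisation target
                "type": "Resource",
                "resource": {"name": "cpu",
                             "target": {"type": "Utilization",
                                        "averageUtilization": 70}},
            },
            {   # GPU utilisation via custom metrics (assumed DCGM metric name)
                "type": "Pods",
                "pods": {"metric": {"name": "DCGM_FI_DEV_GPU_UTIL"},
                         "target": {"type": "AverageValue", "averageValue": "70"}},
            },
        ],
    },
}

print(yaml.safe_dump(hpa, sort_keys=False))  # pipe into `kubectl apply -f -`
```

The same pattern extends to request rate or queue depth; the design choice that matters is that the scaling signal reflects the real bottleneck of your inference workload, not just CPU.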

Multicloud networking means more sunny skies

Enterprises often find themselves traversing multicloud environments. As noted above, these strategies offer flexibility and redundancy but introduce additional layers of complexity, from compatibility and data transfer to service interoperability, that can stall AI teams who don’t want to spend their time focused on infrastructure.

Consider a scenario where an organisation utilises generative AI models for creative design processes. The flexibility of multicloud deployment ensures that these AI workloads can seamlessly leverage resources from different cloud providers, optimising costs and ensuring consistent performance.

Here, Kubernetes shines as a unifying force. Its agnostic nature allows for consistent workload management, irrespective of the underlying infrastructure. By leveraging Kubernetes, enterprises can deploy AI models across diverse cloud environments without compromising operational efficiency (more on that in the next section).

Taking advantage of Kubernetes to steer safely to shore

Modern multicloud systems typically involve Kubernetes, the robust, open-source container orchestration platform. Kubernetes provides a standardised way to deploy, manage, and scale containerised applications, acting like a maritime logistics platform that moves AI workloads efficiently and without incident.

Its declarative configuration and automated scaling capabilities streamline the deployment of AI models. For instance, consider a case where a generative AI model is deployed for content creation. Kubernetes ensures that as the demand for content generation increases, the model scales horizontally to meet the rising workload. 
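
As a minimal illustration of that declarative approach, the sketch below renders a Deployment for a hypothetical content-generation service that requests one GPU per replica. The names, image, and resource figures are placeholders rather than Ori-specific configuration, and the GPU request assumes the NVIDIA device plugin is installed on the cluster’s nodes:

```python
# A minimal GPU-backed Deployment, rendered as YAML from Python.
# All names and the container image are hypothetical placeholders.
import yaml  # pip install pyyaml

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "content-gen", "labels": {"app": "content-gen"}},
    "spec": {
        "replicas": 2,  # starting point; an HPA can scale this horizontally
        "selector": {"matchLabels": {"app": "content-gen"}},
        "template": {
            "metadata": {"labels": {"app": "content-gen"}},
            "spec": {
                "containers": [{
                    "name": "model-server",
                    "image": "registry.example.com/content-gen:latest",  # placeholder image
                    "ports": [{"containerPort": 8080}],
                    "resources": {
                        # a GPU request requires the NVIDIA device plugin on the node
                        "limits": {"nvidia.com/gpu": 1,
                                   "memory": "32Gi",
                                   "cpu": "8"},
                    },
                }],
            },
        },
    },
}

print(yaml.safe_dump(deployment, sort_keys=False))
```

Because the desired state lives in version-controlled manifests like this, the same workload can be applied unchanged to any conformant cluster, which is what makes the multicloud story above practical.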

However, like any ship navigating stormy waters, Kubernetes isn't without its challenges. Orchestrating AI workloads efficiently requires a deep understanding of resource utilisation, networking, and storage configurations. Ensuring optimal performance and resource allocation becomes a puzzle that demands continuous refinement. 

Key strategies we've learned working with Kubernetes + GPU infra:

  • Ask your GPU cloud provider about application orchestration and Kubernetes. If GPUs are the hardware to run your workloads, Kubernetes is the fabric to optimise cost and ensure high uptime. K8s has emerged as a powerful orchestrator for managing AI/ML workloads, offering a range of benefits that streamline deployment, optimise resource utilisation, and enhance overall efficiency.

  • Consider K8s scalability for AI workloads when you need low-latency, production-ready power that adapts to fluctuating resource requirements with ease. This is particularly crucial for inference workloads, which exhibit more dynamic resource utilisation than training workloads. Inference-based AI/ML applications often demand significant resources and may require frequent scaling up or down based on the volume of data being processed.

  • K8s’ automated scheduling capabilities significantly reduce the operational burden on your MLOps teams. By intelligently assigning AI workloads to nodes with the necessary resources, K8s ensures that applications always have the required compute power and memory, optimising performance and minimising potential bottlenecks (see the node-inventory sketch after this list).

  • Lean into portability. K8s-based AI applications can be effortlessly migrated between different environments, ensuring flexibility and ease of management. This portability is crucial for deploying and managing AI/ML workloads in hybrid infrastructure, enabling seamless transitions between on-premises and cloud-based environments.
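
To see where the scheduler can actually place those GPU workloads, a quick inventory sketch like the one below (using the official Kubernetes Python client) lists each node’s allocatable GPUs. It assumes the NVIDIA device plugin is installed and that GPU feature-discovery labels such as nvidia.com/gpu.product are present:

```python
# List nodes that advertise allocatable NVIDIA GPUs, i.e. where the Kubernetes
# scheduler can actually place GPU workloads. Assumes the NVIDIA device plugin
# is installed and a local kubeconfig is available.
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    allocatable = node.status.allocatable or {}
    gpus = allocatable.get("nvidia.com/gpu", "0")
    if gpus != "0":
        labels = node.metadata.labels or {}
        # label set by GPU feature discovery / GPU Operator (assumption)
        gpu_type = labels.get("nvidia.com/gpu.product", "unknown")
        print(f"{node.metadata.name}: {gpus} x {gpu_type}")
```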

Charting a course for inference at scale

There is a lot of great content out there about opting for Kubernetes orchestrators to handle inference at scale. Real-time inference is where AI companies can make the biggest splash and disrupt their industry. Think of real-time fraud detection, real-time video creation from text input, or customer service bots that need to reply in seconds before losing engagement.

Yaron Sternbach has a great writeup on developing model inference at scale, and how his team developed a high-level architecture of serving models across Kubernetes clusters. Here are some of our learnings around designing architecture for inference use cases:

  • Reduce overall AI inference costs by selecting the right GPUs, at the best available price, optimised for the workload at hand.

  • Plan early on how you’re going to scale data storage, networking, and accelerated compute for training and inference tasks.

  • Be proactive with your AI deployment plan and prioritise GPUs with memory capacity far exceeding current requirements (the KV-cache sketch below shows why).
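
The memory-headroom point becomes concrete once you account for the KV cache that transformer inference keeps per request. Here’s a rough sketch with hypothetical model dimensions, loosely in the range of a 7B-parameter model, using the standard two-cached-tensors-per-layer formula:

```python
# Rough KV-cache sizing for transformer inference (per concurrent request).
# Model dimensions below are hypothetical, loosely in the range of a ~7B model.

def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """Two tensors (K and V) per layer, cached for every token in the context."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch / 1e9

if __name__ == "__main__":
    weights_gb = 14.0  # ~7B parameters at fp16 (assumption)
    for batch in (1, 8, 32):  # concurrent requests
        cache = kv_cache_gb(num_layers=32, num_kv_heads=32, head_dim=128,
                            seq_len=4096, batch=batch)
        print(f"batch={batch:>2}: weights ~{weights_gb:.0f} GB "
              f"+ KV cache ~{cache:.1f} GB = ~{weights_gb + cache:.0f} GB total")
```

Even a modest increase in context length or concurrency multiplies memory needs, which is why headroom belongs in the deployment plan from day one.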

Get trusted professional services and bring your ship to shore

Crewing an AI team is an exercise in finding specialists who can help navigate changing weather on a day-to-day basis. Relying solely on a GPU cloud provider that serves as a mere GPU shelf can hinder progress and limit the potential of AI initiatives. Instead, AI teams should engage with the professional services offered by their GPU cloud providers and Kubernetes experts to gain specialised guidance. Your crew manifest should include an infrastructure provider who knows how to work with GPU-accelerated workloads at enterprise scale. We’ve picked up a few industry metrics along the way around specialised GPU cloud provision:

  • Specialised GPU clouds can provide up to 35x greater velocity and 80% cost savings compared to the biggest cloud providers out there.
  • MLOps and DevOps talent will remain in short supply. While automation may replace some data science and research roles, hiring for deployment, DevOps, and MLOps will only become more competitive.



Ready to weather the coming GPU demand storm?

As we navigate the stormy seas of scaling AI workloads with our customers, Kubernetes, multicloud strategies, and GPU availability have emerged as crucial points of sail.

In our experience, the best setup for scaling AI workloads on Ori looks like this in a nutshell:

  • Ori Global Multicloud: Kubernetes provides the necessary orchestration layer, while multicloud strategies offer flexibility, portability, and above all the scalability that AI inference demands due to “bursty” compute workloads (intense usage and high uptime at some moments, complete dormancy at others).

  • Ori GPU cloud compute: Concentrate on crafting and refining your AI/ML models while Ori handles the accessibility and availability of your GPUs, with lightning-fast provisioning, low latency, and the flexibility to scale into the GPUs you require on demand.

  • DevOps as a bonus. AI engineers without any DevOps expertise can comfortably operate a complex multicloud architecture with Ori.

Post co-authored with Brenden Arakaki. Ori GPU Cloud offerings available at https://ori.co/gpu-compute.


Need more? Check out this post too:
Deploy LLMs on any cloud with Ori

 

 
