
A deep dive into NVIDIA’s Blackwell platform: B100 vs B200 vs GB200 GPUs

Explore the NVIDIA Blackwell GPU platform, featuring powerful superchips like B100, B200, and GB200. Discover how these GPUs are about to unleash a new wave of AI computing in this in-depth analysis.

NVIDIA unveiled its latest GPU platform, Blackwell, at GTC earlier this year. The new platform, named after the pioneering mathematician and statistician David Blackwell, includes two powerful GPUs, the B100 and B200, as well as the GB200 supercomputer series. In this blog post, we explore what makes Blackwell GPUs unique and how they can unleash the next wave of AI computing.

What’s new in NVIDIA’s next generation of GPUs

AI Superchip: Each Blackwell superchip consists of two dies connected by a 10TB/s chip-to-chip (C2C) interconnect, coming together as a single GPU with full cache coherence. The dies, built with TSMC’s custom 4NP fabrication process, pack a whopping 208 billion transistors, compared to 80 billion in Hopper.
 
The new NVIDIA Blackwell chips offer larger memory capacity for bigger models and more than double the memory bandwidth. This is crucial because a memory wall can prevent large AI models from taking full advantage of GPU processing power. Another key feature of Blackwell is its incredibly fast GPU-to-GPU connection, enabling multiple GPUs to work together as unified compute blocks. Here’s the NVIDIA B100 vs H100 feature comparison: 

 
| | HGX H100 8-GPU | HGX B100 8-GPU |
|---|---|---|
| Form factor | 8x NVIDIA H100 SXM | 8x NVIDIA B100 SXM |
| Aggregate memory capacity | Up to 640GB | Up to 1.5TB |
| Aggregate memory bandwidth | 27TB/s | 64TB/s |
| Aggregate GPU-to-GPU bandwidth | 7.2TB/s | 14.4TB/s |
| NVLink | 4th Gen | 5th Gen |
| NVSwitch | 3rd Gen | 4th Gen |
2nd Gen Transformer Engine: features 5th Gen Tensor Cores that support new quantization formats and precisions. The engine greatly speeds up inference of Mixture of Experts (MoE) models through dynamic range management and advanced microscaling formats: it continuously adjusts and refines numerical formats toward lower precision, optimizing models for better performance.
 
| | HGX H100 8-GPU | HGX B100 8-GPU |
|---|---|---|
| Tensor Core precisions | FP64, TF32, BF16, FP16, FP8, INT8 | FP64, TF32, BF16, FP16, FP8, INT8, FP6, FP4 |
| CUDA® Core precisions | FP64, FP32, FP16, BF16, INT8 | FP64, FP32, FP16, BF16 |
NVIDIA Blackwell vs Hopper, B100 vs H100 specs

[Figure — Source: NVIDIA - Blackwell HGX performance data]
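To make “dynamic range management” concrete, here is a minimal NumPy sketch of the core idea: measure a tensor’s absolute maximum (amax) and scale it to fill the FP8 E4M3 range. This is an illustration of the principle only, not NVIDIA’s implementation.

```python
import numpy as np

E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3

def quantize_e4m3(x):
    """Scale x so its absolute maximum fills the FP8 E4M3 range."""
    amax = np.abs(x).max()
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    # Clipping emulates the range limit; real FP8 also rounds the mantissa.
    return np.clip(x * scale, -E4M3_MAX, E4M3_MAX), scale

def dequantize(x_q, scale):
    return x_q / scale

acts = (np.random.randn(4, 1024) * 3).astype(np.float32)
q, s = quantize_e4m3(acts)
print("scale:", s, "restored max error:", np.abs(dequantize(q, s) - acts).max())
```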
MoE models run inference significantly faster than equivalent dense (non-expert) models thanks to conditional computation and the sparsity that expert parallelism exploits: each token activates only a few experts rather than the full network.
 
However, these models require more VRAM because the system must load all experts and their parameters into memory. Beyond its substantially higher memory capacity and bandwidth, Blackwell’s lower-precision formats and microscaling help alleviate this problem by letting larger models with more parameters fit into GPUs. The sketch below illustrates both effects.
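Here is a minimal PyTorch sketch of a top-k-routed MoE layer (names and sizes are made up for illustration): every expert’s weights must be resident in memory, yet each token only exercises top_k of them.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts layer (illustrative only)."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        # All experts must live in GPU memory, even though each token
        # activates only top_k of them -- hence MoE's VRAM appetite.
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e          # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

x = torch.randn(16, 64)
print(TinyMoE()(x).shape)  # torch.Size([16, 64])
```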
 
[Figure: NVIDIA B200 specs and benchmark performance. Source: NVIDIA - Blackwell HGX performance data]
This paper on Microscaling (MX) formats for generative AI presents benchmark results showcasing the impressive potential of smaller precision formats for both training and inference, with only minor accuracy losses. As these formats mature, more ML developers are likely to adopt them for model development.
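The core idea of microscaling is easy to sketch. Below is a toy NumPy version of block scaling in the spirit of MX formats, where a block of 32 values shares one power-of-two scale and each element is stored in a narrow FP4-like range; the real OCP MX spec differs in detail (e.g., elements snap to the true E2M1 grid).

```python
import numpy as np

BLOCK = 32       # elements sharing one scale, as in the OCP MX spec
FP4_MAX = 6.0    # largest magnitude of the E2M1 (FP4) element format

def mx_quantize(x):
    """Toy microscaling: one shared power-of-two scale per 32-element block."""
    blocks = x.reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # The shared scale is a power of two, stored once per block.
    scale = 2.0 ** np.ceil(np.log2(amax / FP4_MAX + 1e-30))
    quantized = np.clip(np.round(blocks / scale), -FP4_MAX, FP4_MAX)
    return quantized, scale

def mx_dequantize(q, scale):
    return (q * scale).reshape(-1)

x = np.random.randn(128).astype(np.float32)
q, s = mx_quantize(x)
print("mean abs error:", np.abs(mx_dequantize(q, s) - x).mean())
```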
 
[Figure: NVIDIA B200 training performance. Source: NVIDIA - Blackwell HGX performance data]
The new Transformer Engine speeds up LLM training through enhancements to the NeMo Framework and expert parallelism techniques integrated from Megatron-Core. We expect these advancements to pave the way for the first 10-trillion-parameter models.
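For context, FP8 training with today’s Transformer Engine library looks roughly like the sketch below (te.Linear, fp8_autocast, and DelayedScaling are the library’s Hopper-era APIs; the layer sizes are arbitrary). Blackwell’s 2nd gen engine builds its FP6/FP4 support on the same recipe machinery.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Sketch: one FP8 linear layer trained under a delayed-scaling recipe.
model = te.Linear(1024, 1024, bias=True).cuda()
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

x = torch.randn(32, 1024, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = model(x)       # forward runs in FP8 with amax-based scaling
y.sum().backward()     # backward (outside the context) uses the recipe's gradient format
```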
 
5th Generation NVLink: At 1.8TB/s of bidirectional bandwidth per GPU, this generation of GPU-to-GPU interconnect is twice as fast as the previous one and enables seamless high-speed communication among up to 576 GPUs. Accelerated in-network computation makes NCCL collective operations more efficient and helps GPUs synchronize faster. The latest generation of NVSwitch enables multi-GPU clusters such as the GB200 NVL72, with an aggregate bandwidth of 130TB/s for large models.
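In practice, this fabric is what NCCL collectives ride on. Here is a minimal torch.distributed sketch of the all-reduce at the heart of data-parallel training (the script name is a placeholder; run one process per GPU on a single node):

```python
import os
import torch
import torch.distributed as dist

# Minimal NCCL all-reduce; launch with:
#   torchrun --nproc_per_node=8 allreduce_demo.py
# NCCL routes the collective over NVLink/NVSwitch where available.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Each GPU contributes a 1 GiB tensor filled with its own rank.
t = torch.full((256 * 1024 * 1024,), float(dist.get_rank()), device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)   # every rank ends with the sum

if dist.get_rank() == 0:
    print("reduced value:", t[0].item())   # = sum of 0..world_size-1
dist.destroy_process_group()
```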

Confidential computing enhancements: The latest Blackwell GPUs feature Trusted Execution Environment (TEE) technology. While CPUs have long supported TEEs to ensure data confidentiality and integrity in applications like content authentication and secure financial transactions, NVIDIA GPUs now also offer TEE-I/O capabilities, protecting data inline as it traverses NVLink connections. Additionally, Blackwell GPUs encrypt data at rest, in motion, and during computation.
 
Superfast decompression for data analytics: Blackwell can decompress data at a blistering 800GB/s with formats such as LZ4, Snappy, and Deflate. Fed by 8TB/s of HBM3e (High Bandwidth Memory) bandwidth and the lightning-fast NVLink-C2C interconnect to the Grace CPU, the GB200 makes the data pipeline extremely fast. NVIDIA benchmarks on a GB200 cluster show 18x more queries per second than a traditional CPU and 6x more than an H100 GPU, making GPUs increasingly suitable for data analytics and database workloads.
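As an illustration of the kind of pipeline that benefits, here is a short RAPIDS cuDF sketch in which a Snappy-compressed Parquet file is decompressed and aggregated on the GPU (the file name and column names are placeholders):

```python
import cudf  # RAPIDS GPU DataFrame library

# Decompression and the subsequent scan/aggregation all happen on the GPU.
# "events.parquet" (Snappy-compressed) and its columns are placeholders.
df = cudf.read_parquet("events.parquet")
errors_by_service = (
    df[df["status"] == "error"]
    .groupby("service")
    .agg({"latency_ms": "mean", "request_id": "count"})
)
print(errors_by_service.head())
```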
 
Reliability, availability and serviceability (RAS) engine: performs automatic, built-in tests on the Blackwell chip’s computational cores and memory. This is especially important for large supercomputing clusters, as it lets teams identify and replace underperforming GPU boards, keeping performance high while protecting their GPU investment.
 
Understanding the Blackwell GPU lineup: B100 vs B200 vs GB200

The NVIDIA Blackwell family of GPU-based systems comprises the HGX B100, HGX B200, DGX B200, and NVIDIA DGX supercomputers such as the GB200 NVL36 and GB200 NVL72. The table below lists their specs and performance benchmarks as provided by NVIDIA:
 
 
| | HGX B100 | HGX B200 | GB200 NVL72 |
|---|---|---|---|
| Form factor | 8x NVIDIA B100 SXM | 8x NVIDIA B200 SXM | GB200 boards, 1 Grace CPU : 2 Blackwell GPUs each (36 Grace CPUs : 72 Blackwell GPUs total) |
| CPU platform | x86 | x86 | Grace, powered by 2,592 Arm® Neoverse V2 cores |
| Aggregate memory | Up to 1.5TB | Up to 1.5TB | Up to 13.5TB |
| Aggregate GPU-to-GPU bandwidth | 14.4TB/s | 14.4TB/s | 130TB/s |
| FP4 Tensor Core | 112 PFLOPS | 144 PFLOPS | 1,440 PFLOPS |
| FP8/FP6 Tensor Core | 56 PFLOPS | 72 PFLOPS | 720 PFLOPS |
| INT8 Tensor Core | 56 POPS | 72 POPS | 720 POPS |
| FP16/BF16 Tensor Core | 28 PFLOPS | 36 PFLOPS | 360 PFLOPS |
| TF32 Tensor Core | 14 PFLOPS | 18 PFLOPS | 180 PFLOPS |
| FP32 | 480 TFLOPS | 640 TFLOPS | 6,480 TFLOPS |
| FP64 | 240 TFLOPS | 320 TFLOPS | 3,240 TFLOPS |
| FP64 Tensor Core | 240 TFLOPS | 320 TFLOPS | 3,240 TFLOPS |

Scale models to multi-trillion parameters with NVIDIA GB200 supercomputers

The GB200 superchip forms the core of GB200 supercomputers, combining 1 Grace CPU and 2 Blackwell GPUs in a memory-coherent, unified memory space.
 
The GB200 system comes in different versions, such as the GB200 NVL36 and GB200 NVL72, depending on the number of GPUs: each rack holds 9 or 18 GB200 compute-node trays, depending on the design. The racks include cold plates and connections for liquid cooling, PCIe Gen 6 for fast networking, and NVLink connectors for seamless NVLink cable integration.
 
  • GB200 NVL36 is one rack of 9x dual-GB200 (4 GPUs, 2 CPUs) compute nodes and 9x NVSwitch trays
  • GB200 NVL72 can be two racks of 9x trays of dual-GB200 compute nodes and 9x NVSwitch trays
  • GB200 NVL72 can also be one rack of 18x trays of dual-GB200 compute nodes and 9x NVSwitch trays
Here’s why GB200 supercomputers are well suited to handle the complexity of large models:
 
  1. AI performance advantage from a massive compute block: The GB200 superchip pairs a new Transformer Engine with fifth-generation NVLink and 1.8TB/s of GPU-to-GPU bandwidth, delivering 4x faster training for large language models like GPT-MoE-1.8T. InfiniBand networking and NVIDIA Magnum IO™ software ensure efficient scaling of GPU computing clusters with up to 576 GPUs.
  2. Grace CPU with superior LPDDR5X memory: The Grace CPU is a powerhouse of 144 Arm v9 Neoverse cores, delivering up to 7.1 TFLOPS of performance with access to 960GB of LPDDR5X RAM at 1TB/s of memory bandwidth. This ultra-fast, low-power memory accelerates transactions while maintaining data integrity through error correction code (ECC), making it suitable for critical workloads.

  3. Blazing-fast CPU interconnect with simplified NUMA: powered by the 900GB/s NVLink-C2C interconnect, which is several times faster than a traditional PCIe link.
[Figure: NVIDIA Grace CPU architecture. Source: NVIDIA - Grace CPU Whitepaper]
 


Explore use cases for GB200 AI supercomputers

Here are some examples of use cases for supercomputers such as the NVIDIA GB200 NVL72:
 
  • Unlocking training for trillion-parameter models: With state-of-the-art (SOTA) models increasingly exceeding a trillion parameters, training a 1.8-trillion-parameter model on a GB200 NVL72 cluster is 4x faster than on an equivalent H100 GPU cluster.

[Figure: NVIDIA GB200 NVL72 training performance]

  • Scalable inference for powerful generative AI models: A Menlo Ventures study of business leaders on adopting generative AI revealed that 96% of computing spend on generative AI goes towards inference, highlighting the importance of optimizing performance for better ROI. The two main factors to consider when implementing inference are size and speed: businesses aim to offer instant experiences to their users as they transform their products and services with AI, regardless of the size of their customer base.

[Figure: NVIDIA GB200 NVL72 inference performance]

This is where the GB200 NVL72 cluster becomes crucial, providing up to 30x better inference performance at real-time speeds and bringing the scalability benefits of Blackwell's architecture to practical inference use cases in business and consumer applications.
 
  • Seamless execution of Mixture of Experts (MoE) models: The massive aggregate memory of up to 13.5TB in GB200 systems and the incredibly fast GPU interconnect help AI teams realize the potential of MoE models better than ever before. The visualization below shows how experts in an MoE model communicate with each other and across the model's layers. Without Blackwell’s NVLink interconnect, NVIDIA estimates that GPUs would spend half their time on communication instead of computation.
[Figure: Mixture of Experts (MoE) communication performance]

  • Superlative vector database and retrieval-augmented generation (RAG) performance: The Grace CPU’s 960GB of memory and 900GB/s C2C link are a perfect fit for accelerating RAG pipelines via low-latency vector search (see the sketch after this list).

  • Sustainable AI computing: Combining the energy savings of liquid cooling with the efficiency of the GB200 supercomputing system, the GB200 NVL72 is 25x more energy efficient than an equivalent NVIDIA H100 cluster.
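To make the vector-search point above concrete, here is a minimal sketch with FAISS; a GPU-accelerated index (faiss-gpu or cuVS) would serve the same role, and the dimensions and data are made up:

```python
import numpy as np
import faiss  # CPU index for brevity; swap in faiss-gpu or cuVS for GPUs

d, n = 768, 100_000                       # embedding width, corpus size
corpus = np.random.rand(n, d).astype("float32")
faiss.normalize_L2(corpus)                # cosine similarity via inner product

index = faiss.IndexFlatIP(d)              # exact inner-product search
index.add(corpus)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)      # top-5 chunks to hand to the LLM
print(ids[0], scores[0])
```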

Power your AI with Blackwell GPUs on Ori

Want to leverage the power of NVIDIA Blackwell GPUs for your AI-focused business? Ori’s AI-native cloud is purpose-built for AI/ML workloads, featuring top-notch GPUs, performant storage, and AI-ready networking.
Train and serve world-changing AI models on Ori! Reserve your Blackwell GPUs today!
