Nvidia unveils Blackwell B200 GPU, the world’s most powerful chip for AI

Nvidia’s indispensable H100 AI chip helped make it a multitrillion-dollar company, one that may be worth more than Alphabet and Amazon, and competitors have been fighting to catch up. But Nvidia may be about to extend its lead with the new Blackwell B200 GPU and GB200 ‘superchip’.

Image: Nvidia

Nvidia says the new B200 GPU offers up to 20 petaflops of FP4 horsepower from its 208 billion transistors, and that a GB200 combining two of those GPUs with a single Grace CPU can deliver 30 times the performance for LLM inference workloads while also being substantially more efficient: it “reduces costs and energy consumption by up to 25x” versus an H100, Nvidia says.

Training a 1.8 trillion parameter model would previously have required 8,000 Hopper GPUs and 15 megawatts of power, Nvidia claims. Today, Nvidia CEO Jensen Huang says 2,000 Blackwell GPUs can do the same job while consuming just four megawatts.
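As a quick sanity check on those figures, here’s a back-of-envelope sketch using only Nvidia’s quoted numbers, and assuming both runs take comparable wall-clock time (which Nvidia’s framing implies but doesn’t state):

```python
# Back-of-envelope comparison of Nvidia's quoted training claims for a
# 1.8T-parameter model. All numbers are Nvidia's; the comparable
# wall-clock-time assumption is mine, for illustration only.

hopper_gpus, hopper_mw = 8000, 15.0       # Hopper-era claim
blackwell_gpus, blackwell_mw = 2000, 4.0  # Blackwell claim

gpu_reduction = hopper_gpus / blackwell_gpus    # 4x fewer GPUs
power_reduction = hopper_mw / blackwell_mw      # 3.75x less power

# Average power per GPU implied by each claim, in kilowatts.
hopper_kw = hopper_mw * 1000 / hopper_gpus          # ~1.9 kW
blackwell_kw = blackwell_mw * 1000 / blackwell_gpus  # ~2.0 kW

print(f"{gpu_reduction:.2f}x fewer GPUs, {power_reduction:.2f}x less power")
print(f"~{hopper_kw:.2f} kW/GPU (Hopper) vs ~{blackwell_kw:.2f} kW/GPU (Blackwell)")
```

Notably, the implied per-GPU power barely changes; the claimed savings come from needing a quarter as many chips for the same job.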

On a GPT-3 LLM benchmark with 175 billion parameters, Nvidia says the GB200 delivers a somewhat more modest seven times the performance of an H100, along with four times the training speed.

Image: Nvidia

Nvidia told reporters that one of the key improvements is a second-generation transformer engine that doubles the compute, bandwidth, and model size by using four bits for each neuron instead of eight (hence the 20 petaflops of FP4 I mentioned earlier). A second key difference comes when you connect large numbers of these GPUs together: a next-generation NVLink switch that lets 576 GPUs talk to one another, with 1.8 terabytes per second of bidirectional bandwidth.
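To see why halving the bits doubles those headline numbers, here’s a minimal sketch (my own illustration, not Nvidia’s code) of what dropping from FP8 to FP4 does to weight storage, and by extension to how many parameters a fixed amount of memory and bandwidth can serve:

```python
# Minimal illustration (not Nvidia code): shrinking each weight from
# 8 bits (FP8) to 4 bits (FP4) halves the bytes moved per parameter,
# so the same memory bandwidth serves twice the parameters per second
# and the same memory holds a model twice as large.

def model_bytes(params: float, bits_per_param: int) -> float:
    """Raw weight storage only; ignores activations and overhead."""
    return params * bits_per_param / 8

params = 1.8e12  # the 1.8T-parameter model Nvidia keeps citing
fp8_tb = model_bytes(params, 8) / 1e12
fp4_tb = model_bytes(params, 4) / 1e12

print(f"FP8 weights: {fp8_tb:.1f} TB, FP4 weights: {fp4_tb:.1f} TB")
# FP8 weights: 1.8 TB, FP4 weights: 0.9 TB
```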

Building that switch required Nvidia to create an entirely new network switch chip, one with 50 billion transistors and some of its own onboard compute: 3.6 teraflops of FP8, Nvidia says.

Image: Nvidia

Previously, Nvidia says, a cluster of just 16 GPUs would spend 60 percent of its time communicating and only 40 percent actually computing.
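Taken at face value, that split caps effective throughput at well under half of peak. A quick illustration of the arithmetic (my sketch; the per-GPU peak figure here is hypothetical, and real utilization varies by workload):

```python
# If a 16-GPU cluster spends 60% of its time on communication
# (Nvidia's figure), only 40% of peak compute lands on the workload.
# Illustrative only: the per-GPU peak below is a hypothetical number.

peak_pf_per_gpu = 4.0    # hypothetical peak per GPU, in petaflops
gpus = 16
compute_fraction = 0.40  # 40% computing, 60% communicating

effective_pf = peak_pf_per_gpu * gpus * compute_fraction
print(f"Effective: {effective_pf:.1f} of {peak_pf_per_gpu * gpus:.1f} peak petaflops")
```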

Nvidia, of course, is counting on companies to buy large quantities of these GPUs and is packing them into larger designs, like the GB200 NVL72, which plugs 36 CPUs and 72 GPUs into a single liquid-cooled rack for a total of 720 petaflops of AI training performance or 1,440 petaflops (1.4 exaflops) of inference. It contains almost three kilometers of cabling across 5,000 individual cables.
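Those rack-level figures line up with the per-GPU numbers. Here’s the arithmetic (my consistency check, assuming the 1,440-petaflop figure is FP4 inference and training runs at the next precision up, at half the throughput):

```python
# Consistency check on Nvidia's NVL72 rack numbers, using the quoted
# per-GPU figures. Assumes the 1,440 PF figure is FP4 inference and
# that training at the next precision up halves throughput.

gpus_per_rack = 72
fp4_pf_per_gpu = 20   # the B200's quoted FP4 peak

inference_pf = gpus_per_rack * fp4_pf_per_gpu  # 1,440 PF = 1.44 EF
training_pf = inference_pf / 2                 # 720 PF, matching the claim

print(f"Inference: {inference_pf} PF, training: {training_pf:.0f} PF")
```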

Image: Nvidia

Each drawer in the rack holds either two GB200 chips or two NVLink switches, with 18 of the former and nine of the latter per rack. In total, Nvidia says one of these racks can support a 27 trillion parameter model. GPT-4 is rumored to be around a 1.7 trillion parameter model.
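At four bits per weight, that 27 trillion parameter figure maps onto rack-scale memory fairly neatly (a rough sketch of weight storage alone, ignoring activations, KV caches, and overhead):

```python
# Rough sketch: weight storage for a 27T-parameter model at FP4.
# Ignores activations, KV caches, and framework overhead, so treat
# this as a lower bound rather than a sizing guide.

params = 27e12
fp4_bytes = params * 4 / 8   # 4 bits per parameter
print(f"FP4 weights alone: {fp4_bytes / 1e12:.1f} TB")  # 13.5 TB
```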

The company says Amazon, Google, Microsoft and Oracle are all already planning to offer the NVL72 racks in their cloud service offerings, although it’s not clear how many they are buying.

And of course, Nvidia is also happy to offer companies the rest of the solution. Here’s the DGX Superpod for DGX GB200, which combines eight systems into one for a total of 288 CPUs, 576 GPUs, 240 TB of memory, and 11.5 exaflops of FP4 computing.
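Those Superpod numbers are straight multiples of a single NVL72 rack (my arithmetic on the quoted figures):

```python
# The DGX Superpod figures scale linearly from one NVL72 rack
# (36 CPUs, 72 GPUs, ~1.44 EF of FP4 inference), per Nvidia's numbers.

racks = 8
cpus = racks * 36      # 288 CPUs
gpus = racks * 72      # 576 GPUs
fp4_ef = racks * 1.44  # ~11.5 exaflops of FP4

print(f"{cpus} CPUs, {gpus} GPUs, ~{fp4_ef:.1f} EF FP4")
```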

Nvidia says its systems can scale to tens of thousands of GB200 superchips, connected to 800Gbps networks with its new Quantum-X800 InfiniBand (for up to 144 connections) or Spectrum-X800 ethernet (for up to 64 connections).

We don’t expect to hear anything about new gaming GPUs today, as this news comes from Nvidia’s GPU Technology Conference, which usually focuses almost entirely on GPU computing and AI, not gaming. But the Blackwell GPU architecture will likely power future RTX 50 series desktop graphics cards as well.
