The age of artificial intelligence is no longer measured by what models can learn, but by what the underlying hardware enables them to achieve. Over the past three years, we have witnessed foundation models grow from impressive research demonstrations into the operational backbone of industries — powering everything from autonomous code generation to real-time drug discovery. Yet behind every breakthrough model lies an infrastructure reality: the silicon must keep pace with the ambition. With the unveiling of the NVIDIA Vera Rubin platform at GTC 2026, that pace has taken a dramatic leap forward — one that fundamentally changes what AI architects, infrastructure engineers, and model builders can accomplish.
The platform’s name honors Vera Rubin, the American astronomer whose observations of galaxy rotation curves provided the first compelling evidence for dark matter — a fitting namesake for hardware designed to reveal the hidden potential of AI models that current silicon cannot fully exploit.
This article explores how the Vera Rubin platform — built around the Rubin R100 GPU and the Vera CPU — transforms the landscape for foundation model training, Mixture-of-Experts (MoE) architectures like DeepSeek V3 and Qwen3, and the emerging class of agentic AI systems. We will examine the hardware through the lens of the workloads it serves, discuss what AI-infrastructure teams must rethink in the Rubin era, and consider how this new silicon invites us to redesign the very architecture of large language models.

To understand why Vera Rubin matters, we need to appreciate that it is not merely a faster GPU. It is six co-designed chips engineered as a unified AI computing platform [1]. The Rubin R100 GPU packs 336 billion transistors on TSMC's 3nm process, carries 288 GB of HBM4 memory with 22 TB/s of bandwidth, and delivers 50 PFLOPS of FP4 inference compute per chip. The Vera CPU contributes 88 custom Olympus cores on Arm v9, purpose-built for the sequential reasoning that agentic workloads demand. NVLink 6 binds it all together with 3.6 TB/s per GPU and 260 TB/s across the NVL72 rack; according to NVIDIA, that aggregate bandwidth exceeds the total traffic of the global internet backbone, a claim that underscores the sheer scale of on-rack communication [1][2].


Think of it this way: if previous GPU generations were powerful engines bolted onto a standard chassis, Vera Rubin is a Formula One car where the engine, aerodynamics, suspension, and tires were all designed as one inseparable system. Every chip exists to remove a bottleneck that the others would create.
This matters because bottlenecks in AI computing are rarely isolated. Faster computing exposes memory bandwidth limits. Wider memory bandwidth exposes interconnect limits. More interconnect bandwidth exposes CPU orchestration limits. Vera Rubin attacks all of these simultaneously. The ConnectX-9 SuperNIC provides 1.6 Tb/s scale-out networking, the BlueField-4 DPU handles security and storage offload, and the Spectrum-6 Ethernet switch delivers 102.4 Tb/s with co-packaged optics. It is not six chips assembled from available parts — it is six chips designed as one system [1].

Before we dive into workloads, it is worth pausing on the headline specifications. The numbers are preliminary and subject to change, but they paint a clear picture of the generational leap at every level: from a single GPU, to the Superchip (two R100 GPUs paired with one Vera CPU), to the full NVL72 rack [2].

A few numbers deserve special attention. The NVL72 rack packs 1,296 chips into a single liquid-cooled enclosure. Its 3,168 Olympus CPU cores provide the sequential compute muscle for agentic orchestration, while 54 TB of LPDDR5X CPU memory serves as an extended memory tier for KV cache overflow and agent state management. The 65 TB/s of aggregate NVLink-C2C bandwidth across the rack means every CPU-GPU pair communicates coherently, without PCIe bottlenecks.

This co-design philosophy has profound implications for the three workload categories that dominate modern AI: foundation model training, sparse Mixture-of-Experts inference, and agentic AI reasoning. Let us examine each one.
Foundation model training is an exercise in sustained computation. Training a frontier model like GPT-4, DeepSeek V3, or Qwen3 requires processing trillions of tokens through billions of parameters, synchronizing gradients across hundreds or thousands of GPUs, and doing so for weeks or months without interruption.

The R100 GPU delivers 35 PFLOPS of FP4 training per chip — roughly 3.5 times the throughput of NVIDIA’s Blackwell B200 [2][3]. At the rack level, a single Vera Rubin NVL72 (72 GPUs, 36 CPUs) reaches approximately 2.5 EFLOPS of FP4 training throughput. To put this in perspective: DeepSeek V3 consumed 2.788 million H800 GPU-hours to train on 14.8 trillion tokens at a remarkably efficient cost of approximately $5.6 million [4] — a figure that stunned the industry given that comparable frontier models were estimated at $100 million or more. With Vera Rubin’s generational improvement in throughput and bandwidth, that same training run could be completed in a fraction of the time at a fraction of the cost.
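How much faster? A back-of-envelope sketch, using the common 6·N·D approximation for training FLOPs and an assumed 35% sustained utilization (both are illustrative assumptions, not figures from NVIDIA or DeepSeek), suggests a single NVL72 rack could handle a DeepSeek V3-scale run in weeks:

```python
# Back-of-envelope: DeepSeek V3-scale pretraining on one Vera Rubin NVL72 rack.
# The 6*N*D FLOPs rule and the 35% utilization figure are illustrative
# assumptions, not vendor or DeepSeek numbers.

active_params = 37e9        # active parameters per token (DeepSeek V3) [4]
tokens = 14.8e12            # pretraining tokens [4]
rack_fp4_flops = 2.5e18     # ~2.5 EFLOPS FP4 training per NVL72 rack
assumed_mfu = 0.35          # assumed sustained model-FLOPs utilization

total_flops = 6 * active_params * tokens           # ~3.3e24 FLOPs
effective_rate = rack_fp4_flops * assumed_mfu      # ~8.8e17 FLOP/s sustained
days = total_flops / effective_rate / 86_400

print(f"total training compute : {total_flops:.2e} FLOPs")
print(f"one NVL72 rack, 35% MFU: ~{days:.0f} days")   # ~43 days under these assumptions
```

The exact number matters less than the order of magnitude: a workload that consumed millions of GPU-hours becomes, under these assumptions, a roughly six-week job on a single rack.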
Perhaps the most significant training innovation in Vera Rubin is NVIDIA’s NVFP4 format. DeepSeek V3 was the first large-scale model to validate FP8 mixed-precision training at production scale [4], proving that reduced precision does not mean reduced quality when done carefully. Vera Rubin’s third-generation Transformer Engine takes this principle further: it adaptively selects the optimal numerical precision — FP4, FP6, FP8, or FP16 — per layer and per tensor, in real time [1][5].
Think of the Transformer Engine’s adaptive precision as a professional photographer who adjusts aperture, ISO, and shutter speed for each shot. Some layers require the full fidelity of FP16 (like a portrait demanding razor-sharp detail), while others tolerate the efficiency of FP4 (like a landscape where minor detail loss is imperceptible). The hardware makes this judgment dynamically for each tensor, freeing researchers to focus on model architecture rather than precision engineering. NVIDIA has demonstrated that NVFP4 training matches the downstream task accuracy of FP16 across multiple architectures, while delivering roughly twice the throughput of FP8 [6][7].
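To make the idea concrete, here is a deliberately toy sketch of what a per-layer precision policy might look like. This is not the Transformer Engine API: the layer names, the sensitivity score, and the thresholds are hypothetical stand-ins for the dynamic tensor statistics the hardware actually tracks.

```python
# Illustrative only: a toy per-layer precision policy in the spirit of
# adaptive mixed precision. The real Transformer Engine decides dynamically
# in hardware; the names, scores, and thresholds below are invented.

from dataclasses import dataclass

@dataclass
class LayerProfile:
    name: str
    sensitivity: float  # hypothetical score, e.g. from gradient/activation statistics

def choose_precision(layer: LayerProfile) -> str:
    """Map a sensitivity score to a numerical format (toy heuristic)."""
    if layer.sensitivity > 0.9:
        return "FP16"   # keep full fidelity for the most sensitive layers
    if layer.sensitivity > 0.6:
        return "FP8"
    if layer.sensitivity > 0.3:
        return "FP6"
    return "FP4"        # most layers tolerate aggressive compression

layers = [
    LayerProfile("embedding", 0.95),
    LayerProfile("attention_qkv_proj", 0.72),
    LayerProfile("ffn_up_proj", 0.25),
    LayerProfile("ffn_down_proj", 0.28),
    LayerProfile("lm_head", 0.93),
]

for layer in layers:
    print(f"{layer.name:20s} -> {choose_precision(layer)}")
```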
For AI-infrastructure teams, this is both an opportunity and a mandate. The teams responsible for training pipelines must now rethink their precision strategies, profiling tools, and checkpointing workflows to fully exploit adaptive compression. The era of static precision recipes is ending.
Training large models is not purely a compute problem — it is also a memory bandwidth problem. During the backward pass, gradients must be read and written across optimizer states, activations, and model parameters. The R100’s HBM4 memory provides 22 TB/s of bandwidth per GPU, a 2.75x increase over Blackwell’s 8 TB/s HBM3e [1][2]. At the rack level, the NVL72 delivers an aggregate of approximately 1.58 PB/s of memory bandwidth across 72 GPUs [8].
This bandwidth headroom changes the calculus for training. Larger micro-batches become feasible without memory stalls. Activation checkpointing — a technique that trades compute for memory — becomes less necessary, saving training time. Gradient accumulation steps can be reduced, improving convergence speed.
The Mixture-of-Experts paradigm has emerged as the dominant strategy for building efficient, high-capacity models. Rather than activating all parameters for every token, MoE models route each token through a small subset of specialized “expert” networks — dramatically reducing compute cost while maintaining the total parameter capacity needed for broad knowledge.
DeepSeek V3 embodies this approach at scale: 671 billion total parameters, but only 37 billion active per token, distributed across 256 routed experts with 8 activated per token [4][9]. Qwen3-235B-A22B follows a similar philosophy with 128 experts and 22 billion active parameters [10][11]. These architectures achieve frontier-level performance at a fraction of the training and inference cost of equivalent dense models.

But MoE architectures have a fundamental infrastructure dependency: they are extraordinarily demanding on memory bandwidth and inter-GPU communication. MoE routing is akin to a hospital triage system: instead of every patient seeing every specialist, a dispatcher routes each patient to the two or three doctors most relevant to their symptoms. The system works brilliantly — until the hallways become congested or the dispatch desk cannot keep up. This is precisely where Vera Rubin excels: it widens the hallways (HBM4 bandwidth) and accelerates the dispatch (NVLink 6).
Consider the inference pipeline for DeepSeek V3. For each token, a gating network evaluates all 256 experts to determine which 8 to activate. The weights of those 8 experts must then be loaded from memory, the forward pass computed, and the results combined. This creates two bottlenecks: the memory bandwidth to load expert weights, and the network bandwidth to shuffle activations when experts reside on different GPUs.
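A minimal sketch of this routing step in PyTorch, using DeepSeek V3-like shape constants (256 routed experts, top-8, and a hidden size of 7168 taken from the technical report [4] rather than from this article); it uses plain softmax top-k gating and omits the shared expert and the auxiliary-loss-free load balancing the real model employs:

```python
import torch

# Minimal top-k MoE routing sketch with DeepSeek V3-like shape constants
# (256 routed experts, 8 active per token, hidden size 7168 from [4]).
# Real implementations add a shared expert, auxiliary-loss-free load
# balancing, and fused dispatch kernels.

num_experts, top_k, hidden = 256, 8, 7168
tokens = torch.randn(16, hidden)                 # a small batch of token activations

gate = torch.nn.Linear(hidden, num_experts, bias=False)

scores = torch.softmax(gate(tokens), dim=-1)             # token-to-expert affinities
topk_scores, topk_idx = scores.topk(top_k, dim=-1)       # pick the 8 best experts per token
topk_weights = topk_scores / topk_scores.sum(-1, keepdim=True)  # renormalized mixture weights

# topk_idx tells the runtime which expert weights to stream from HBM (the
# bandwidth-bound step) and, when experts live on other GPUs, where to send
# each token's activation over NVLink (the all-to-all shuffle).
print(topk_idx[0])    # the 8 experts chosen for the first token
```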
On an H200 with 4.8 TB/s HBM3e bandwidth, loading DeepSeek V3’s 37 billion active parameters (approximately 37 GB at FP8) takes roughly 7.7 milliseconds. On the R100 with 22 TB/s HBM4, the same operation completes in approximately 1.7 milliseconds — a 4.6x improvement [1][2]. This directly translates to higher tokens-per-second throughput for inference.
The full 671 billion parameter model at FP8 occupies roughly 671 GB — far exceeding a single GPU’s 288 GB capacity. But the Vera Rubin NVL72 rack provides 20.7 TB of aggregate HBM4 [3], comfortably accommodating the entire model with generous room for KV caches, activations, and batched requests.
The inter-GPU communication story is equally compelling. NVLink 6 provides 3.6 TB/s bidirectional bandwidth per GPU and 260 TB/s across the rack, enabling all-to-all connectivity where any GPU can communicate with any other at full speed [1][2]. For DeepSeek V3 with 256 experts distributed across 72 GPUs (roughly 3–4 experts per GPU), the all-to-all shuffle that routes token activations to the correct expert GPUs becomes near-instantaneous. At GTC 2026, NVIDIA specifically highlighted “wide expert parallelism” — placing each expert on a separate GPU — as a strategy enabled by NVLink 6’s bandwidth [12].
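The placement arithmetic behind wide expert parallelism is simple enough to sketch directly; a naive round-robin assignment (an illustrative assumption, since real schedulers also rebalance by observed expert load) already lands on the three-to-four experts per GPU mentioned above:

```python
# Round-robin placement of 256 routed experts across the 72 GPUs of an
# NVL72 rack. The placement policy is an illustrative assumption; real
# schedulers also rebalance by observed expert load.

num_experts, num_gpus = 256, 72

placement = {gpu: [] for gpu in range(num_gpus)}
for expert in range(num_experts):
    placement[expert % num_gpus].append(expert)

per_gpu = [len(experts) for experts in placement.values()]
print(f"experts per GPU: min={min(per_gpu)}, max={max(per_gpu)}")   # 3 to 4
```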
Qwen3-235B-A22B presents a particularly striking example of Vera Rubin's impact. At FP4 precision, the entire 235-billion-parameter model compresses to approximately 118 GB, well within a single R100 GPU's 288 GB HBM4 capacity, leaving roughly 170 GB for KV cache, activations, and runtime data [10][13]. This is a breakthrough: single-GPU inference of a 235-billion-parameter model was simply not possible on any previous-generation hardware. Blackwell's B200, with 192 GB HBM3e, cannot even hold Qwen3-235B at FP8 (approximately 235 GB) on a single device.
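The memory arithmetic behind that claim is simple enough to verify directly; the only assumptions are the bytes-per-parameter figures and the omission of quantization scales and any layers kept at higher precision:

```python
# Weight-memory footprint of Qwen3-235B-A22B at different precisions,
# ignoring quantization scales and any layers kept at higher precision.

params = 235e9
r100_hbm4_gb, b200_hbm3e_gb = 288, 192

for fmt, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    size_gb = params * bytes_per_param / 1e9
    print(f"{fmt}: {size_gb:.0f} GB | fits one R100 (288 GB): {size_gb <= r100_hbm4_gb} "
          f"| fits one B200 (192 GB): {size_gb <= b200_hbm3e_gb}")

# FP4: ~118 GB, leaving roughly 170 GB of HBM4 for KV cache and activations.
```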
For cloud providers and AI infrastructure teams, this has immediate practical implications. Serving Qwen3–235B on a single R100 GPU eliminates the overhead of tensor parallelism — no inter-GPU communication during inference, no synchronization points, no load balancing across GPUs. The result is lower latency, higher throughput, and simpler deployment.
DeepSeek V3’s Multi-Head Latent Attention (MLA) is an elegant innovation that compresses the KV cache by projecting keys and values into a low-rank latent space, reducing KV cache memory by more than 90% compared to standard multi-head attention [14][15]. However, even with this compression, decoding long contexts remains a bandwidth-bound operation: the compressed latent representations for all previous tokens must be read at every layer for every new token generated.
At 22 TB/s, Vera Rubin can read these compressed caches dramatically faster than any previous generation. The practical outcome is that MLA-based models can serve longer effective context windows with higher concurrent throughput and lower time-to-first-token latency. MLA was designed for bandwidth efficiency — and HBM4 finally provides the bandwidth to unlock its full potential.
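A simplified sketch of the low-rank idea behind MLA: project keys and values down to a small latent, cache only that latent, and expand on the fly when attending. The dimensions below are taken from the DeepSeek V3 report [4] rather than from this article, and the decoupled rotary-position branch and per-head details of the real mechanism are omitted:

```python
import torch

# Simplified illustration of MLA-style KV compression: cache a low-rank
# latent per token and re-expand keys/values only when attending. Dimensions
# follow the DeepSeek V3 report; the decoupled RoPE branch is omitted.

hidden, kv_latent = 7168, 512            # model width and KV latent rank [4]
n_heads, head_dim = 128, 128             # attention heads and per-head dim (report values)

down_kv = torch.nn.Linear(hidden, kv_latent, bias=False)             # compress once per token
up_k = torch.nn.Linear(kv_latent, n_heads * head_dim, bias=False)    # expand when attending
up_v = torch.nn.Linear(kv_latent, n_heads * head_dim, bias=False)

x = torch.randn(1, 4096, hidden)         # 4,096 cached tokens
latent = down_kv(x)                      # this is what actually lives in the KV cache
k, v = up_k(latent), up_v(latent)        # materialized on the fly, never stored

full_kv_bytes = 2 * 4096 * n_heads * head_dim * 1    # standard K+V cache at FP8 (1 byte)
latent_bytes = 4096 * kv_latent * 1
print(f"standard KV cache: {full_kv_bytes / 1e6:.0f} MB per layer")
print(f"MLA latent cache : {latent_bytes / 1e6:.1f} MB per layer "
      f"({full_kv_bytes / latent_bytes:.0f}x smaller)")
```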
Agentic AI represents a fundamentally different workload pattern from traditional inference. An AI agent does not simply generate a response to a prompt — it reasons through a multi-step plan, executes actions (calling tools, querying databases, writing code), evaluates results, adjusts its strategy, and iterates. This “think-act-think-act” loop combines sequential reasoning with parallel computation in rapid alternation [1][16]. The CPU-GPU interplay here is like a chess grandmaster (the CPU) who deliberates on strategy, then instructs a team of builders (the GPU) to execute each move at superhuman speed — the game advances only as fast as the two can coordinate.

Models like Qwen3 are particularly well-suited to this pattern: Qwen3 supports both a “thinking” mode (extended chain-of-thought reasoning) and a “non-thinking” mode (fast direct response), toggled via system prompts [11]. This hybrid capability maps naturally to the Vera Rubin architecture, where deep reasoning chains leverage the Vera CPU’s sequential performance and the GPU’s parallel throughput in alternation.
The Vera CPU is purpose-built for the “think” phase. Its 88 custom Olympus cores deliver what NVIDIA claims is the highest single-thread performance of any data center CPU [16][17], critical for the inherently serial tasks of agent orchestration: tokenization, tool call parsing, KV cache scheduling, expert routing decisions, and multi-step planning. Meanwhile, the R100 GPU handles the “act” phase — each forward pass through the language model is a massively parallel tensor computation at up to 50 PFLOPS FP4.

The key enabler is the NVLink Chip-to-Chip (C2C) connection between the Vera CPU and R100 GPU, providing coherent memory access at 1.8 TB/s [1]. This means the CPU can directly read and write GPU HBM4 memory, and vice versa, without expensive data copies across PCIe. For agentic workloads where the CPU and GPU alternate responsibilities hundreds of times per query, this tight coupling eliminates the latency that would otherwise make sophisticated agent pipelines impractical.
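Schematically, the loop looks like the sketch below; `llm_generate` and `run_tool` are hypothetical placeholders for the GPU-side forward pass and the CPU-side orchestration described above:

```python
# Schematic agent loop: serial orchestration (suited to the Vera CPU's
# single-thread performance) alternating with parallel forward passes on the
# GPU. llm_generate and run_tool are hypothetical placeholders supplied by
# the caller.

def run_agent(task: str, llm_generate, run_tool, max_steps: int = 10) -> str:
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        # "Think": a GPU-bound forward pass over the full accumulated context.
        # Every step re-reads the growing KV cache, which is why HBM bandwidth
        # dominates latency over long agent trajectories.
        step = llm_generate("\n".join(context))

        if step.startswith("FINAL:"):
            return step.removeprefix("FINAL:").strip()

        # "Act": inherently serial CPU work - parse the tool call, dispatch it,
        # and fold the observation back into the context for the next pass.
        observation = run_tool(step)
        context.append(step)
        context.append(f"Observation: {observation}")
    return "max steps exceeded"
```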

Agents maintain a long, evolving context across many reasoning steps. Each action appends new information to the context, and the model must attend to the entire history during each subsequent inference pass. The KV cache — the accumulated key-value pairs from all previous tokens — grows continuously and must be read in full at each step.
With 288 GB of HBM4 per GPU and 22 TB/s bandwidth, a single R100 can maintain KV cache for approximately one to two million tokens for a DeepSeek-scale model using MLA at FP8 precision. At the rack level, 20.7 TB of total HBM4 can support context windows of tens of millions of tokens when combined with techniques like ring attention or context parallelism [3][18].
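A rough capacity check supports this. The per-token MLA cache size below is derived from dimensions in the DeepSeek V3 report [4] (a 512-dimensional latent plus a 64-dimensional positional component per layer, across 61 layers), and the 50 GB cache budget is an assumption that leaves room for weights and activations on the same device:

```python
# Rough KV-cache capacity estimate for an MLA model at FP8 on one R100.
# The per-token size uses dimensions from the DeepSeek V3 report (512-dim
# latent + 64-dim positional component, 61 layers); the 50 GB cache budget
# is an assumption that leaves room for weights and activations.

kv_latent, rope_dim, layers = 512, 64, 61
bytes_per_token = (kv_latent + rope_dim) * layers * 1     # 1 byte per value at FP8

cache_budget_gb = 50
tokens = cache_budget_gb * 1e9 / bytes_per_token
print(f"{bytes_per_token / 1024:.1f} KiB per token -> "
      f"~{tokens / 1e6:.1f} M tokens in a {cache_budget_gb} GB cache budget")
# ~34 KiB per token -> roughly 1.4 M tokens, consistent with the figure above
```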
NVIDIA has also announced the Vera Rubin NVL144 CPX variant, whose CPX GPUs use 128 GB of GDDR7 memory instead of HBM4, a deliberate architectural choice optimized for the prefill-heavy access patterns of million-token-plus context inference workloads [19]. The existence of a dedicated variant reflects NVIDIA’s conviction that long-context agentic AI is the defining workload of this hardware generation.
The cost equation is perhaps the most transformative aspect. On previous-generation hardware, running a persistent monitoring agent (cybersecurity analysis, network anomaly detection, real-time market surveillance) could easily cost on the order of $50–100 per hour of GPU time at scale. NVIDIA claims a 10x reduction in inference token cost at rack scale with Vera Rubin compared to Blackwell [1][5], driven by three factors: roughly 5.6x more FP4 compute, 2.75x more memory bandwidth, and 260 TB/s of NVLink bandwidth enabling efficient MoE routing across the rack.
This reduction transforms the economics of multi-agent pipelines, where 5–10 specialized sub-agents collaborate per task, and persistent monitoring agents that must operate 24/7. Workloads that were economically unviable on previous-generation hardware become commodity services on Vera Rubin.
Every generation of hardware has reshaped how model architects think about design tradeoffs. GPU-scale parallelism enabled the Transformer architecture. The scaling of HBM bandwidth made efficient attention mechanisms practical. NVLink enabled tensor and pipeline parallelism across devices. Vera Rubin invites the next evolution.
Current MoE architectures are constrained by the hardware they were designed for. DeepSeek V3’s 256 experts and Qwen3’s 128 experts represent the practical limit of what can be efficiently routed on today’s infrastructure. With Vera Rubin’s 20.7 TB of HBM4 and 260 TB/s of aggregate NVLink bandwidth, models could scale to 1,000 or more experts across 72 GPUs [3][12].
More experts means finer-grained specialization. Imagine a model where individual experts are dedicated to specific programming languages, scientific domains, reasoning strategies, or cultural contexts. The gating network becomes not just a load balancer, but an intelligent dispatcher that routes each token to the precise combination of specialized knowledge it needs. This level of granularity was architecturally conceivable before — but hardware-constrained. Vera Rubin removes that constraint.
Until now, model architects have designed architectures assuming FP16 or BF16 training, then applied quantization post-hoc for inference. Vera Rubin’s NVFP4 training capability and adaptive Transformer Engine create a new possibility: designing architectures from the ground up for FP4 precision.
What does this mean in practice? Wider layers, because each parameter consumes half the memory of FP8. Deeper networks within the same memory budget. Architectures that explicitly exploit the precision-bandwidth tradeoff — for example, using full FP16 precision for critical attention layers while running feed-forward networks at FP4, with the Transformer Engine managing the transitions automatically [1][5].
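In budget terms, the weights-only arithmetic (a deliberately crude view that ignores KV cache, activations, and quantization metadata) looks like this:

```python
# Crude weights-only budget for a single R100: how many parameters fit in
# 288 GB of HBM4 at each precision, ignoring KV cache, activations, and
# quantization metadata.

hbm_gb = 288
for fmt, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    print(f"{fmt}: up to ~{hbm_gb / bytes_per_param:.0f} B parameters per GPU")
# FP16: ~144 B, FP8: ~288 B, FP4: ~576 B parameters (weights only)
```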
AI-infrastructure teams must prepare for this shift. Training frameworks will need first-class FP4 support. Profiling tools must expose per-layer precision decisions. Checkpointing strategies must handle mixed-precision state efficiently. The teams that build this tooling first will have a significant competitive advantage.
DeepSeek’s MLA demonstrated that novel attention mechanisms can achieve dramatic efficiency gains. But MLA made a specific tradeoff: compressing KV cache at the cost of some expressiveness. With 22 TB/s bandwidth, Vera Rubin reopens the design space.
Future architectures might employ hybrid attention patterns: dense, full multi-head attention for the most critical layers (where maximum expressiveness matters), combined with compressed or sparse attention for the majority of layers (where efficiency matters more). The third-generation Transformer Engine can apply different precision to attention scores, value projections, and output projections independently, maximizing throughput per layer [5].
Even more intriguing is the possibility of cross-sequence attention for agentic AI — architectures where attention spans across multiple related conversations, documents, or agent states simultaneously. These patterns are extremely bandwidth-hungry, and only become practical with HBM4-class memory systems.
The Vera CPU’s coherent connection to the R100 GPU via NVLink C2C opens a design dimension that did not exist before. Model architects can now explicitly design architectures where some computations — chain-of-thought planning, tool use parsing, dynamic expert routing decisions — execute on the CPU, while parallel tensor operations run on the GPU.
This is not just about offloading preprocessing. It is about co-designing the model’s reasoning pipeline to exploit both sequential and parallel hardware. Think of it as extending the model architecture beyond the neural network itself, into the orchestration layer that coordinates multi-step reasoning.
These architectural possibilities, however, will not realize themselves. Translating Vera Rubin’s potential into production systems falls to the teams who build and operate AI infrastructure.
The arrival of Vera Rubin does not eliminate the need for infrastructure optimization — it amplifies it. The gap between naive deployment and optimized deployment on this hardware is enormous. Teams that simply move their existing training and inference code to R100 GPUs will capture perhaps 30–40% of the platform’s potential. Teams that rethink their entire stack — precision strategies, parallelism configurations, memory management, KV cache policies, CPU-GPU workload distribution — will capture 80–90%.
Several specific challenges await AI-infrastructure teams, from adaptive-precision training recipes and wide expert-parallel deployment to rack-scale KV cache management and CPU-GPU workload placement.
It is worth emphasizing a powerful convergence: the models best positioned to exploit Vera Rubin — DeepSeek V3 and Qwen3 — are both open-weight. This means the broader AI community can experiment with MoE deployment strategies, benchmark expert parallelism configurations, and develop optimization tooling without waiting for proprietary model providers. The intersection of open-weight frontier models and next-generation hardware is a story about the democratization of AI at scale.
A responsible assessment must acknowledge what we do not yet know. Vera Rubin hardware is expected to ship in the second half of 2026, and no independent benchmarks exist at the time of writing. NVIDIA’s performance claims — the 10x inference cost reduction, the NVFP4 accuracy parity, the efficiency of adaptive precision — are based on NVIDIA’s own testing and have not been validated by third parties in production environments.
The competitive landscape also continues to evolve. AMD’s MI450, Google’s TPU v6, Amazon’s Trainium3, and Microsoft’s Maia 2 all represent alternative approaches to the same workloads. NVIDIA’s strength remains the CUDA ecosystem, software stack (NeMo, TensorRT-LLM, Triton Inference Server), and deep industry integration — but the gap is narrowing, and organizations should evaluate their specific workload requirements rather than assuming any single vendor is optimal for all use cases.
For teams planning infrastructure investments, the software readiness question is equally important: can existing PyTorch, vLLM, and TensorRT-LLM deployments fully exploit Vera Rubin’s adaptive precision and CPU-GPU co-design from day one? The answer is likely “partially,” with full software maturity following hardware availability by several months.

The NVIDIA Vera Rubin platform is not simply a faster GPU — it is an inflection point in the relationship between hardware and AI architecture. For the first time, the silicon is not merely keeping pace with model innovation; it is actively challenging model architects to think bigger. Thousand-expert MoE models, million-token agentic reasoning, native FP4 training, CPU-GPU co-designed inference pipelines — these are not distant possibilities but immediate opportunities for teams prepared to seize them.
DeepSeek V3 and Qwen3 represent the current frontier of what MoE architectures can achieve. On Vera Rubin, they run not just faster, but qualitatively differently: single-GPU inference of 235-billion-parameter models, near-instantaneous expert routing across racks, and long-context agent reasoning that was economically prohibitive just one generation ago. Both models are open-weight, meaning the tools and strategies developed for them benefit the entire community. The next generation of models — the ones being designed right now, with Vera Rubin’s capabilities as a design target — will push further still.
For AI-infrastructure teams, the message is clear: optimization is no longer optional; it is the differentiator. The teams that master adaptive precision, expert parallelism, CPU-GPU orchestration, and rack-scale memory management will define the competitive landscape of the next era of AI.
As someone who works at the intersection of AI models and infrastructure, I believe we are entering the most exciting period in the history of this field. The hardware is ready. The open models are ready. The question is whether we are ready to build the software, the architectures, and the systems that fully exploit what this silicon makes possible. The future of AI infrastructure is not about brute force — it is about intelligence at every level of the stack. Let us build it!
[1] NVIDIA Developer Blog, “Inside the NVIDIA Vera Rubin Platform: Six New Chips, One AI Supercomputer” — https://developer.nvidia.com/blog/inside-the-nvidia-rubin-platform-six-new-chips-one-ai-supercomputer/
[2] NVIDIA, “Infrastructure for Scalable AI Reasoning | NVIDIA Vera Rubin Platform” — https://www.nvidia.com/en-us/data-center/technologies/rubin/
[3] Hashrate Index, “NVIDIA Vera Rubin NVL72: Full Specs & Platform Breakdown” — https://hashrateindex.com/blog/nvidia-vera-rubin-nvl72-specs-breakdown/
[4] DeepSeek-V3 Technical Report, arXiv:2412.19437 — https://arxiv.org/html/2412.19437v1
[5] NVIDIA, “Infrastructure for Scalable AI Reasoning” (Rubin Platform Page) — https://www.nvidia.com/en-us/data-center/technologies/rubin/
[6] NVIDIA Developer Blog, “3 Ways NVFP4 Accelerates AI Training and Inference” — https://developer.nvidia.com/blog/3-ways-nvfp4-accelerates-ai-training-and-inference/
[7] NVIDIA Developer Blog, “NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit” — https://developer.nvidia.com/blog/nvfp4-trains-with-precision-of-16-bit-and-speed-and-efficiency-of-4-bit/
[8] Gigabyte, “NVIDIA Vera Rubin NVL72 Specifications” — https://www.gigabyte.com/FileUpload/Global/WebPage/1052/NVIDIA_2026_1H_V2.pdf
[9] InfoQ, “DeepSeek Open-Sources DeepSeek-V3” — https://www.infoq.com/news/2025/01/deepseek-v3-llm/
[10] Hugging Face, Qwen/Qwen3-235B-A22B Model Card — https://huggingface.co/Qwen/Qwen3-235B-A22B
[11] Qwen Blog, “Qwen3: Think Deeper, Act Faster” — https://qwenlm.github.io/blog/qwen3/
[12] NVIDIA GTC26 Session S81911, “Inside the NVIDIA AI Platform and Ecosystem” — https://www.nvidia.com/en-us/on-demand/session/gtc26-s81911/
[13] ApXML, “Qwen3-235B-A22B: Specifications and GPU VRAM Requirements” — https://apxml.com/models/qwen3-235b-a22b
[14] Chris McCormick, “The Inner Workings of Multihead Latent Attention (MLA)” — https://mccormickml.com/2025/04/26/inner-workings-of-mla/
[15] Medium, “DeepSeek-V3 Explained: Multi-head Latent Attention” — https://medium.com/data-science/deepseek-v3-explained-1-multi-head-latent-attention-ed6bee2a67c4
[16] NVIDIA Newsroom, “NVIDIA Launches Vera CPU, Purpose-Built for Agentic AI” — https://nvidianews.nvidia.com/news/nvidia-launches-vera-cpu-purpose-built-for-agentic-ai
[17] NVIDIA Vera CPU Product Page — https://www.nvidia.com/en-us/data-center/vera-cpu/
[18] Spheron, “NVIDIA Rubin CPX Long-Context Inference” — https://www.spheron.network/blog/nvidia-rubin-cpx-long-context-inference/
[19] NVIDIA Developer Blog, “NVIDIA Rubin CPX Accelerates Inference Performance” — https://developer.nvidia.cn/blog/nvidia-rubin-cpx-accelerates-inference-performance-and-efficiency-for-1m-token-context-workloads/