NVIDIA Vera Rubin Production

Vera Rubin Enters Full Production: What Mid-Market AI Buyers Need to Know

NVIDIA Vera Rubin is in full production. Here is what mid-market AI buyers need to know about cluster planning, cloud economics, and Rubin readiness.

Author
Josh Gelata
Infrastructure Lead
Arc Compute

NVIDIA confirmed at GTC Taipei on June 1, 2026 that the Vera Rubin platform has ramped into full production. According to NVIDIA, the manufacturing effort now spans 150 supply chain partners in Taiwan, more than 350 factories, and 30 countries, with first shipments expected in the second half of 2026.

For hyperscalers, the upgrade math is simple. They place forward orders measured in millions of GPUs, and inference already runs in dedicated gigawatt facilities.

For mid-market AI buyers, the calculus is harder. Vera Rubin resets the baseline of what counts as competitive AI infrastructure. It also reopens cloud OpEx questions that many teams thought were settled when a credit card and a hyperscaler region were enough to run experiments.

This post walks through what Vera Rubin actually changes, where cloud math now breaks for sustained inference, and how mid-market teams (including AI startups scaling from Series A onward) can plan a clean readiness path.

What did NVIDIA announce with Vera Rubin?

NVIDIA Vera Rubin is a rack-scale agentic AI platform pairing the new 88-core Vera CPU with Rubin GPUs in an open Modular GPU Architecture (MGX) design.

The flagship Vera Rubin NVL72 configuration delivers 3.6 NVFP4 ExaFLOPS of inference and 1.2 FP8 ExaFLOPS of training per rack, according to NVIDIA. [1] Full production was confirmed at GTC Taipei on June 1, 2026.

This is more than a faster chip. NVIDIA is shipping a rack-scale design built end-to-end for agentic workloads, where models reason, plan, and act across long contexts. That changes how cluster planners think about memory, fabric, and power as one connected design.

 

Vera Rubin NVL72 platform at a glance

| Component | Specification |
| --- | --- |
| Vera CPU | 88 cores, 1.2 TB/s LPDDR5X bandwidth, 3.6 TB/s on-chip fabric |
| Rubin GPU | 288 GB HBM4, 50 FP4 PetaFLOPS per GPU, 22 TB/s memory bandwidth [2] |
| NVL72 inference | 3.6 NVFP4 ExaFLOPS per rack |
| NVL72 training | 1.2 FP8 ExaFLOPS per rack |
| Agent throughput | 10x vs Grace Blackwell |
| Power per rack | 190 to 230 kW (NVL72), up to 600 kW (Rubin Ultra, 2027) |
| Cooling | 100% direct liquid cooling at 45 degrees Celsius |
| Fabric | 800G Ethernet with co-packaged optics (Spectrum-X Photonics) |
| Production status | Full production confirmed June 1, 2026, first shipments H2 2026 |
| Supply chain | 150 partners in Taiwan, 350+ factories, 30 countries |

Source: NVIDIA Newsroom (June 1, 2026), Introl analysis, Tom's Hardware deep-dive.
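The per-GPU and per-rack rows in the table can be cross-checked with simple arithmetic. The sketch below assumes 72 Rubin GPUs per NVL72 rack (the "72" in the name) and uses only the per-GPU figures quoted above; the function and constant names are illustrative, and the aggregate HBM4 figures are derived here rather than quoted from NVIDIA.

```python
# Cross-check the per-rack figures against the per-GPU figures in the table.
GPUS_PER_RACK = 72          # assumption: the "72" in NVL72 is the GPU count
FP4_PFLOPS_PER_GPU = 50     # from the table
HBM4_GB_PER_GPU = 288       # from the table
HBM_TBPS_PER_GPU = 22       # from the table


def rack_totals(gpus: int = GPUS_PER_RACK) -> dict:
    """Aggregate per-GPU figures into per-rack totals."""
    return {
        "nvfp4_exaflops": gpus * FP4_PFLOPS_PER_GPU / 1000,  # PFLOPS -> EFLOPS
        "hbm4_tb": gpus * HBM4_GB_PER_GPU / 1000,            # GB -> TB (decimal)
        "hbm_bandwidth_tbps": gpus * HBM_TBPS_PER_GPU,
    }


if __name__ == "__main__":
    for name, value in rack_totals().items():
        print(f"{name}: {value}")
```

The 72 x 50 PetaFLOPS product lands on the 3.6 NVFP4 ExaFLOPS quoted for the rack, which suggests the per-GPU and per-rack rows describe the same configuration.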

Why does Vera Rubin matter for mid-market AI buyers?

Vera Rubin matters for mid-market AI buyers because it raises the floor for production AI infrastructure. Live inference already consumes 80% to 90% of active AI data center compute, according to Goldman Sachs analysis. Vera Rubin is the first platform built end-to-end for that workload pattern, and the planning window for mid-market adoption is narrow.

The economic context matters. According to Futurum Group, global AI capital expenditure is on track to hit $690 billion in 2026, with cumulative AI infrastructure spending crossing $1 trillion before 2031. Industry supply analyses indicate forward orders from Microsoft, Google, Meta, and Amazon have absorbed the bulk of NVIDIA’s Blackwell allocation through 2026 and into 2027. And in January 2026, AWS raised pricing on EC2 Capacity Blocks for ML by roughly 15%, a rare straight price increase on a GPU reservation service, as reported by The Register.

Translation: mid-market AI teams competing for cloud GPU capacity are queueing behind hyperscalers with multi-billion-dollar forward contracts. Vera Rubin allocation is going to follow the same pattern. Teams that engage Original Equipment Manufacturer (OEM) partnerships ahead of the ramp are positioned for the first production waves. Teams waiting on reseller quotes face a longer queue.

Where does cloud math break for sustained inference?

Cloud math breaks for sustained inference somewhere between 40% and 50% GPU utilization. Above that line, owning the hardware tends to pencil out over a 24 to 36 month horizon. Cloud was engineered for bursty workloads, and agentic AI is the opposite of bursty.
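To see where a break-even line of this kind comes from, the sketch below compares a fixed cost of ownership against cloud capacity billed only for the hours actually used. Every input is a hypothetical placeholder (not a quoted market price), and the model ignores reservations, discounts, financing, and resale value, so treat it as a way to plug in your own numbers rather than a pricing claim.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def breakeven_utilization(
    capex_per_gpu: float,
    opex_per_gpu_month: float,
    cloud_rate_per_gpu_hour: float,
    horizon_months: int,
) -> float:
    """Utilization (0..1) above which owning costs less than renting.

    Owned cost is fixed over the horizon; cloud cost scales with the
    fraction of hours the GPU is actually busy.
    """
    owned_total = capex_per_gpu + opex_per_gpu_month * horizon_months
    cloud_at_full_use = cloud_rate_per_gpu_hour * HOURS_PER_MONTH * horizon_months
    return owned_total / cloud_at_full_use


if __name__ == "__main__":
    # Hypothetical inputs: $45,000 per GPU all-in, $700/month to power and
    # host it, $6.00 per GPU-hour in the cloud, 36-month horizon.
    u = breakeven_utilization(45_000, 700, 6.00, 36)
    print(f"break-even utilization: {u:.1%}")
```

With these placeholder inputs the break-even sits at roughly 45% utilization over 36 months; shortening the horizon to 24 months pushes it higher, which is why the horizon matters as much as the hourly rate.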

The market is responding: in a 2026 survey of enterprise IT leaders commissioned by storage vendor Cloudian, 79% of respondents said they had already moved at least some AI workloads off public cloud.

  • 79%: Already moved some AI workloads off public cloud
  • 73%: Plan to shift further to on-prem or hybrid by 2028

Source: 2026 survey of enterprise IT leaders commissioned by Cloudian.

The drivers are consistent across verticals:

  • Predictable cost: 40% of enterprises report cloud AI spending exceeds initial projections, per the same Cloudian-commissioned survey
  • Latency: 75% of survey respondents identified workloads that require on-prem deployment for acceptable latency
  • Data sovereignty: Region controls in public cloud satisfy data residency, but a US-parented provider remains within reach of US law under the CLOUD Act regardless of where data sits. For workloads governed by the European Union (EU) AI Act, the Health Insurance Portability and Accountability Act (HIPAA), or the General Data Protection Regulation (GDPR), that unresolved legal exposure pushes sensitive inference toward infrastructure the organization controls
  • GPU supply: When hyperscalers absorb most of NVIDIA's allocation, mid-market buyers pay the spread on the spot market

For sustained inference, the right answer depends on workload pattern, regulatory exposure, and egress cost at scale. A CEO's view of these tradeoffs lives in Arc Compute's earlier piece on AI inference economics, and the rise of Private AI Cloud as the default path for enterprises moving off public infrastructure is covered in our analysis of the current repatriation wave.

A useful internal test: track the cost per million tokens served from your own infrastructure against the same workload billed to a hyperscaler. Once the gap holds above 40% across two quarters, the financial case for ownership rarely reverses.
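The internal test described above can be sketched as a small tracking routine. It assumes the gap is measured relative to the hyperscaler figure (the text does not say which side is the baseline), and the function names and sample figures are hypothetical.

```python
def cost_per_million_tokens(total_cost_usd: float, tokens_served: float) -> float:
    """Fully loaded cost divided by tokens served, expressed per million tokens."""
    return total_cost_usd / (tokens_served / 1_000_000)


def gap_vs_cloud(owned_cpm: float, cloud_cpm: float) -> float:
    """Fraction by which owned cost undercuts cloud cost (assumed baseline: cloud)."""
    return (cloud_cpm - owned_cpm) / cloud_cpm


def ownership_case_holds(quarterly_gaps, threshold: float = 0.40, quarters: int = 2) -> bool:
    """True when the most recent `quarters` gaps all exceed the threshold."""
    recent = list(quarterly_gaps)[-quarters:]
    return len(recent) == quarters and all(g > threshold for g in recent)


if __name__ == "__main__":
    # Hypothetical quarter: $120,000 to serve 400 billion tokens in-house,
    # against $0.55 per million tokens billed by a hyperscaler.
    owned = cost_per_million_tokens(120_000, 400e9)
    gaps = [0.31, gap_vs_cloud(owned, 0.55), 0.47]
    print(ownership_case_holds(gaps))
```

Keeping the per-quarter gaps in a list makes the "two quarters" condition explicit: a single good quarter does not trip the test.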

How should mid-market teams plan for Vera Rubin readiness?

Mid-market teams should plan for Vera Rubin readiness on three parallel tracks: facility upgrades for 100% liquid cooling at densities up to 600 kW per rack, fabric sequencing for 800G Ethernet, and procurement timing that recognizes both the 9 to 12 month lead time from order to deployment and HBM4 memory supply that is already constrained across every supplier.

Teams ordering early are positioned for the first allocation waves. Later orders push further out. For most mid-market data centers, three planning conversations need to happen now, before first shipments land:

  • Facility readiness. Can your existing room handle 190 to 230 kW racks [3] with direct liquid cooling? If not, what is the path to a liquid-ready zone or a colocation partner?
  • Architecture sequencing. Most mid-market teams will run NVIDIA HGX B200 and HGX B300 systems through the current cycle, then layer Vera Rubin in for next-generation agentic models. Planning a sequenced fleet matters more than chasing the newest silicon.
  • Procurement timing. Cloud can pivot faster but pays the premium for it. Teams that order 9 to 12 months ahead of need land owned capacity on schedule. Later decisions queue behind earlier ones.
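The 9 to 12 month lead time can be turned into a concrete ordering window by working backward from the date capacity is needed. This is a minimal sketch using only the Python standard library; the helper names and the example need-by date are hypothetical, and actual lead times depend on the OEM and on HBM4 supply.

```python
import calendar
from datetime import date


def subtract_months(d: date, months: int) -> date:
    """Step back a whole number of months, clamping to the month's last day."""
    index = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def order_window(need_by: date, min_lead_months: int = 9, max_lead_months: int = 12):
    """Return (conservative order date, latest order date) for a need-by date."""
    return (
        subtract_months(need_by, max_lead_months),
        subtract_months(need_by, min_lead_months),
    )


if __name__ == "__main__":
    # Hypothetical target: capacity in production by the end of June 2027.
    conservative, latest = order_window(date(2027, 6, 30))
    print(f"order between {conservative} and {latest}")
```

Ordering by the conservative date assumes the full 12-month lead time; the later date only works if the 9-month end of the range holds.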

Cloud vs on-prem decision factors for the Vera Rubin era

| Factor | Public cloud | On-prem / Private AI Cloud |
| --- | --- | --- |
| Best for | Bursty, experimental workloads | Sustained inference above 40% utilization |
| Cost trajectory | OpEx, unpredictable above scale | CapEx, 24 to 36 month breakeven |
| Vera Rubin access | Behind hyperscaler forward orders | Direct OEM allocation when planned 9 to 12 months ahead |
| Latency | Variable, region-dependent | Deterministic, low jitter |
| Data sovereignty | Residency solvable via region controls; provider stays under US jurisdiction | Full jurisdictional control, air-gapped options |
| Egress cost | Charged per GB | None |

Sources: 2026 enterprise survey commissioned by Cloudian, Arc Compute deployment data.

Arc Compute has covered these tradeoffs in detail in our Blackwell, Hopper, or Wait for Vera Rubin buying guide and our analysis of why liquid cooling is becoming a structural requirement above 50 kW per rack.

How does Private AI Cloud fit into a Vera Rubin-era strategy?

Private AI Cloud fits a Vera Rubin-era strategy as the inference anchor. It pairs the ownership economics and data control of on-premises infrastructure with the consumption model of cloud. Meta is following the same hybrid pattern at hyperscaler scale: in February 2026 it announced a multiyear, multigenerational commitment to millions of NVIDIA Blackwell and Vera Rubin GPUs across on-premises and cloud, backed by a capital expenditure forecast of $125 billion to $145 billion. [4] Few signals say more clearly that pure cloud is rarely the whole answer at production scale.

What can AI startups learn from Meta's hybrid playbook?

AI startups can learn from Meta’s hybrid playbook that pure cloud becomes hard to justify at production scale. With inference now 80% to 90% of compute, the cost curve favors ownership once monthly GPU spend crosses seven figures, traffic becomes sustained rather than bursty, and regulated customers start asking about data residency.

Three signals that an AI startup has outgrown public cloud:

  • Monthly cloud GPU spend has crossed seven figures and is climbing month over month
  • Inference traffic is sustained rather than bursty (peaks are within 2x of trough)
  • Regulated customers in finance, healthcare, or EU markets are asking about data residency during onboarding

When any two of those signals hit at the same time, the conversation shifts from "can we afford on-prem?" to "can we afford not to own this?" The practical sequence for founders: keep training and burst capacity in the public cloud, anchor sustained inference on dedicated infrastructure, and let the cap table benefit from the longer asset life.
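The two-of-three rule above can be written as a simple check. The sketch below reads "seven figures" as at least $1,000,000 per month and "within 2x of trough" as a peak-to-trough ratio of 2 or less; the class and field names are hypothetical, and the thresholds are the article's rules of thumb rather than hard limits.

```python
from dataclasses import dataclass


@dataclass
class StartupSignals:
    monthly_gpu_spend_usd: float       # current monthly cloud GPU bill
    spend_rising: bool                 # climbing month over month
    peak_rps: float                    # peak inference requests per second
    trough_rps: float                  # trough inference requests per second
    residency_questions: bool          # regulated customers asking at onboarding


def signals_hit(s: StartupSignals) -> dict:
    """Evaluate each of the three signals independently."""
    return {
        "spend": s.monthly_gpu_spend_usd >= 1_000_000 and s.spend_rising,
        "sustained": s.trough_rps > 0 and s.peak_rps / s.trough_rps <= 2.0,
        "residency": s.residency_questions,
    }


def outgrown_public_cloud(s: StartupSignals) -> bool:
    """True when at least two of the three signals hold at the same time."""
    return sum(signals_hit(s).values()) >= 2


if __name__ == "__main__":
    example = StartupSignals(1_200_000, True, 900.0, 600.0, False)
    print(signals_hit(example), outgrown_public_cloud(example))
```

Returning the individual signals alongside the verdict keeps the conversation grounded in which condition actually tripped.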

Arc Compute has framed this decision tree in our practical framework for AI startups thinking about moving off the cloud.

Planning the next 18 months

Vera Rubin lands at a moment when mid-market AI teams are being asked to deliver production results on infrastructure budgets that were never designed for sustained inference. The right question is not whether to upgrade. It is how to sequence Blackwell, Vera Rubin, and hybrid capacity into an architecture that lasts more than one model generation.

For teams working through that sequencing now, the Beyond Blackwell: Rubin Readiness Playbook walks through facility, fabric, and procurement timing decisions in depth. And because Rubin allocation will reward whoever planned earliest, the sequencing itself is worth pressure-testing with people who deploy these systems.

Arc Compute’s GPU experts can map facility readiness and order timing against your actual workload mix before the allocation window narrows.

The hyperscalers have already placed their bets. Mid-market AI buyers have a narrow window to plan theirs.

Sources

About the Author
Josh Gelata
Infrastructure Lead
Arc Compute

Josh leads infrastructure planning and delivery at Arc Compute, working with enterprise data centers, sovereign clouds, and AI labs to plan and deploy GPU systems that move from purchase order to production workload on real-world timelines.
