
NVIDIA H100 NVL 94GB
The NVIDIA H100 NVL 94GB is a high-performance, PCIe-based dual-GPU solution designed specifically for large language model (LLM) inference workloads. Each GPU is equipped with 94 GB of HBM3 memory, and the two units are interconnected via NVLink, offering a total of 188 GB HBM3—ideal for serving models with up to 70 billion parameters on PCIe servers.
Built on the Hopper architecture, the H100 NVL utilizes the Transformer Engine to accelerate FP8 precision, significantly improving throughput and energy efficiency. It also supports second-generation Multi-Instance GPU (MIG) technology, enabling secure and efficient multi-tenant deployments.
While the H100 SXM5 remains the top-tier choice for scale-up training, the H100 NVL is optimized for environments that demand maximum memory per GPU, low-latency GPU-to-GPU communication, and cost-effective LLM inference on standard PCIe platforms.
What Is the NVIDIA H100 NVL 94GB?
The NVIDIA H100 NVL 94GB is a purpose-built, dual-GPU solution designed to accelerate large language model (LLM) inference on standard PCIe servers. Each Hopper-based GPU includes 94 GB of HBM3 memory, and the pair is connected via NVLink, delivering a combined 188 GB of ultra-fast memory. This configuration is optimized for serving models up to 70 billion parameters, such as those in the LLaMA family, where memory bandwidth and low-latency inter-GPU communication are essential for maximizing throughput and minimizing cost per query.
Unlike SXM5-based systems that require NVSwitch backplanes or HGX motherboards, the H100 NVL brings many of the same benefits—like peer-to-peer NVLink traffic—to more accessible PCIe platforms. NVIDIA positions the H100 NVL as a high-throughput inference engine for generative AI workloads, enabling larger batch sizes and key–value caches without compromising latency or scalability.
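To see why 188 GB lines up with 70B-class models, the back-of-the-envelope sketch below estimates weight memory at FP16 and FP8 and the headroom left for KV cache and runtime overhead; the parameter count and bytes-per-parameter values are illustrative assumptions rather than measurements of any specific deployment.

```python
# Rough sizing sketch: weight memory of a ~70B-parameter model vs. the
# 188 GB HBM3 of an H100 NVL pair. Parameter count and precisions are
# illustrative assumptions, not measurements of a specific deployment.
params = 70e9  # ~70 billion parameters (a LLaMA-70B-class model)

for precision, bytes_per_param in [("FP16/BF16", 2), ("FP8", 1)]:
    weights_gb = params * bytes_per_param / 1e9
    headroom_gb = 188 - weights_gb  # left over for KV cache, activations, runtime
    print(f"{precision}: weights ~ {weights_gb:.0f} GB, "
          f"headroom on the 188 GB pair ~ {headroom_gb:.0f} GB")
```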
Key specifications of H100 NVL 94GB
Feature | Technical Specification |
---|---|
Product Type | Dual-GPU kit based on PCIe with NVLink interconnect |
Architecture | Hopper |
Memory per GPU | 94 GB HBM3 |
Total Memory (NVL Pair) | 188 GB HBM3 |
Form Factor | PCIe Gen5 |
Power Consumption (TDP) | 400 W |
NVLink Support | Yes (direct peer-to-peer communication between GPUs) |
MIG Support | Yes (Multi-Instance GPU for partitioning resources across multiple models) |
Primary Use Case | High-throughput inference for large language models (LLMs) up to ~70B parameters |
Hopper features that matter for NVL
Transformer Engine (FP8) and MIG
Hopper’s Transformer Engine is central to NVL’s efficiency for LLM inference. It dynamically chooses FP8 or FP16 precision per layer, automatically handling re-casting and scaling so that sensitive operations preserve accuracy while the bulk of matmuls run at FP8’s higher throughput. The FP8 path uses two formats: E4M3 for higher precision (commonly weights and activations) and E5M2 for larger dynamic range (often gradients). The library exposes AMP-like APIs for PyTorch and a C++ interface for other frameworks, letting operators hit higher QPS at lower latency for decoder-heavy inference. Second-generation MIG on Hopper enables up to seven fully isolated instances per GPU with stronger per-instance performance and confidential-computing isolation, which is valuable for multi-tenant LLM endpoints or mixed model fleets.
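As a minimal sketch of that FP8 path, the snippet below runs a single linear layer through Transformer Engine’s PyTorch API under an FP8 autocast; the layer dimensions are arbitrary placeholders, and the recipe settings should be tuned and validated against your own accuracy targets.

```python
# Minimal FP8 inference sketch using Transformer Engine's PyTorch API.
# Dimensions are placeholders; validate recipe settings for your model.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID pairs E4M3 (forward tensors) with E5M2 where more dynamic range is
# needed; DelayedScaling maintains a history of scaling factors per tensor.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda().eval()
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # the GEMM runs through the FP8 path on Hopper
```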
NVLink peer-to-peer on PCIe: why it helps LLM serving
The NVL kit’s NVLink bridge provides a low-latency, high-bandwidth path between the two GPUs, mitigating PCIe-only bottlenecks for activation and KV-cache exchanges during decoding. This is especially useful for long-context models and batch-concurrency scenarios where the working set spans both devices. While the fourth-generation NVLink fabric’s peak 900 GB/s per GPU figure applies to SXM5 systems with full link counts and NVSwitch, NVL’s bridged pair still benefits from a much faster peer-to-peer path than PCIe alone for typical serving topologies. This approach brings a subset of SXM’s interconnect advantages to mainstream PCIe servers, targeting real-world LLM latency and throughput needs.
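A quick way to sanity-check the peer-to-peer path on a deployed pair is sketched below; this is generic PyTorch rather than an NVL-specific API, and the actual link type (NVLink bridge vs. plain PCIe) should be confirmed with vendor tooling such as nvidia-smi topo -m.

```python
# Sanity-check peer-to-peer access between the two GPUs of the NVL pair and
# move an activation-sized block between them. Generic PyTorch sketch; the
# physical link type should be confirmed separately (e.g. `nvidia-smi topo -m`).
import torch

assert torch.cuda.device_count() >= 2, "expects both GPUs of the NVL pair visible"

print("GPU0 -> GPU1 peer access:", torch.cuda.can_device_access_peer(0, 1))

# With the NVLink bridge installed, this device-to-device copy avoids a
# round trip through host memory over PCIe.
block = torch.empty(1024, 1024, 128, dtype=torch.float16, device="cuda:0")
block_on_peer = block.to("cuda:1", non_blocking=True)
torch.cuda.synchronize()
```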
Serving patterns on H100 NVL 94GB
- Single large-model serving (paired GPUs): Map the model and its KV cache across the bridged NVL pair to exploit the combined 188 GB of HBM3 capacity. This reduces KV paging and supports longer context windows and/or larger batch sizes. The NVLink bridge lowers cross-device transfer latency relative to PCIe-only paths, improving tail latency. NVIDIA explicitly positions H100 NVL as “optimized for LLM inference” with 188 GB HBM3 for models around 70B parameters. A tensor-parallel serving sketch follows this list.
- Fractional GPU serving with MIG: Use MIG to carve each NVL GPU into multiple instances that align with your SKU mix (for example, 7B/13B models) and latency/QPS classes. Hopper’s second‑gen MIG provides stronger per-instance performance and confidential computing isolation, making it suitable for multi-tenant endpoints while maximizing utilization. Instance-level performance monitors help maintain SLOs as traffic mixes vary.
- FP8 inference path via Transformer Engine: Adopt FP8 for the heavy matmul paths using the Transformer Engine libraries. E4M3 and E5M2 can be assigned per tensor to balance dynamic range and precision, with delayed scaling and calibration features to protect accuracy. The result is a higher-throughput decode pipeline at lower latency and cost per QPS, especially for larger sequence lengths and beam/speculative decoding variants.
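As a concrete illustration of the first pattern, the sketch below shards a 70B-class model across the bridged pair with tensor parallelism. vLLM is used here only as an example serving stack; the model name, memory fraction, and sampling settings are placeholders, not a recommended configuration.

```python
# Sketch: serving a 70B-class model across both GPUs of an H100 NVL pair
# using tensor parallelism. vLLM is only an example engine; the model name
# and settings below are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # placeholder 70B-class checkpoint
    tensor_parallel_size=2,                  # shard weights and KV cache across the pair
    gpu_memory_utilization=0.90,             # leave headroom for activations and runtime
)

outputs = llm.generate(
    ["Explain why NVLink peer-to-peer helps multi-GPU inference."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```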
Comparing H100 versions
GPU Model | Memory & Type | Power (TDP) | Interconnect | Primary Use Case | Infrastructure Requirements |
---|---|---|---|---|---|
H100 NVL 94GB (Dual) | 94 GB HBM3 × 2 = 188 GB | 400 W (per GPU) | NVLink Bridge (PCIe) | High-capacity LLM inference (~70B models) | Compatible with PCIe Gen5 servers |
H100 PCIe 80GB | 80 GB HBM2e | 350 W | Optional NVLink | General-purpose training/inference | Broad PCIe Gen5 compatibility |
H100 SXM5 80GB | 80 GB HBM3 | ~700 W | NVLink + NVSwitch (~900 GB/s) | Peak training performance, scale-up AI | Requires HGX-class systems |
Operational guidance and tips
- Topology and placement: Keep the two NVL cards within the same host close to each other physically and thermally, ensure the NVLink bridge is properly installed, and confirm airflow is adequate for sustained load (400 W per card). Affinitize the model runtime to the bridged pair to avoid unintended PCIe hops.
- MIG sizing for latency and QPS: Start from your SLOs: low-latency interactive chat generally favors fewer, larger MIG slices; high-QPS batch endpoints can use more, smaller slices. Use the instance-level performance monitors (new in Hopper) for per-slice telemetry and automated scaling decisions.
- FP8 recipes and accuracy: Use Transformer Engine’s delayed scaling and per-tensor calibration windows to stabilize FP8 inference. Retain higher precision for numerically sensitive ops (e.g., logits/softmax) while pushing the main matmuls to FP8. Validate with your own prompts and data distribution to guard against drift.
- Capacity planning for KV cache: The NVL pair’s 188 GB of HBM3 provides headroom for larger context windows and concurrent sessions; budget the KV cache carefully (tokens × layers × heads × d_head) and test across your most common context lengths to avoid paging. NVIDIA explicitly frames NVL as “optimized for LLM inference” with 188 GB capacity for models up to ~70B. A worked sizing example follows this list.
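The worked example below applies that budget formula to an illustrative 70B-class configuration with grouped-query attention; the layer, head, context, and session counts are assumptions to be replaced with your model’s actual values.

```python
# KV-cache budgeting sketch:
#   bytes = 2 (K and V) * layers * kv_heads * d_head * bytes_per_elem
#           * tokens_per_session * concurrent_sessions
# Illustrative 70B-class GQA configuration; substitute your model's values.
layers, kv_heads, d_head = 80, 8, 128
bytes_per_elem = 2            # FP16/BF16 cache entries
context_tokens = 8192         # assumed context window per session
concurrent_sessions = 16      # assumed concurrent requests

per_token_bytes = 2 * layers * kv_heads * d_head * bytes_per_elem
total_gb = per_token_bytes * context_tokens * concurrent_sessions / 1e9

print(f"KV cache per token: {per_token_bytes / 1024:.0f} KiB")
print(f"Total for {concurrent_sessions} x {context_tokens}-token sessions: "
      f"{total_gb:.1f} GB")  # compare against headroom left after model weights
```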
Broader Hopper context: what carries over from training benchmarks
While NVL is positioned for inference, Hopper’s architecture is proven at extreme training scale. In MLPerf Training v4.0, H100 systems trained GPT‑3 (175B) to target in 3.4 minutes across 11,616 GPUs with near-linear scaling; a single DGX H100 fine-tuned Llama 2 70B (LoRA) in just over 28 minutes, and the same task dropped to 1.5 minutes at 1,024 GPUs. Stable Diffusion v2 training throughput rose as much as 80% at 1,024‑GPU scale over the prior submission. Although these results come from SXM5/HGX-class systems, they validate the maturity of Hopper’s end-to-end stack, which NVL inherits on the inference side.
Conclusion
H100 NVL 94GB is a targeted solution for LLM inference on PCIe platforms: two Hopper GPUs bridged by NVLink, each with 94 GB HBM3, for a combined 188 GB footprint that aligns with 70B‑class models and long contexts. It couples Hopper’s Transformer Engine (FP8) for high-throughput inference with second‑gen MIG for secure multi-tenancy. If you need peak training throughput on a per-node basis, H100 SXM5 is the right fit; if you need flexible PCIe integration and superior LLM serving capacity per host, H100 NVL is designed for exactly that job.
Source: Nvidia