
NVIDIA H100 80GB (Hopper) Deep-Dive


The NVIDIA H100 80GB, built on the Hopper architecture, is a powerful datacenter GPU designed to accelerate large-scale AI training and inference, as well as HPC and data analytics. It features advanced technologies like fourth-generation Tensor Cores, a Transformer Engine with native FP8 support, and high-bandwidth NVLink for fast GPU communication. The SXM5 model offers ~3 TB/s memory bandwidth with HBM3, while the PCIe variant delivers ~2 TB/s with HBM2e and a 350 W power envelope. The H100 sets new performance records on MLPerf Training v4.0, showcasing excellent scalability and efficiency, making it ideal for large language models, generative AI, recommenders, bioinformatics, and multi-tenant inference.

Architectural overview

NVIDIA’s Hopper architecture is purpose-built to accelerate transformers and next-generation AI workloads. It introduces a dedicated Transformer Engine tightly integrated with fourth-generation Tensor Cores to dynamically mix FP8 and FP16 compute, automatically handling casting and scaling on a per-layer basis. This yields step-function speedups on LLMs—up to 9× faster training and up to 30× faster inference versus the prior-generation A100—while maintaining accuracy where it matters most.

Under the hood, Hopper augments the memory hierarchy and interconnect: the H100 80GB SXM5 delivers HBM3 and roughly 3 TB/s of bandwidth, and fourth-generation NVLink provides up to 900 GB/s of total bidirectional GPU-to-GPU bandwidth, minimizing all-reduce delays and pipeline stalls during multi-GPU training. The architecture also adds DPX instructions to accelerate dynamic programming kernels common in genomics and path planning, delivering up to 7× speedups over A100 for workflows like Smith–Waterman and Floyd–Warshall. Finally, second-generation Multi-Instance GPU (MIG) increases per-instance compute and memory throughput while enabling confidential computing isolation down to the MIG slice.


Product Configurations: H100 80GB SXM5 vs. PCIe and NVL Variants

The H100 80GB comes in two primary form factors that target different deployment constraints. The SXM5 module is the performance leader for scale‑up training, with HBM3 memory and NVLink/NVSwitch integration across 4–8 GPUs in HGX nodes, enabling up to 900 GB/s of per‑GPU interconnect bandwidth to peers. The PCIe card favors broad server compatibility and power envelopes, using HBM2e memory and PCIe Gen5 x16 connectivity; it also supports three NVLink bridges for select multi‑GPU topologies. It is important to distinguish the H100 NVL variant, a dual‑GPU PCIe solution optimized for LLM inference with 94 GB of HBM3 per GPU and NVLink bridging—this is not the standard 80 GB SKU and is targeted at high‑capacity inference deployments.

Core Specifications That Matter for Builders

At the silicon level, Hopper scales the SM array and memory systems for high‑throughput linear algebra and collective communication. The full GH100 die contains 144 SMs; the production H100 SXM5 enables 132 SMs, and the H100 PCIe variant 114 SMs. The SXM5 80GB variant pairs those SMs with HBM3 across five stacks to deliver approximately 3,000 GB/s of bandwidth, whereas the PCIe 80GB card uses HBM2e delivering approximately 2,000 GB/s.

For multi‑GPU systems, Hopper’s NVLink v4 increases link count and bandwidth to 18 links per GPU for an aggregate 900 GB/s, substantially raising the ceiling for all‑reduce and model‑parallel transfers. On the PCIe card, the board TDP is specified at 350 W and the interface is PCIe Gen5 x16 (with support for Gen5 x8/Gen4 x16), enabling deployment in mainstream Gen5 servers while retaining optional NVLink bridge support.
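To make those bandwidth figures concrete, a quick roofline-style calculation shows how compute-dense a kernel must be before HBM stops being the limiter. This is a back-of-the-envelope sketch: only the bandwidth values come from the text above, and the peak FP8 throughput figures are assumptions (approximate public numbers), not specifications quoted in this article.

```python
# Roofline-style check: how many FLOPs per byte a kernel must sustain before
# HBM bandwidth stops being the limiter. Peak FP8 throughput values are
# assumptions (approximate public figures); bandwidths follow the text above.

def min_arithmetic_intensity(peak_flops: float, mem_bw_bytes_per_s: float) -> float:
    """Minimum FLOPs per byte moved for a kernel to be compute-bound."""
    return peak_flops / mem_bw_bytes_per_s

configs = {
    "H100 SXM5 (HBM3, ~3 TB/s)": (1.98e15, 3.0e12),   # assumed dense FP8 peak
    "H100 PCIe (HBM2e, ~2 TB/s)": (1.5e15, 2.0e12),   # assumed dense FP8 peak
}

for name, (flops, bw) in configs.items():
    ai = min_arithmetic_intensity(flops, bw)
    print(f"{name}: ~{ai:.0f} FP8 FLOPs per byte moved to stay compute-bound")
```

The takeaway is that even at ~3 TB/s, large GEMMs need several hundred FLOPs of reuse per byte, which is why the memory system and Tensor Core precision (FP8) have to advance together.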

H100 80GB Variants and NVL Context

The NVIDIA H100 80GB GPU comes in multiple configurations optimized for different use cases, from peak training performance to flexible deployment and high-density inference. The table below summarizes the key specifications and target applications for each variant.

| Variant | SMs | Memory | Memory Bandwidth | Interconnect | Board Power (TDP) | Target Use Case |
|---|---|---|---|---|---|---|
| H100 SXM5 80GB | 132 | 80 GB HBM3 | ~3 TB/s | NVLink v4, up to 900 GB/s via NVSwitch | — | Peak training throughput, HGX nodes |
| H100 PCIe 80GB | 114 | 80 GB HBM2e | ~2 TB/s | PCIe Gen5 x16; supports 3 NVLink bridges | 350 W | Broad compatibility, flexible topologies |
| H100 NVL | — | 94 GB HBM3 per GPU | — | NVLink bridge (dual-GPU PCIe card) | — | Large LLM inference density on PCIe platforms |


Transformer Engine and FP8: why it changes the LLM efficiency curve

The Transformer Engine is Hopper’s hallmark for AI: it continuously adapts precision to maximize throughput while maintaining accuracy, shifting between FP8 and FP16 within and across layers. FP8 support in H100 Tensor Cores includes two formats—E4M3 (wider mantissa for precision) and E5M2 (wider exponent for range)—and these can be mixed so, for example, forward activations and weights use E4M3 while gradients use E5M2.

NVIDIA’s Transformer Engine offers an AMP-style API for PyTorch and a framework-agnostic C++ API, managing delayed scaling, calibration windows, and per-tensor strategies to ensure stable optimization. These capabilities are central to the reported up to 9× training and up to 30× inference speedups on large transformers versus A100, especially as sequence lengths and parameter counts expand.
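As a minimal sketch of what that API looks like in practice, the snippet below runs a single layer under Transformer Engine’s FP8 autocast with a delayed-scaling recipe. It assumes the transformer_engine package is installed on an H100 host; the layer sizes and recipe arguments are illustrative choices, not tuned or recommended values.

```python
# Minimal sketch (not a full training loop): one Transformer Engine layer run
# under FP8 autocast with a delayed-scaling recipe. Assumes transformer_engine
# is installed on an H100 host; sizes and recipe arguments are illustrative.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID = E4M3 for forward activations/weights, E5M2 for backward gradients.
fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,
    amax_history_len=16,
    amax_compute_algo="max",
)

layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

y.float().sum().backward()  # gradients flow back through the FP8 path
```

Because the scaling factors are tracked per tensor from a rolling amax history, existing FP16/BF16 training scripts can usually adopt FP8 layer by layer rather than all at once.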

Multi-Instance GPU (MIG) on H100

H100’s second-generation MIG enables up to seven fully isolated GPU instances per card, each with dedicated memory, cache, and compute, plus improved performance monitoring and hardware-backed confidential computing. For multi‑tenant inference, MIG can dramatically increase utilization by right‑sizing instances to model footprints and SLOs while isolating workloads. Relative to A100, Hopper’s MIG delivers roughly 3× more compute capacity per instance and nearly 2× more memory bandwidth per instance, making fractional GPU serving of LLMs, diffusion models, or streaming recommenders much more efficient in shared clusters.
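For illustration, here is a small sketch that inspects whatever MIG instances already exist on a GPU using the NVML Python bindings (pynvml). It assumes an administrator has already enabled MIG mode and created the instances (for example via nvidia-smi); the snippet only enumerates them, and error handling is kept minimal.

```python
# Sketch: enumerate existing MIG instances on GPU 0 with pynvml (nvidia-ml-py).
# Assumes MIG mode is already enabled and instances were created by an admin.
import pynvml

pynvml.nvmlInit()
try:
    dev = pynvml.nvmlDeviceGetHandleByIndex(0)
    current_mode, _pending = pynvml.nvmlDeviceGetMigMode(dev)
    if current_mode == pynvml.NVML_DEVICE_MIG_ENABLE:
        max_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(dev)
        for i in range(max_count):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(dev, i)
            except pynvml.NVMLError:
                continue  # this MIG slot is not populated
            uuid = pynvml.nvmlDeviceGetUUID(mig)
            mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
            print(f"MIG {i}: {uuid}, {mem.total / 2**30:.1f} GiB")
    else:
        print("MIG is not enabled on this GPU")
finally:
    pynvml.nvmlShutdown()
```

The returned MIG UUIDs are what schedulers and container runtimes pin workloads to, which is how per-tenant right-sizing described above is put into practice.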

NVLink v4 and memory bandwidth

In multi-GPU training, runtime can be heavily impacted by collective operations and activation transfers. The H100 80GB’s Hopper architecture addresses the interconnect bottleneck via fourth-generation NVLink with 18 links per GPU, reaching an aggregate 900 GB/s per GPU—about 7× PCIe Gen5 bandwidth. This significantly improves data-parallel all-reduce and model/pipeline parallelism, where large activation tensors are exchanged frequently between GPUs.

In parallel, the SXM5 variant of the H100 80GB features an HBM3 memory system providing about 3 TB/s of bandwidth, keeping tensor cores fully fed during wide GEMMs and attention blocks at long sequence lengths. The PCIe variant’s HBM2e provides about 2 TB/s of bandwidth, making it well-suited for mainstream use cases and inference workloads. Together, these capabilities enable the strong near-linear scaling behavior observed in large-scale MLPerf runs.
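A simple way to see this interconnect headroom in practice is an NCCL all-reduce micro-benchmark; a hedged sketch using torch.distributed follows. NCCL routes in-node traffic over NVLink/NVSwitch when available. The buffer size, iteration counts, and the script file name in the launch comment are arbitrary choices, and the reported “bus bandwidth” uses the standard ring all-reduce accounting.

```python
# Hedged micro-benchmark sketch: time NCCL all-reduce across local GPUs.
# Launch (file name assumed): torchrun --nproc_per_node=8 allreduce_bench.py
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

x = torch.randn(64 * 1024 * 1024, device="cuda")  # 64M FP32 elements (~256 MB)

for _ in range(5):                                 # warm-up iterations
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
avg_s = (time.perf_counter() - t0) / iters

# Ring all-reduce moves ~2*(N-1)/N of the buffer per GPU ("bus bandwidth" accounting).
n = dist.get_world_size()
bus_bytes = 2 * (n - 1) / n * x.numel() * x.element_size()
if dist.get_rank() == 0:
    print(f"avg {avg_s * 1e3:.2f} ms/iter, ~{bus_bytes / avg_s / 1e9:.0f} GB/s bus bandwidth")
dist.destroy_process_group()
```

Running the same sketch on NVLink-connected SXM5 nodes versus PCIe-only servers makes the gap between the two interconnect paths directly measurable.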

MLPerf Training v4.0: objective signals of performance and scale

H100 systems set multiple records in the latest MLPerf Training v4.0 round. Using 11,616 H100 GPUs, NVIDIA completed training of GPT-3 175B in only 3.4 minutes, showcasing almost perfect scaling efficiency. At a more moderate 512‑GPU scale, time‑to‑train fell by 27% year‑over‑year to under an hour, with per‑GPU utilization reported at 904 TFLOP/s. Fine‑tuning Llama 2 70B with LoRA completed in just over 28 minutes on a single DGX H100 (8 GPUs) and in 1.5 minutes at 1,024 GPUs.

For Stable Diffusion v2 training, H100 delivered up to 80% higher throughput at 1,024‑GPU scale versus the prior submission at the same scale. And H100 set a new record on a large‑scale graph neural network workload with a 1.1‑minute time‑to‑train at 512 GPUs. These results highlight both raw performance and the production maturity of the software stack.

Training cutting-edge LLMs: pre-training and post-training

For training very large transformers, H100 SXM5’s combination of HBM3 bandwidth and NVLink/NVSwitch fabric reduces communication overheads and maximizes math unit occupancy. The Transformer Engine further raises throughput by pushing most matmuls to FP8 with per‑layer adaptive scaling, while preserving accuracy in sensitive paths with FP16. In practice, these innovations underpin MLPerf results like GPT‑3 pre‑training in minutes at massive scale, and large‑model fine‑tuning completing within minutes on 1k‑GPU scale, illustrating both performance and elasticity across scales of hardware. Teams building GPT‑J/NeoX, Llama‑style, or mixture‑of‑experts architectures can expect substantial step‑ups moving from A100 80 GB to H100 80 GB.

High‑throughput inference for LLMs and multi‑tenant serving

On the inference side, H100’s FP8 path and fourth‑gen Tensor Cores speed up LLM decoding and speculative-decoding flows, particularly when combined with MIG for running many smaller sessions or with NVLink-bridged pairs to handle larger context windows. The NVL 94GB variant is purpose‑built for PCIe platforms that need maximum per‑GPU memory for large LLMs while retaining NVLink between the pair. In secured multi‑tenant environments, Hopper’s MIG adds confidential computing at the instance level, isolating tenants and enabling safe GPU sharing without compromising telemetry. This is critical for cost‑optimized SaaS LLM endpoints.

Diffusion Models and Generative Vision

Stable Diffusion v2 training achieved up to an 80% performance boost at the same 1,024-GPU scale compared to earlier runs, driven by compiler, runtime, and kernel optimizations along with Hopper’s Transformer Engine and bandwidth gains.  For production inference of diffusion pipelines, MIG can carve H100 into multiple isolated instances to serve diverse workloads concurrently while maintaining latency SLOs—an efficient pattern for image generation APIs and A/B testing across model variants.

Bioinformatics, drug discovery, and dynamic programming (DPX)

Hopper introduces DPX instructions that accelerate dynamic programming—the computational core of algorithms like Smith–Waterman for sequence alignment and Floyd–Warshall for shortest paths. NVIDIA reports up to 7× speedups versus A100, translating into reduced time to insight for genome annotation, antibody design, or route optimization in robotics and logistics. Combined with MIG, bioinformatics services can hard‑partition GPUs for concurrent pipelines—e.g., alignment, variant calling, and ML‑based annotation—under strict isolation.
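To make the dynamic-programming connection concrete, the reference recurrence for Smith–Waterman local alignment is sketched below in plain NumPy. This is only an illustration of the data dependencies that DPX-class hardware accelerates, not a DPX or GPU implementation, and the scoring parameters are arbitrary.

```python
# Reference Smith–Waterman local alignment score (linear gap penalty), shown to
# illustrate the dynamic-programming recurrence that DPX instructions accelerate
# on Hopper; this is plain NumPy, not a DPX or GPU implementation.
import numpy as np

def smith_waterman_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    H = np.zeros((len(a) + 1, len(b) + 1), dtype=np.int32)
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i, j] = max(0,
                          H[i - 1, j - 1] + sub,   # match / mismatch
                          H[i - 1, j] + gap,       # gap in sequence b
                          H[i, j - 1] + gap)       # gap in sequence a
            best = max(best, H[i, j])
    return best

print(smith_waterman_score("GATTACA", "GCATGCU"))
```

Each cell depends on its three neighbors through max-plus arithmetic, which is exactly the pattern DPX instructions target; the same recurrence shape appears in Floyd–Warshall and many alignment variants.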

Graph neural networks and graph analytics

In MLPerf v4.0, H100 set a new record for large-scale GNN training, completing in just 1.1 minutes using 512 GPUs. The higher memory bandwidth and faster inter‑GPU communication reduce the usual bottlenecks in neighbor sampling and message passing, and the Transformer Engine can benefit transformer‑based GNN variants for molecular property prediction and social‑graph modeling.
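For context on where that bandwidth goes, the snippet below sketches a single mean-aggregation message-passing step in plain PyTorch; the gather and scatter-add are exactly the memory-bound operations that benefit from HBM3 and fast interconnect. The graph and feature sizes are toy values for illustration only.

```python
# Tiny message-passing step (mean aggregation over in-neighbors) in plain
# PyTorch, illustrating the gather/scatter traffic that dominates GNN training.
# Graph structure and feature sizes are toy, illustrative values.
import torch

num_nodes, dim = 5, 8
x = torch.randn(num_nodes, dim)                       # node features
edge_src = torch.tensor([0, 1, 2, 3, 4, 0])           # messages flow src -> dst
edge_dst = torch.tensor([1, 2, 3, 4, 0, 2])

messages = x[edge_src]                                # gather (bandwidth-bound)
agg = torch.zeros_like(x).index_add_(0, edge_dst, messages)          # scatter-add
deg = torch.zeros(num_nodes).index_add_(0, edge_dst, torch.ones(edge_dst.numel()))
out = agg / deg.clamp(min=1).unsqueeze(-1)            # mean over in-neighbors
print(out.shape)
```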

Recommender systems and tabular workloads

Modern recommender systems depend on large embedding tables and transformer blocks for re-ranking, which puts significant pressure on memory bandwidth and interconnect during training. H100’s HBM3 bandwidth on SXM5 and NVLink collectives help mitigate hot‑spotting and balance embedding all‑reduce traffic. At inference, FP8 matmuls and MIG partitioning allow service operators to scale QPS while right‑sizing instances to model memory footprints and latency bins—often boosting effective utilization in shared clusters.

Software stack: from FP8 recipes to cluster scale

The practical route to FP8 on H100 is NVIDIA’s Transformer Engine library, which supplies highly optimized building blocks and an AMP‑like API so teams can drop FP8 into PyTorch with minimal refactoring. The library exposes recipes for delayed scaling, calibration windows, and per‑tensor precision choices to maintain model quality. At cluster scale, NVIDIA’s distributed training stack and networking pair with NVLink/NVSwitch to unlock near‑linear scaling on transformer workloads, as reflected in the 11,616‑GPU GPT‑3 result. For production environments, the second‑generation MIG integrates with orchestration platforms to manage fractional GPU scheduling with native isolation and telemetry.

Deployment patterns and platform choices

  • SXM5 in HGX servers for scale‑up training: Use cases that demand maximum performance per node—e.g., pre‑training 70B+ parameter LLMs or long‑sequence models—benefit from SXM5 boards interconnected via NVLink/NVSwitch. The 900 GB/s NVLink bandwidth per GPU reduces latency and congestion during frequent collective operations and activation exchanges, enabling larger global batch sizes with minimized pipeline stalls.
  • PCIe support for versatile deployments: Ideal for mixed workloads and gradual upgrades, featuring 350 W TDP and Gen5 x16 interface. The H100 PCIe’s three NVLink bridges support local GPU‑GPU links in select chassis, while the HBM2e subsystem still provides about 2 TB/s bandwidth for strong inference and training throughput in mainstream servers. This approach suits enterprises transitioning from A100 PCIe setups or expanding inference capabilities alongside CPU-intensive services.
  • Grace Hopper (GH200) for CPU‑GPU coupled analytics: NVIDIA’s Grace Hopper superchip marries a Grace CPU with a Hopper GPU, targeting tightly coupled analytics and AI. While not an H100 80GB board SKU, it is part of the broader Hopper platform strategy for memory‑coherent CPU‑GPU designs and is relevant when evaluating long‑term infrastructure roadmaps around Hopper‑class accelerators.

Practical guidance: sizing, reliability, and isolation

  • Model sizing and memory: With 80 GB per GPU, many 7B–13B class models train and infer comfortably without ZeRO‑offload; larger models typically rely on tensor/pipeline parallelism and activation checkpointing (see the sizing sketch after this list). The SXM5’s HBM3 bandwidth helps maintain arithmetic intensity for long contexts and many heads; the NVL 94 GB option exists to increase per‑GPU capacity in PCIe environments dedicated to LLM inference.
  • Communication patterns: For data‑parallel training, leverage gradient compression only if convergence remains stable; Hopper’s NVLink bandwidth substantially reduces the need for aggressive compression in‑node. For model‑parallel pipelines, partition to minimize cross‑stage tokens per step and exploit NVLink paths; in PCIe fleets, consider topology‑aware placement and, where supported, NVLink‑bridged pairs.
  • MIG isolation and confidential computing: In regulated or multi‑tenant environments, use H100’s MIG to hard‑partition GPUs and enable per‑instance trusted execution, ensuring isolation of memory, cache, and acceleration engines; this pairs well with per‑tenant observability via the new instance‑level performance monitors.
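As referenced in the model-sizing bullet above, the sketch below gives a rough per-GPU training-memory estimate from parameter count alone. It assumes BF16 weights and gradients plus FP32 Adam moments, and ignores activations and any optimizer-state sharding, so treat the outputs as planning upper bounds rather than measurements.

```python
# Rough per-GPU training-memory estimate (weights + gradients + Adam states),
# excluding activations. Assumes BF16 weights/grads and FP32 Adam moments with
# no optimizer-state sharding; tensor/pipeline parallelism divides the share.

def training_bytes_per_gpu(n_params: float, tp: int = 1, pp: int = 1) -> float:
    shard = n_params / (tp * pp)          # parameters held by one GPU
    weights = 2 * shard                   # BF16 weights
    grads = 2 * shard                     # BF16 gradients
    optim = 8 * shard                     # FP32 Adam first and second moments
    return weights + grads + optim

for n in (7e9, 13e9, 70e9):
    gib = training_bytes_per_gpu(n) / 2**30
    print(f"{n / 1e9:.0f}B params, no sharding: ~{gib:.0f} GiB before activations")
```

Dividing by the tensor/pipeline-parallel degree (the tp and pp arguments) shows quickly how many 80 GB GPUs a given model needs before activation memory is even considered.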

Competitive and generational context

Relative to A100, Hopper’s introduction of the Transformer Engine and FP8 path is the dominant differentiator for transformers, with published claims of up to 9× training and up to 30× inference speedups on large LLMs. NVLink bandwidth per GPU rises to 900 GB/s and link count to 18, compounding scaling gains. MIG capacity and isolation are also stronger on H100.

These changes show up not just in microbenchmarks but also in standardized MLPerf results, where H100 led across workloads and scales. As organizations plan refresh cycles, these generational improvements directly translate into fewer nodes for a fixed SLA or larger feasible model sizes at a given time‑to‑train target.

Risks, constraints, and when to choose each form factor

The SXM5 H100 80GB is the right choice when maximum per‑node throughput and inter‑GPU bandwidth are paramount—e.g., frontier LLM pre‑training or ultra‑long‑context fine‑tuning. However, it requires compatible HGX platforms and higher‑power thermal envelopes. The PCIe H100 80GB is ideal for heterogeneous fleets and incremental upgrades, particularly for high‑throughput inference and mixed workloads, with the option to NVLink‑bridge local pairs.

If your primary objective is high‑capacity LLM inference on PCIe servers, consider the H100 NVL (94 GB per GPU and an NVLink bridge) to reduce KV cache paging and improve batch‑throughput at long context windows. In all cases, MIG should be part of the capacity planning for multi‑tenant and spiky traffic patterns.
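To illustrate why the extra per-GPU capacity matters, here is a back-of-the-envelope KV-cache calculation. The model shape used (80 layers, 8 grouped-query KV heads, head dimension 128) is an assumption in the style of a 70B-class model, not a figure taken from this article.

```python
# Back-of-the-envelope KV-cache sizing for LLM serving. Model shape is an
# assumed 70B-class configuration (80 layers, 8 GQA KV heads, head dim 128).

def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2 covers both K and V, per layer, per token.
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

gib = kv_cache_bytes(batch=32, seq_len=8192, n_layers=80,
                     n_kv_heads=8, head_dim=128, bytes_per_elem=2) / 2**30
print(f"~{gib:.0f} GiB of FP16 KV cache at batch 32 with 8k context")
```

At roughly 80 GiB of KV cache for this batch and context, the cache alone can rival the 80 GB SKU’s capacity before weights are counted, which is the pressure the 94 GB NVL variant and NVLink bridging are meant to relieve.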

Visual appendix: Transformer Engine and Hopper highlights

DPX accelerators for dynamic programming

MLPerf GPT‑3 scaling on H100

Conclusion

The NVIDIA H100 80GB significantly advances AI training and inference with its cutting-edge architecture, featuring the Transformer Engine with FP8, fourth-generation Tensor Cores, NVLink v4, and HBM3 memory bandwidth. System enhancements like second-generation MIG with confidential computing boost efficiency and security. In standard benchmarks such as MLPerf Training v4.0, the H100 delivers record-breaking performance and excellent scalability.

Choosing between the SXM5 and PCIe form factors depends on platform and performance needs—SXM5 excels in large-scale training, while PCIe offers flexibility and high-throughput inference. Overall, the H100 80GB is a robust, forward-looking platform ideal for teams working on large language models, generative models, bioinformatics pipelines, and graph neural networks.
