RTX 4090 Crashes During Training: Power Supply or Thermal Issues?

Author: ITCT Tech Editorial Unit
Reviewer: Hardware Infrastructure Team
Last Updated: February 28, 2026
Read Time: 4 minutes
References:

  • ATX 3.0 Power Supply Design Guidelines
  • Nvidia System Management Interface (nvidia-smi) Documentation
  • PCI-SIG 12VHPWR Connector Specifications

Quick Answer

The Nvidia RTX 4090 is a powerhouse for AI training, but the sustained workloads it runs can cause mid-training system crashes. These disruptions generally stem from two primary bottlenecks: power delivery failures or thermal saturation. Unlike dynamic gaming workloads, machine learning training places a continuous, heavy load on the GPU. This can cause transient power spikes that trip older power supplies, or it can generate sustained heat that gradually overwhelms standard chassis cooling, leading to protective hardware shutdowns or driver failures.

To quickly stabilize your system, you must first identify the symptom. If your system experiences hard reboots or the screen goes black while fans spin at maximum speed, the issue is likely a tripped ATX 2.x power supply or a poorly seated 12VHPWR cable. Conversely, if you encounter CUDA errors, driver timeouts (TDR), or system lockups after several hours of operation, the culprit is usually thermal throttling at the VRAM or GPU hot spot. In both scenarios, the most effective immediate workaround is applying a strict power limit (e.g., 360W via nvidia-smi), which drastically reduces both thermal output and transient power spikes without significantly impacting training performance.


Nvidia’s GeForce RTX 4090 is widely seen as the top consumer GPU for AI and machine learning. With 24GB of GDDR6X memory and enormous Tensor throughput, it can train deep learning models extremely fast. Yet many users run into the same painful issue: the system crashes mid-training, often hours into an epoch. The display may suddenly go black, fans may jump to 100%, or the entire PC may reboot without warning.

When that happens, the troubleshooting usually comes down to two main suspects: power delivery problems or thermal saturation. AI workloads stress the 4090 in a different way than gaming, and understanding that difference is the key to stability.

The Power Delivery Problem (Often Behind Hard Reboots)

Deep learning training is a sustained, high-load scenario. Unlike games—where GPU power draw rises and falls depending on the scene—training keeps compute and memory heavily loaded for long periods. That’s where your PSU and cabling get tested the most.

Sustained draw + transient spikes

The RTX 4090’s rated Total Graphics Power (TGP) is 450W, but it can produce very fast transient spikes well above that (sometimes over 600W for tiny fractions of a second).

Some older ATX 2.x power supplies can interpret these spikes as an overcurrent event. Their protection circuitry (OCP) may react by shutting the system down instantly—resulting in a sudden reboot or black screen.

The 12VHPWR connector factor

The 4090 relies on the 12VHPWR power connector (or an adapter). This connection is extremely sensitive to:

  • incomplete seating (not fully clicked in)
  • sharp bends near the plug
  • poorly made adapters or cables

If the connection isn’t perfect, resistance increases—under heavy sustained load that can cause heat, voltage drop, instability, driver resets, or a full crash.
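A rough back-of-the-envelope calculation shows why even a small amount of extra contact resistance matters at these currents (the 10 mΩ figure below is an illustrative assumption, not a measurement):

```python
# Illustrative 12VHPWR contact-resistance math (assumed values, not measurements)
power_w = 450.0                 # sustained board power
volts = 12.0
current_a = power_w / volts     # ~37.5 A flowing through the connector

r_contact_ohm = 0.010           # assume 10 milliohms of extra contact resistance
heat_w = current_a ** 2 * r_contact_ohm   # I^2 * R loss concentrated at the plug

print(round(current_a, 1), round(heat_w, 1))   # 37.5 14.1
```

Roughly 14 watts dissipated inside a small plastic connector housing is enough to soften pins and worsen the contact further, which is why a fully seated, unstressed cable matters so much under sustained training loads.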

Typical power-related symptoms

  • instant reboot (as if power was cut)
  • screen goes black and system restarts
  • fans jump to max, then reboot or shutdown
  • crashes happen quickly under load, not only after hours

The Thermal Bottleneck (More Likely Behind Driver Timeouts / Lockups)

Thermals can also break long training runs—but they often fail “softer” than power issues.

AI training heats core + VRAM continuously

Training keeps VRAM and GPU hot for long stretches. Many PC cases are designed for gaming bursts, not workstation-style sustained heat output. Over hours, the case ambient temperature can rise, reducing cooler effectiveness.

The GPU may begin throttling heavily as it nears limits—especially on:

  • VRAM / memory junction
  • hot spot temperature

If heat can’t be removed fast enough, the driver may trigger a TDR (Timeout Detection and Recovery). In practice, this can look like:

  • CUDA errors
  • driver reset
  • training process crash
  • OS becoming unresponsive

Typical thermal-related symptoms

  • crash happens after long runtime (heat soak)
  • CUDA timeout / driver reset messages
  • gradual slowdown (throttling) before failure
  • system may not fully reboot, but the training job dies

A Practical Way to Identify the Culprit: Apply a Power Limit

A very effective diagnostic step is limiting GPU power. This does two things at once:

  1. reduces transient spikes (helps PSU/OCP issues)
  2. reduces heat output dramatically (helps thermal issues)

A common “sweet spot” is about 80% of the 4090’s 450W maximum, i.e. roughly 360W:

In many training workloads, performance barely drops because the 4090’s efficiency curve is strong—while stability improves a lot.

Linux command (before training)

# Limit the GPU to 360W to reduce spikes and thermals
sudo nvidia-smi -pl 360
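If you launch training runs from Python, the same cap can be applied and read back programmatically. This is a sketch, assuming nvidia-smi is on PATH and the user can run it via sudo; the helper names are our own. Note that nvidia-smi rejects values outside the card’s supported range (query power.min_limit and power.max_limit), and the limit resets at reboot, so reapply it before each session or via a startup unit.

```python
import subprocess


def clamp_limit(requested: int, min_w: int, max_w: int) -> int:
    """nvidia-smi rejects values outside the card's supported power range,
    so clamp the requested wattage into [min_w, max_w] first."""
    return max(min_w, min(requested, max_w))


def set_power_limit(watts: int = 360, gpu: int = 0) -> str:
    """Apply a power cap before training starts (needs root or sudo rights),
    then read the enforced limit back to confirm it took effect."""
    subprocess.run(
        ["sudo", "nvidia-smi", "-i", str(gpu), "-pl", str(watts)],
        check=True,
    )
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(gpu),
         "--query-gpu=power.limit", "--format=csv,noheader,nounits"],
        text=True,
    )
    return out.strip()  # enforced limit in watts, as reported by the driver
```

Calling `set_power_limit(clamp_limit(360, 150, 600))` at the top of a launcher script makes the cap part of the training run itself rather than something you have to remember to set by hand.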

If crashes disappear after setting a power limit, you’ve learned something important:

  • power delivery and/or thermals were likely the trigger
  • you can then decide whether to fix power (PSU/cables), cooling (airflow/fans), or both

What to Do Next (Targeted Fixes)

If you suspect power delivery

  • Use a high-quality ATX 3.0 PSU with a native 12VHPWR cable
  • Aim for 1000W or higher (more if the CPU is also power-hungry)
  • Ensure the connector is fully seated (flush, click engaged)
  • Avoid bending the cable sharply near the plug

If you suspect thermals

  • Log Memory Junction Temperature and Hot Spot during training
  • Improve case airflow (intake + exhaust balance)
  • Consider a more aggressive fixed fan curve during long runs
  • Reduce GPU power limit (often the best “no-hardware-change” fix)
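One way to capture that data is a small sidecar logger that polls nvidia-smi alongside the training process. This is a sketch: the field list is an assumption to adapt, and temperature.memory returns N/A on some consumer cards, where Windows tools such as HWiNFO expose the memory-junction sensor instead.

```python
import csv
import io
import subprocess
import time

# Fields to sample; adjust to what your driver exposes
FIELDS = "timestamp,temperature.gpu,temperature.memory,power.draw"


def parse_rows(csv_text: str):
    """Parse nvidia-smi output (--format=csv,noheader,nounits) into
    lists of stripped string fields, one list per GPU."""
    return [[field.strip() for field in row]
            for row in csv.reader(io.StringIO(csv_text))]


def sample_gpus():
    """Poll nvidia-smi once and return parsed rows."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={FIELDS}",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_rows(out)


def log_forever(period_s: float = 5.0):
    """Print one CSV line per GPU every period_s seconds; redirect to a file."""
    while True:
        for row in sample_gpus():
            print(",".join(row))
        time.sleep(period_s)
```

Run it as `python gpu_log.py >> thermals.csv` during a long job; if temperatures climb steadily for an hour before each crash, heat soak is your culprit.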

If you suspect driver stability issues

  • Prefer Nvidia Studio or Enterprise drivers over Game Ready for long compute sessions
  • Keep CUDA toolkit / driver versions consistent with your framework (PyTorch/TensorFlow) compatibility
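The rule of thumb is that the driver’s supported CUDA version (shown in the top-right of nvidia-smi output) must be at least the toolkit version your framework was built against; in PyTorch, `torch.version.cuda` reports the latter. A tiny helper makes the comparison explicit (the version strings in the comments are hypothetical examples):

```python
def cuda_versions_compatible(driver_cuda: str, framework_cuda: str) -> bool:
    """NVIDIA drivers are backward compatible: a framework built against a
    given CUDA toolkit needs a driver whose supported CUDA version is >= it."""
    def key(version: str):
        parts = version.split(".")
        return (int(parts[0]), int(parts[1]) if len(parts) > 1 else 0)
    return key(driver_cuda) >= key(framework_cuda)


# e.g. driver reports CUDA 12.4, PyTorch wheel built for cu121:
#   cuda_versions_compatible("12.4", "12.1") -> True
#   cuda_versions_compatible("11.8", "12.1") -> False (upgrade the driver)
```

A mismatch here tends to surface as initialization errors rather than mid-run crashes, but it is cheap to rule out before blaming hardware.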

Conclusion

RTX 4090 training crashes usually come from one of two places:

  • Hard reboot / black screen + fans maxing out often points to PSU protection trips or 12VHPWR connection issues.
  • Driver timeouts, CUDA errors, and gradual lockups are more often caused by thermal throttling, especially VRAM/memory junction heat during long, sustained runs.

The fastest path to stability is typically: verify power cabling/PSU quality + apply a sensible power limit + improve airflow. With those changes, the 4090 becomes far more reliable for long AI training sessions.


«In most scenarios involving immediate system shutdowns during deep learning model training, transient power spikes are tripping the overcurrent protection on older power supplies. Upgrading to a native ATX 3.0 unit usually resolves this instability.» — Hardware Infrastructure Team

«Extended AI workloads generate sustained heat that typical gaming chassis struggle to exhaust. Monitoring the VRAM memory junction temperature is critical, as thermal saturation often precedes CUDA errors and driver timeouts.» — AI Server Maintenance Team

«Applying a strict power limit, typically around 80 percent or 360W, is generally the most effective immediate workaround. It drastically mitigates both thermal output and power spikes with a negligible impact on overall training times.» — Data Science Operations Team


