405B Model Training Faces Numerous Challenges: NVIDIA GPUs Frequently Fail, Meta Engineers Respond Ingeniously

GPUs and their high-bandwidth memory are the main causes of more than half of the failures.

58.7% of Unexpected Interruptions Caused by GPUs; Only Three Incidents Required Significant Manual Intervention

According to reports, there were 466 job interruptions during the 54-day pre-training period. Of these, 47 were planned interruptions due to automated maintenance, such as firmware upgrades, or operator-initiated actions such as configuration or dataset updates; the remaining 419 were unexpected interruptions, mostly stemming from confirmed hardware issues, including GPU and host-component failures, or from suspected hardware-related problems such as silent data corruption and unplanned single-host maintenance events.

GPU issues were the largest category of unexpected interruptions, accounting for 58.7% of them, and included a range of GPU failures such as NVLink faults and HBM3 memory errors. This is not surprising: Nvidia's H100 GPU draws about 700 W and endures significant thermal stress. Despite the large number of failures, only three incidents required significant manual intervention; the rest were handled automatically.

The other 41.3% of unexpected interruptions were caused by a combination of software errors, network cables, and network adapters. Interestingly, only two CPUs failed during this period.

(Figure: root causes of unexpected interruptions during the 54-day Llama 3 405B pre-training period.)
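In absolute terms, those percentages work out roughly as follows (a back-of-the-envelope calculation using only the figures quoted above):

```python
# Back-of-the-envelope breakdown of the 419 unexpected interruptions,
# using only the percentages quoted in this article.
total_unexpected = 419

gpu_related = total_unexpected * 0.587    # ~246 interruptions (GPU, NVLink, HBM3, ...)
other_causes = total_unexpected * 0.413   # ~173 interruptions (software, cables, NICs, ...)

print(f"GPU-related: ~{gpu_related:.0f}")
print(f"Other causes: ~{other_causes:.0f}")
```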

Another challenge faced by the Llama 3 405B large model training team was the simultaneous power consumption changes of tens of thousands of GPUs, which put pressure on the data center's power grid.

During training, thousands of GPUs may simultaneously increase or decrease power consumption, for example, when waiting for checkpoints to complete or collective communications to end, or during the startup or shutdown of the entire training task. When this occurs, it can cause instantaneous fluctuations in data center power consumption on the order of tens of megawatts, potentially overwhelming the power grid.
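A quick order-of-magnitude check using figures already cited in this article (16,384 GPUs at roughly 700 W each) shows why synchronized ramps land in this range:

```python
# Order-of-magnitude check using figures quoted earlier in this article.
gpus = 16_384
watts_per_gpu = 700                          # approximate H100 board power

gpu_power_mw = gpus * watts_per_gpu / 1e6    # ~11.5 MW for the GPUs alone
print(f"GPU power alone: ~{gpu_power_mw:.1f} MW")
# Host CPUs, memory, networking, and cooling push the total higher, so a
# cluster-wide ramp up or down can plausibly swing tens of megawatts.
```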

This is an ongoing challenge, meaning Meta must ensure its data centers have sufficient power to maintain normal operation of the 405B model and future larger-scale Llama models. As AI models continue to grow in complexity, the required computational resources are also increasing.

Efforts Behind Achieving 90% Effective Training Time

To improve efficiency, Meta developed various tools and optimization strategies, including shorter job startup and checkpointing times, extensive use of PyTorch's built-in NCCL flight recorder, and tooling to identify lagging GPUs. Among these, NCCLX played a crucial role in fault detection and localization, especially for NVLink- and RoCE-related issues; its integration with PyTorch allows communication stalls caused by NVLink failures to be monitored and automatically timed out.

PyTorch's NCCL flight recorder records collective metadata and stack traces to a ring buffer, enabling rapid diagnosis and resolution of hangs and performance issues at scale, especially those related to NCCLX. In addition, Meta's mixed use of NVLink and RoCE in the network made debugging issues in large-scale training more complex: data transfers over NVLink are typically performed via load/store operations issued by CUDA kernels, and failures of remote GPUs or NVLink connections usually manifest as stalled load/store operations inside CUDA kernels, without returning an explicit error code.
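To make this concrete, the sketch below shows how flight-recorder-style tracing is typically switched on in recent PyTorch releases via environment variables. The variable names come from PyTorch's own debugging documentation rather than from Meta's paper, and exact names and defaults vary across versions, so treat this as an assumption to verify against your installed release.

```python
# Minimal sketch: switching on PyTorch's NCCL flight-recorder tracing.
# Variable names follow recent PyTorch debugging docs and may differ
# between releases -- verify against the docs for your installed version.
import os

# Keep the most recent N collective records (metadata + stack traces)
# in an in-memory ring buffer on every rank.
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"

# When a collective times out, dump the recorded buffer to a file so the
# hang can be diagnosed offline (which collective, which ranks, what state).
os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "1"

# Both must be set before the NCCL process group is created.
```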

NCCLX improved the speed and accuracy of fault detection and localization through close collaborative design with PyTorch, allowing PyTorch to access NCCLX's internal state and track relevant information. While it's not possible to completely prevent hangs caused by NVLink failures, the system monitors the communication library's status and automatically times out when such hangs are detected. Furthermore, NCCLX tracks kernel and network activity for each NCCLX communication and provides internal state snapshots of failed NCCLX collectives, including completed and pending data transfers between all ranks.
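The automatic-timeout behavior described here can be approximated with stock PyTorch by giving the process group a finite timeout, as in the sketch below. It uses only the public torch.distributed API and is an illustration, not Meta's NCCLX implementation; exact watchdog behavior depends on the PyTorch version and its NCCL error-handling settings.

```python
# Minimal sketch, not Meta's NCCLX: bound how long a collective may stall.
# With the NCCL backend, PyTorch's watchdog surfaces an error when a
# collective exceeds the process-group timeout instead of hanging forever.
from datetime import timedelta

import torch
import torch.distributed as dist

def init_with_timeout(minutes: int = 10) -> None:
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=minutes))
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

def all_reduce_checked(grad: torch.Tensor) -> torch.Tensor:
    # If a remote GPU or NVLink link has failed, this call eventually trips
    # the timeout, and the job can be restarted from the last checkpoint.
    dist.all_reduce(grad)
    return grad
```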

Sometimes, hardware issues produce "stragglers": GPUs that are still running but at a slower pace, which are difficult to detect. Even a single straggler can slow down thousands of other GPUs, often manifesting as communication that looks normal but is slow. To address this, Meta developed tools that prioritize potentially problematic communications from selected process groups, allowing stragglers to be detected and resolved promptly so that slowdowns are minimized and overall training efficiency is maintained.
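As an illustration of the idea rather than Meta's actual tooling, the hypothetical helper below has every rank time its own training step, gathers the timings, and flags ranks that are far slower than the median; it relies only on the public torch.distributed API.

```python
# Hypothetical straggler check, sketched on top of torch.distributed.
# Each rank reports its last step time; every rank sees the same result.
import statistics
import torch.distributed as dist

def flag_stragglers(step_seconds: float, slack: float = 1.3):
    """Gather per-rank step times and return ranks slower than slack * median."""
    world_size = dist.get_world_size()
    timings = [None] * world_size
    dist.all_gather_object(timings, step_seconds)

    median = statistics.median(timings)
    return [rank for rank, t in enumerate(timings) if t > slack * median]

# Usage (inside the training loop, every few hundred steps):
#   slow_ranks = flag_stragglers(measured_step_time)
#   if dist.get_rank() == 0 and slow_ranks:
#       print(f"Potential stragglers: {slow_ranks}")
```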

Another interesting observation concerns the impact of environmental factors on large-scale training performance. For Llama 3 405B, Meta noticed a 1-2% throughput variation depending on the time of day, caused by higher midday temperatures affecting the GPUs' dynamic voltage and frequency scaling. This is not a major issue, however; it is simply the expected consequence of DVFS responding to temperature.

Conclusion

Considering that a cluster of 16,384 H100 GPUs experienced 419 unexpected failures over 54 days, or 7.76 every 24 hours (roughly one every three hours), one can't help but wonder how often xAI's Memphis Supercluster, equipped with 100,000 H100 GPUs, will experience failures.
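For a rough sense of scale, the back-of-the-envelope calculation below linearly extrapolates Meta's observed interruption rate to 100,000 GPUs; it uses only the numbers quoted in this article and ignores differences in hardware, cooling, and software maturity.

```python
# Naive linear scaling of Meta's observed interruption rate to 100,000 GPUs.
meta_gpus = 16_384
meta_failures = 419
meta_days = 54

per_day = meta_failures / meta_days      # ~7.76 unexpected interruptions per day
per_gpu_day = per_day / meta_gpus

xai_gpus = 100_000
xai_per_day = per_gpu_day * xai_gpus     # ~47 interruptions per day if the rate held
print(f"Meta: {per_day:.2f}/day; scaled to {xai_gpus} GPUs: ~{xai_per_day:.0f}/day")
```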

Last week, Elon Musk boasted on the social platform X about launching "the world's most powerful AI training cluster," stating he would create "the world's most powerful AI by all metrics" before December this year. It is reported that the Memphis Supercluster has already begun training, using liquid cooling and a single RDMA network interconnect architecture.

Scaled proportionally, xAI's Memphis Supercluster has roughly six times as many GPUs, so it could see a correspondingly higher failure rate and several times as many failing components, posing an even greater challenge for its future AI training.

Reference Links:

https://www.inspire2rise.com/meta-faces-frequent-gpu-failures-llama-3-training.html

https://www.tomshardware.com/tech-industry/artificial-intelligence/faulty-nvidia-h100-gpus-and-hbm3-memory-caused-half-of-the-failures-during-llama-3-training-one-failure-every-three-hours-for-metas-16384-gpu-training-cluster

https://ai.meta.com/research/publications/the-llama-3-herd-of-models/