Doctoral Speaking Skills Talk - Daiyaan Arfeen

April 17, 2025 3:00pm — 4:00pm

Location:
In Person - Gates Hillman 7101

Speaker:
DAIYAAN ARFEEN, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://csd.cmu.edu/people/doctoral-student/daiyaan-arfeen

Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training

LLM training is scaled up to tens of thousands of GPUs through a mix of data-parallel (DP) and model-parallel (MP) execution. Critical to achieving efficiency is tensor-parallel (TP; a form of MP) execution within tightly coupled subsets of GPUs, referred to as scale-up domains; the larger the scale-up domain, the better the performance. New datacenter architectures are emerging in which more GPUs can be tightly coupled in a scale-up domain, for example moving from 8 GPUs to 72 GPUs connected via NVLink.

Unfortunately, larger scale-up domains increase the blast radius of failures: the failure of a single GPU can impact TP execution across the full scale-up domain, which can degrade overall LLM training throughput dramatically. With as few as 0.1% of GPUs in a failed state, a high-TP-degree job can experience a nearly 10% reduction in LLM training throughput. We propose nonuniform tensor parallelism (NTP) to mitigate this amplified impact of GPU failures. In NTP, a DP replica that experiences GPU failures operates at a reduced TP degree, contributing throughput proportional to the fraction of its GPUs that are still functional.
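
To make the blast-radius amplification concrete, here is a minimal back-of-the-envelope sketch. It is not the speaker's model: it assumes GPU failures are independent, that under uniform TP a single failed GPU idles its entire scale-up domain, and that under NTP a replica loses throughput only in proportion to its failed GPUs. The function names are hypothetical and chosen for illustration only.

```python
def uniform_tp_loss(failed_fraction: float, domain_size: int) -> float:
    """Expected throughput loss when one failed GPU idles its whole domain.

    Assumption: failures are independent, so a domain of `domain_size` GPUs
    is fully healthy with probability (1 - failed_fraction) ** domain_size.
    """
    return 1.0 - (1.0 - failed_fraction) ** domain_size


def ntp_loss(failed_fraction: float) -> float:
    """Expected throughput loss when only the failed GPUs stop contributing.

    Assumption: a replica at reduced TP degree contributes throughput in
    proportion to its remaining functional GPUs.
    """
    return failed_fraction


if __name__ == "__main__":
    f = 0.001  # 0.1% of GPUs in a failed state, as in the abstract
    for n in (8, 72):
        print(f"domain size {n}: uniform TP loss = {uniform_tp_loss(f, n):.2%}, "
              f"NTP loss = {ntp_loss(f):.2%}")
```

Under these simplifying assumptions the loss with uniform TP grows roughly linearly with the domain size for small failure rates, while the NTP loss stays at the failure rate itself. The exact figure quoted in the abstract depends on the job configuration and is not reproduced by this sketch.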

We also propose a rack design with improved electrical and thermal capabilities that can sustain power boosting of scale-up domains that have experienced failures. Combined with NTP, this can allow a DP replica with a reduced TP degree (i.e., one with failed GPUs) to keep up with the others, thereby achieving near-zero throughput loss for large-scale LLM training.
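
As a rough illustration of what "keeping up" would require, the sketch below computes the per-GPU speedup a degraded replica needs in order to match a healthy one. It rests on assumptions that are not from the talk: the replica's work is spread evenly over its surviving GPUs, and replica throughput scales linearly with per-GPU speed. The function name `required_boost` is hypothetical.

```python
def required_boost(domain_size: int, failed_gpus: int) -> float:
    """Per-GPU speedup needed for a degraded replica to match a healthy one.

    Assumption: the same amount of work is divided evenly among the
    surviving GPUs, so each must run domain_size / survivors times faster.
    """
    survivors = domain_size - failed_gpus
    if failed_gpus < 0 or survivors <= 0:
        raise ValueError("need 0 <= failed_gpus < domain_size")
    return domain_size / survivors


if __name__ == "__main__":
    for failed in (1, 2, 4):
        boost = required_boost(72, failed)
        print(f"{failed} failed of 72: each survivor must run {boost:.3f}x faster")
```

Under these assumptions, one failed GPU in a 72-GPU domain calls for each survivor to run 72/71 times faster, which is the kind of modest headroom that power boosting would have to supply.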

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement
