Parallel Data Laboratory Talk - Suhas Jayaram Subramanya

August 7, 2024, 12:00pm - 1:00pm
Location: Virtual Presentation - ET - Remote Access - Zoom

Speaker: SUHAS JAYARAM SUBRAMANYA, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://suhasjs.github.io/

Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling

Large GPU clusters are becoming increasingly heterogeneous due to advances in GPU design and the incremental deployment of a mix of GPU types over time. Deep learning (DL) training jobs running on these clusters can see widely varying completion times, depending on the resources allocated by the cluster scheduler and the job hyper-parameters configured by users at submission time.

Sia is a cluster scheduler that (1) efficiently assigns heterogeneous GPU resources to elastic, resource-adaptive DL training jobs, and (2) configures job hyper-parameters to maintain high training efficiency for all running jobs without sacrificing the quality of trained models. We will discuss challenges in optimizing resource-adaptivity for deep learning training (DLT) jobs on large clusters with many GPU types, and introduce a new scheduling formulation that efficiently matches DLT jobs and their configurations to GPU types and counts, while adapting to changes in cluster load and job mix over time.

On job traces derived from real datacenters, Sia improves job completion times by 30-93% while using 12-60% fewer GPU hours. Furthermore, its scheduling policy is quick to evaluate and scales easily to clusters with many GPU types and thousands of GPUs.

Suhas Jayaram Subramanya is a final-year PhD student in the CS Department, advised by Prof. Greg Ganger. His primary research area is deep learning systems.

Zoom Participation. See announcement.

Event Website: https://pdl.cmu.edu/SDI/index.shtml