5th Year Master's Thesis Presentation - Jinqi Chen

August 14, 2025, 2:00pm – 3:30pm
Location: In Person - Traffic21 Classroom, Gates Hillman 6501
Speaker: Jinqi Chen, Master's Student, Computer Science Department, Carnegie Mellon University
https://www.linkedin.com/in/jinqichen

Towards Effortless High-Performance Kernel Development for LLM Workloads

Recent advances in large language models (LLMs) have pushed GPU hardware to its limits, requiring highly optimized kernels for compute- and bandwidth-intensive operations such as normalization, matrix multiplication, attention, and inter-GPU communication. However, achieving state-of-the-art efficiency often demands deep low-level expertise, slowing development and limiting accessibility.

This thesis presents TIR+, a multi-level compiler framework that unifies high-level productivity and low-level optimization within a single compilation and runtime infrastructure. TIR+ spans from a Python-based tiling DSL, enabling rapid kernel prototyping, to a hardware-centric intermediate representation (IR), offering fine-grained control over memory, parallelism, and specialized instructions. Between these extremes, it provides optimized tensor libraries and reusable primitives inspired by CUTLASS and CuTe. Crucially, TIR+ is distributed-aware, supporting multi-GPU execution with built-in communication management and compute–communication overlap.

We demonstrate TIR+ on key LLM kernels, including LayerNorm/RMSNorm, GEMM, FlashAttention-style attention, and combined compute–communication kernels. Across these cases, TIR+ delivers near–state-of-the-art throughput with significantly less development effort than hand-tuned CUDA, demonstrating a unified and scalable path toward hardware-aware kernel optimization for current and future AI workloads.

Thesis Committee:
Tianqi Chen (Chair)
Zhihao Jia

For More Information: amalloy@cs.cmu.edu
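For readers unfamiliar with the kernels named above, the sketch below gives the reference semantics of RMSNorm in plain NumPy. It is illustrative only: the function name, signature, and epsilon value are our own choices, and it shows the computation a tuned GPU kernel must reproduce, not TIR+'s actual DSL, which does not appear in this abstract.

```python
# Reference RMSNorm in plain NumPy. This is only the mathematical
# specification that an optimized GPU kernel must match; it is NOT
# TIR+ code. All names here are illustrative assumptions.
import numpy as np

def rmsnorm(x: np.ndarray, gamma: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each row of x by its root-mean-square, then scale by gamma."""
    # Mean of squares over the hidden dimension (last axis).
    ms = np.mean(x * x, axis=-1, keepdims=True)
    # Divide by the RMS (with eps for numerical stability), apply learned gain.
    return x / np.sqrt(ms + eps) * gamma

# Example: a batch of 4 token vectors with hidden size 8.
x = np.random.randn(4, 8).astype(np.float32)
gamma = np.ones(8, dtype=np.float32)
print(rmsnorm(x, gamma).shape)  # (4, 8)
```

A production kernel fuses the square, reduction, and scaling passes into a single pass over the data to stay bandwidth-bound rather than launch-bound, which is the kind of low-level concern the thesis's tiling DSL aims to abstract away.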