5th Year Master's Thesis Presentation - Jinqi Chen

Time: 3:30pm

Location:
In Person - Traffic21 Classroom, Gates Hillman 6501

Speaker:
JINQI CHEN, Master's Student
Computer Science Department
Carnegie Mellon University
https://www.linkedin.com/in/jinqichen

Towards Effortless High-Performance Kernel Development for LLM Workloads

Recent advances in large language models (LLMs) have pushed GPU hardware to its limits, requiring highly optimized kernels for compute- and bandwidth-intensive operations such as normalization, matrix multiplication, attention, and inter-GPU communication. However, achieving state-of-the-art efficiency often demands deep low-level expertise, slowing development and limiting accessibility.

This thesis presents TIR+, a multi-level compiler framework that unifies high-level productivity and low-level optimization within a single compilation and runtime infrastructure. TIR+ spans from a Python-based tiling DSL, enabling rapid kernel prototyping, to a hardware-centric intermediate representation (IR), offering fine-grained control over memory, parallelism, and specialized instructions. Between these extremes, it provides optimized tensor libraries and reusable primitives inspired by CUTLASS and CuTe. Crucially, TIR+ is distributed-aware, supporting multi-GPU execution with built-in communication management and compute-communication overlap. We demonstrate the capability of TIR+ through key LLM kernels, including LayerNorm/RMSNorm, GEMM, FlashAttention-style attention, and combined compute-communication kernels. Across these cases, TIR+ delivers near-state-of-the-art throughput with significantly less development effort than hand-tuned CUDA, charting a unified and scalable path toward hardware-aware kernel optimization for current and future AI workloads.
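To give a flavor of the kind of kernel the talk covers (the TIR+ DSL itself is not shown here), the following is a minimal sketch in plain NumPy of the tiled RMSNorm computation such a tiling DSL would express; the function name, tile size, and two-pass structure are illustrative assumptions, not TIR+'s actual API.

```python
# Hypothetical sketch: tiled RMSNorm in plain NumPy, illustrating the
# computation a tiling DSL would express on GPU. Names and tile size are
# illustrative assumptions, not TIR+'s actual API.
import numpy as np

def rmsnorm_tiled(x: np.ndarray, weight: np.ndarray,
                  tile: int = 128, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the last axis, accumulating sums of squares tile by
    tile (mirroring how a GPU kernel streams a row through shared memory)."""
    rows, cols = x.shape
    out = np.empty_like(x)
    for r in range(rows):
        # Pass 1: accumulate the sum of squares one tile at a time.
        acc = 0.0
        for c in range(0, cols, tile):
            chunk = x[r, c:c + tile]
            acc += float(np.dot(chunk, chunk))
        inv_rms = 1.0 / np.sqrt(acc / cols + eps)
        # Pass 2: normalize and apply the learned scale, again tile by tile.
        for c in range(0, cols, tile):
            out[r, c:c + tile] = x[r, c:c + tile] * inv_rms * weight[c:c + tile]
    return out
```

A hand-tuned CUDA version of even this simple kernel must manage shared-memory staging, warp-level reductions, and vectorized loads explicitly; the abstract's claim is that a tiling DSL lets the author state only the tile-level computation above while the compiler handles those details.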

Thesis Committee
Tianqi Chen (Chair)
Zhihao Jia

Additional Information

For More Information:
amalloy@cs.cmu.edu

