MSCS Thesis Defense - Charlie Ruan

Time:
11:00 AM

Location:
In Person - Reddy Conference Room, Gates Hillman 4405

Speaker:
CHARLIE RUAN, Master's Student, Computer Science Department, Carnegie Mellon University
https://www.linkedin.com/in/charlie-ruan

Democratizing On-Device LLM Inference with Machine Learning Compilers and Web Technologies

Large language models (LLMs) have traditionally relied on cloud-based inference due to their high computational and memory demands. However, recent advances in small LLMs and in consumer hardware have made on-device inference increasingly practical. Among potential deployment targets, the web browser stands out as a uniquely compelling platform: it is universally accessible, naturally abstracts away hardware heterogeneity, lets applications run without any dependency installation, and provides a natural agentic environment for task automation.

This thesis presents WebLLM, a high-performance TypeScript framework that enables LLM inference entirely within the web browser. WebLLM compiles LLMs ahead of time with the MLC-LLM and Apache TVM compiler stack, generating optimized WebGPU kernels and a portable WebAssembly runtime. The system exposes a familiar OpenAI-style API, supports efficient GPU acceleration, and integrates cleanly with browser environments through Web Workers and WebAssembly. To enable structured generation, which is especially challenging for small LLMs, WebLLM incorporates XGrammar, an efficient grammar-constrained decoding engine that lets developers enforce output formats such as JSON or DSLs with near-zero overhead. Together, these components demonstrate a path toward democratizing LLM access, making intelligent, private, and responsive AI experiences universally available through the web.
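As a minimal sketch of the OpenAI-style API described above, assuming the published @mlc-ai/web-llm package (the model identifier and prompt are illustrative, and the snippet is meant to run inside an ES module in a WebGPU-capable browser):

import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Fetch (and cache) the model weights, then load the precompiled
// WebGPU kernels and WebAssembly runtime in the browser.
// The model id below is one example from WebLLM's prebuilt list.
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
  initProgressCallback: (report) => console.log(report.text),
});

// OpenAI-style chat completion, executed entirely on the local GPU.
// response_format constrains decoding to valid JSON via XGrammar.
const reply = await engine.chat.completions.create({
  messages: [
    { role: "user", content: "Give a city and its population as JSON." },
  ],
  response_format: { type: "json_object" },
});
console.log(reply.choices[0].message.content);

Because the API mirrors the OpenAI client, existing chat-completion code can typically be retargeted to run locally by swapping in the WebLLM engine.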

Thesis Committee

Tianqi Chen (Chair)
Zhihao Jia

