MSCS Thesis Defense - Charlie Ruan

May 1, 2025, 10:00am - 11:00am
Location: In Person - Reddy Conference Room, Gates Hillman 4405
Speaker: Charlie Ruan, Master's Student, Computer Science Department, Carnegie Mellon University
https://www.linkedin.com/in/charlie-ruan

Democratizing On-Device LLM Inference with Machine Learning Compilers and Web Technologies

Large language models (LLMs) have traditionally relied on cloud-based inference due to their high computational and memory demands. However, recent advances in small LLMs and consumer hardware capabilities have made on-device inference increasingly practical. Among potential deployment targets, the web browser stands out as a uniquely compelling platform: it is universally accessible, naturally abstracts away hardware heterogeneity, requires no dependency installation for web applications, and provides a natural agentic environment for task automation.

WebLLM is a high-performance TypeScript framework that enables LLM inference entirely within web browsers. WebLLM compiles LLMs ahead of time using the MLC-LLM and Apache TVM compiler stack to generate optimized WebGPU kernels and a portable WebAssembly runtime. The system exposes a familiar OpenAI-style API, supports efficient GPU acceleration, and integrates seamlessly with browser environments using Web Workers and WebAssembly. To enable structured generation, which is especially challenging for small LLMs, WebLLM incorporates XGrammar, an efficient grammar-constrained decoding engine, allowing developers to enforce output formats such as JSON or DSLs with near-zero overhead. Together, these components demonstrate a path toward democratizing LLM access, making intelligent, private, and responsive AI experiences universally available through the web.

Thesis Committee:
Tianqi Chen (Chair)
Zhihao Jia