Todd Mowry
Professor

Office: 9113 Gates and Hillman Centers
Email: tcm@cs.cmu.edu
Phone: (412) 268-3725
Department: Computer Science Department
Administrative Support Person: Marcella Baker

Research Interests: Systems, Computer Architecture, Databases

Advisees: Sam Arch, Patrick Coppock, Hongyi Jin, Ruihang Lai, Eliot Solomon

CSD Courses Taught:
15418 - Spring, 2025
15618 - Spring, 2025
15745 - Fall, 2024
15418 - Spring, 2024
15618 - Spring, 2024

The goal of my research is to dramatically boost the performance of future microprocessor-based systems. To accomplish this, we exploit various forms of parallelism through a combination of novel architectural, compiler, and operating systems support. In particular, we have been focusing on the opportunities and challenges created by two important VLSI technology trends which are expected to reshape computer systems over the next decade: the potential for single-chip multiprocessing due to higher levels of single-chip integration, and the need to tolerate off-chip latency as the gap between processor speed and the speed of memory and I/O continues to widen.

Single-Chip Multiprocessing: The STAMPede Project. As advances in integrated circuit technology continue to provide more and more transistors on a chip, processor architects are faced with the pleasant challenge of finding the best way to translate these additional resources into improved performance. One of the more compelling options is to integrate multiple processors onto the same chip. While this will certainly increase computational throughput, it will only reduce the execution time of a given application if that application can be run in parallel. Hence, the key question is: how do we convert the applications that we care about into parallel programs? Expecting programmers to write only parallel programs from now on is unrealistic. Instead, the preferred solution would be for the compiler to parallelize programs automatically.
Unfortunately, compilers have only been successful so far at parallelizing the numeric applications commonly run on supercomputers. For single-chip multiprocessing to have an impact on the majority of users, we must also find a way to automatically parallelize the non-numeric applications (e.g., spreadsheets, web software, and graphics codes) which account for the bulk of the software run on commercial microprocessors. Based on our preliminary studies, we believe that a breakthrough in our ability to automatically parallelize non-numeric applications may be possible through "thread-level data speculation", a technique that allows the compiler to safely parallelize applications in cases where it believes that dependences are unlikely, but cannot statically prove that they do not exist. To accomplish this, we add modest hardware support to track data dependence violations at run-time and alert the software so that it can recover appropriately. Developing the architectural, compiler, and operating system support necessary to turn this potential into a reality is the goal of the STAMPede (Single-chip Tightly-coupled Architecture for MultiProcessing) project.

Coping with Large Latencies. Processor speeds are continuing to increase far more rapidly than the speeds of off-chip components such as DRAM, disk, and networks, largely due to physical limitations such as distance and the speed of light. The challenge presented by this trend is that, from the processor's perspective, the latency of main memory and I/O is increasing at a dramatic rate, and thus threatens to become an increasingly important performance bottleneck. The good news, however, is that the bandwidth of these off-chip devices has been improving through innovations such as synchronous (i.e., pipelined) DRAM, disk arrays, and fiber optic networks.
Therefore, we are exploring new ways that the compiler (with varying degrees of help from the hardware and the operating system) can use prefetching and other techniques to intelligently trade increased bandwidth consumption for reduced overall latency. Recent work in this area has included prefetching pointer-based codes, prefetching to hide disk latency in out-of-core numeric applications, and hiding network communication latency in workstation clusters.

Publications

Preprint • A System for Microserving of LLMs • 2024 • Jin H, Lai R, Ruan CF, Wang Y, Mowry TC, Miao X, Jia Z, Chen T
Journal Article • Address Scaling: Architectural Support for Fine-Grained Thread-Safe Metadata Management • 2024 • IEEE Computer Architecture Letters • 23(1):69-72 • Mishra D, Kanellopoulos K, Panwar A, Sriraman A, Seshadri V, Mutlu O, Mowry TC
Preprint • ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time • 2023 • Fegade P, Chen T, Gibbons PB, Mowry TC
Preprint • ED-Batch: Efficient Automatic Batching of Dynamic Neural Networks via Learned Finite State Machines • 2023 • Chen S, Fegade P, Chen T, Gibbons PB, Mowry TC
Conference • ED-Batch: Efficient Automatic Batching of Dynamic Neural Networks via Learned Finite State Machines • 2023 • International Conference on Machine Learning, Vol 202 • 202: • Chen S, Fegade P, Chen T, Gibbons PB, Mowry TC