Joint Speaking Skills Talk / Database Lunch Group Talk - Wan Shen Lim
October 22, 2024, 12:00pm to 1:00pm
Location: In Person - Blelloch-Skees Conference Room, Gates Hillman 8115
Speaker: WAN SHEN LIM, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://wanshenl.me/official/

Accelerating Machine Learning for Database Systems

Database tuning is the process of finding better configurations to optimize the performance of a database management system (DBMS). The size and complexity of the tuning search space make it difficult for a human to manually discover good configurations. This necessitates the use of automated methods that rely on machine learning (ML) models to predict the DBMS's run-time behavior. These ML models enable the evaluation of candidate configurations without the expensive execution of queries. However, the high cost of obtaining the training data for these ML models makes them impractical for real-world deployments. First, generating the training data itself requires expensive query execution. Second, the training data is "specialized" as it depends on instance-specific factors such as the workload, schema, database version, and more. Unlike other ML tasks, a pre-trained model cannot be downloaded off the internet; each database deployment must collect its own expensive training data from scratch. This problem is exacerbated by how frequently models are invalidated by changes in the DBMS's environment. Consequently, training data generation has become a major bottleneck in machine learning for database research, taking weeks or even months. To mitigate this problem, we make the critical observation that training data does not require accurate query results (unlike ordinary query execution).
This allows us to modify query execution semantics with a "training data mode" to approximate and eliminate repetition from the training data generation process, achieving up to a 268x speedup with modest degradation in model accuracy. Having addressed this bottleneck, we will also briefly discuss the next challenge and our next steps.

Presented as part of the Database Group Lunch Talks
Presented in Partial Fulfillment of the CSD Speaking Skills Requirement
Event Website: https://csd.cmu.edu/calendar/joint-speaking-skills-talk-database-lunch-group-talk-wan-shen-lim