Doctoral Thesis Oral Defense - Shuqi Dai

Time:
2:00pm ET

Location:
In Person and Virtual - Gates Hillman 8102 and Zoom

Speaker:
SHUQI DAI, Ph.D. Candidate, Computer Science Department, Carnegie Mellon University
https://www.shuqid.net/


Towards Artificial Musicians: Empowering Individual Music Expression In Composition, Performance, and Synthesis Through Machine Learning

Recent advances in music technology and generative AI have revolutionized music creation, transforming how we interact with music in various aspects of life. However, achieving high musicality and customizing music to individual preferences remain significant challenges. This thesis addresses five fundamental problems in current AI-driven music understanding and creation: (1) multimodal music representation, (2) highly complex and logical music structure, (3) stylistic and personalization controls, (4) data scarcity and copyright, and (5) ethical concerns. This work integrates music domain knowledge with machine learning to overcome these obstacles, focusing on a practical application: creating virtual musicians or "re-creating" existing musicians.

First, guided by music expertise, I introduce novel algorithms that analyze music data to identify and explore principles underlying music expression, with a focus on music repetition and structural hierarchy. Next, these principles are applied across three levels of music creation: symbolic composition, expressive performance control, and audio synthesis. For symbolic composition, both statistical machine learning and deep learning techniques are employed to compose melodies, harmonies, and bass lines that imitate specific music styles from given examples. Expressive performance control, crucial to music creativity yet often overlooked, is realized through diffusion models that generate timing, pitch, dynamics, and singing techniques. Audio synthesis is demonstrated through singing synthesis, which involves generating vocals from scratch and transferring vocal timbres, including zero-shot and cross-domain synthesis and conversion from unseen speech references. These approaches converge to model music expression across multimodal music representations.

This thesis emphasizes individual music preference and stylistic modeling, offering a range of controls for composition, performance, and synthesis. In symbolic composition, controls range from micro-level elements such as rhythm patterns and melodic contour to macro-level features like song style, structure, and harmony. In singing performance and synthesis, controls include language, style and genre, and singing techniques, with zero-shot capability to customize specific vocal timbres.

Experiments validate the effectiveness of these models, demonstrating performance competitive with human-created music. Ethical and legal concerns are also discussed. Finally, I highlight potential applications of these technologies in areas such as music therapy, education, human-computer interactive performance systems, and the development of world music theory.

Thesis Committee

Roger B. Dannenberg (Chair)
Chris Donahue
Jun-Yan Zhu
Julius O. Smith (Stanford University)
Gus Guangyu Xia (Mohamed bin Zayed University of Artificial Intelligence)

In Person and Zoom Participation. See announcement.

Event Website:
https://csd.cmu.edu/calendar/doctoral-thesis-oral-defense-shuqi-dai