5th Year MSCS Thesis Presentation - Christopher Crawford

October 22, 2025
10:00am - 11:30am
Location: In Person - Reddy Conference Room, Gates Hillman 4405

Speaker: Christopher Crawford, Master's Student, Computer Science Department, Carnegie Mellon University

Morphologically-Informed Tokenizers for Languages with Non-Concatenative Morphology

This paper investigates the impact of using morphologically-informed tokenizers to complete the interlinear gloss annotation of an audio corpus of Yoloxóchitl Mixtec (YM) using a combination of ASR and text-based sequence-to-sequence tools. We present two novel tokenization schemes that separate words in a nonlinear manner, preserving information about tonal morphology as much as possible. One of these approaches, a Segment and Melody tokenizer, simply extracts the tones without predicting segmentation. The other, a Sequence of Processes tokenizer, predicts segmentation for the words, which could allow an end-to-end ASR system to produce segmented and unsegmented transcriptions in a single pass. We find that these novel tokenizers are competitive with BPE models, and the Segment-and-Melody model outperforms BPE in terms of word error rate but does not reach the same character error rate. Our results suggest that nonlinear tokenizers designed specifically for the non-concatenative morphology of a language are competitive with conventional BPE models for ASR. Further research will be necessary to determine the applicability of these tokenizers in downstream processing tasks.

Thesis Committee
David Mortensen (Chair)
Shinji Watanabe

Additional Information
For More Information: amalloy@cs.cmu.edu