5th Year MSCS Thesis Presentation - Christopher Crawford

— 11:30am

Location:
In Person - Reddy Conference Room, Gates Hillman 4405

Speaker:
CHRISTOPHER CRAWFORD, Master's Student
Computer Science Department
Carnegie Mellon University

Morphologically-Informed Tokenizers for Languages with Non-Concatenative Morphology

This paper investigates the impact of morphologically-informed tokenizers on completing the interlinear gloss annotation of an audio corpus of Yoloxóchitl Mixtec (YM), using a combination of ASR and text-based sequence-to-sequence tools. We present two novel tokenization schemes that separate words in a nonlinear manner, preserving as much information about tonal morphology as possible. One of these approaches, the Segment-and-Melody tokenizer, simply extracts the tones without predicting segmentation. The other, the Sequence of Processes tokenizer, predicts segmentation for the words, which could allow an end-to-end ASR system to produce segmented and unsegmented transcriptions in a single pass. We find that these novel tokenizers are competitive with BPE models: the Segment-and-Melody model outperforms BPE in terms of word error rate but does not reach the same character error rate. Our results suggest that nonlinear tokenizers designed specifically for the non-concatenative morphology of a language are competitive with conventional BPE models for ASR. Further research will be necessary to determine the applicability of these tokenizers in downstream processing tasks.

Thesis Committee
David Mortensen (Chair)
Shinji Watanabe

For More Information:
amalloy@cs.cmu.edu

