Computer Science Master of Science Thesis Defense
May 3, 2023, 3:30pm - 4:30pm
Location: In Person - Tepper Building 1403
Speaker: JUN (JOHN) LUO, Master's Student, Computer Science Department, Carnegie Mellon University

Using Computer Vision and Machine Learning to Unlock Historical Data

Historical data, especially data recorded in tables and forms, has significant value for contemporary research and industry applications. However, such data is rarely digitized or available in readily usable formats such as Excel sheets and database tables. Using historical property appraisals as a case study, we demonstrate how machine learning and computer vision methods can help address this data gap in a cost-effective way.

The earliest standardized property appraisal records in the United States were typically handwritten on physical cards. Using scanned cards from Ohio in the 1930s, we test approaches to digitize a property's earliest appraised value. We find that image processing and Optical Character Recognition (OCR) deep learning models can retrieve this value with a Mean Absolute Percentage Error (MAPE) of 14.72%. For cases where OCR cannot be applied, such as when scanned documents are not available, our machine learning model can use contemporary data to estimate this value with a somewhat higher error of 17.48% MAPE. Both methods offer substantial savings over manually digitizing the same data, with OCR achieving a cost reduction of 81% and the machine learning model achieving a cost reduction of 89%.
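The abstract reports accuracy as Mean Absolute Percentage Error (MAPE): the average of |actual - predicted| / |actual| over all records, expressed as a percentage. The following is a minimal sketch of that standard definition, not the thesis's own evaluation code; the function name `mape` and the sample dollar values are hypothetical and chosen only for illustration.

```python
def mape(actual, predicted):
    """Return the Mean Absolute Percentage Error, in percent.

    Uses the standard definition: mean of |a - p| / |a| times 100.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one value is required")

    total = 0.0
    for a, p in zip(actual, predicted):
        if a == 0:
            # The percentage error is undefined when the true value is zero.
            raise ValueError("actual values must be non-zero")
        total += abs(a - p) / abs(a)
    return 100.0 * total / len(actual)


# Hypothetical appraised values in dollars (illustration only, not thesis data).
true_values = [1000.0, 2000.0, 4000.0]
extracted_values = [900.0, 2200.0, 4000.0]

print(f"MAPE: {mape(true_values, extracted_values):.2f}%")
```

Under this definition a lower MAPE means a closer match, so the 14.72% reported for OCR indicates a smaller average relative error than the 17.48% reported for the machine learning model.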
Additional Information
Thesis Committee:
Matt Gormley (Chair)
Rayid Ghani