Abutalib Aghayev
Adopting Zoned Storage in Distributed Storage Systems
Degree Type: Ph.D. in Computer Science
Advisor(s): George Amvrosiadis
Graduated: August 2020

Abstract: Hard disk drives and solid-state drives are the workhorses of modern storage systems. For the past several decades, storage systems software has communicated with these drives using the block interface. The block interface was introduced early on with hard disk drives, and as a result, almost every storage system in use today was built for the block interface. Therefore, when flash memory based solid-state drives recently became viable, the industry chose to emulate the block interface on top of flash memory by running a translation layer inside the solid-state drives. This translation layer was necessary because the block interface was not a direct fit for flash memory. More recently, hard disk drives are shifting to shingled magnetic recording, which increases capacity but also violates the block interface. Thus, emerging hard disk drives are also emulating the block interface by running a translation layer inside the drive. Emulating the block interface using a translation layer, however, is becoming a source of significant performance and cost problems in distributed storage systems. In this dissertation, we argue for the elimination of the translation layer, and consequently the block interface. We propose adopting the emerging zone interface instead, a natural fit for both high-capacity hard disk drives and solid-state drives, and rewriting the storage backend component of distributed storage systems to use this new interface. Our thesis is that adopting the zone interface using a special-purpose storage backend will improve the cost-effectiveness of data storage and the predictability of performance in distributed storage systems. We provide the following evidence to support our thesis.
First, we introduce Skylight, a novel technique to reverse engineer the translation layer of modern hard disk drives, and demonstrate the high garbage collection overhead of these translation layers. Second, based on the insights from Skylight, we develop ext4-lazy, an extension of the popular ext4 file system, which is used as a storage backend in many distributed storage systems. Ext4-lazy significantly improves performance over ext4 on hard disk drives with a translation layer, but it also shows that in the presence of a translation layer it is hard to achieve the full potential of a drive with evolutionary file system changes. Third, we show that even in the absence of a translation layer, the abstractions provided by general-purpose file systems such as ext4 are inappropriate for a storage backend. To this end, we study the decade-long evolution of a widely used distributed storage system, Ceph, and pinpoint the technical reasons that render general-purpose file systems unfit for a storage backend. Fourth, to show the advantage of a special-purpose storage backend in adopting the zone interface, as well as the advantages of the zone interface itself, we extend BlueStore, Ceph's special-purpose storage backend, to work on zoned devices. As a result of this work, we demonstrate how having a special-purpose backend in Ceph enables quick adoption of the zone interface, how the zone interface eliminates in-device garbage collection when running RocksDB (a key-value database used for storing metadata in BlueStore), and how the zone interface enables Ceph to reduce tail latency and increase the cost-effectiveness of data storage without sacrificing performance.

Thesis Committee:
George Amvrosiadis (Chair)
Gregory R. Ganger
Garth A. Gibson
Peter J. Desnoyers (Northeastern University)
Remzi H. Arpaci-Dusseau (University of Wisconsin-Madison)
Sage A. Weil (Red Hat)

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

Keywords: File systems, distributed storage systems, shingled magnetic recording, zoned namespaces, zoned storage, hard disk drives, solid-state drives

CMU-CS-20-130.pdf (5.83 MB) (143 pages)