Saurabh Kadekodi
DISK-ADAPTIVE REDUNDANCY: tailoring data redundancy to disk-reliability heterogeneity in cluster storage systems
Degree Type: Ph.D. in Computer Science
Advisor(s): Greg Ganger, Rashmi Vinayak
Graduated: December 2020

Abstract:
Large-scale cluster storage systems contain hundreds of thousands of hard disk drives in their primary storage tier. Because clusters are not built all at once, there is significant heterogeneity among their disks in capacity, make/model, firmware, etc. Yet redundancy settings for data reliability are generally configured in a "one-scheme-fits-all" manner, assuming that this heterogeneous disk population has homogeneous reliability characteristics. In reality, we observe that different disk groups fail differently, giving clusters substantial disk-reliability heterogeneity. This dissertation paves the way for exploiting disk-reliability heterogeneity to tailor redundancy settings to different disk groups, providing cost-effective and arguably safer redundancy in large-scale cluster storage systems.

Our first contribution is an in-depth, data-driven analysis of the reliability of over 5.3 million disks, spanning over 60 makes/models, in three large production environments (Google, NetApp, and Backblaze). We observe that the strongest disks can be over an order of magnitude more reliable than the weakest disks in the same storage cluster. This makes today's static selection of redundancy schemes insufficient, wasteful, or both. We identify and quantify the opportunity to achieve lower storage cost along with increased data protection by means of disk-adaptive redundancy.

Our next contribution is the design of the heterogeneity-aware redundancy tuner (HeART), an online tuning tool that guides selection of different redundancy settings for long-term data reliability, based on the observed reliability properties of each disk group.
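The "insufficient, wasteful, or both" observation can be illustrated with a toy durability model (a sketch only: it assumes independent disk failures, a fixed 3-day repair window, and a hypothetical per-stripe reliability target; none of these numbers come from the dissertation's actual analysis). A k-of-n scheme stores n/k bytes per user byte and survives up to n-k chunk failures per stripe, so scheme choice trades space against tolerable failure rates:

```python
from math import comb

# Candidate schemes from the abstract: 3-way replication (1-of-3) and
# the 6-of-9 and 10-of-14 erasure codes.

def loss_probability(afr, k, n, repair_days=3):
    """Probability that a single k-of-n stripe loses data, under a toy
    model: disk failures are independent, and data is lost if more than
    n-k of the stripe's n disks fail within one repair window."""
    p = afr * repair_days / 365.0  # per-disk failure probability in the window
    return sum(comb(n, f) * p**f * (1 - p)**(n - f)
               for f in range(n - k + 1, n + 1))

def pick_scheme(afr, candidates, target=1e-11):
    """Return the lowest-overhead (k, n) whose stripe loss probability
    meets the target, or None if no candidate is safe enough."""
    ok = [(n / k, (k, n)) for k, n in candidates
          if loss_probability(afr, k, n) <= target]
    return min(ok)[1] if ok else None

candidates = [(1, 3), (6, 9), (10, 14)]
print(pick_scheme(0.02, candidates))  # strong disks (2% AFR) → (10, 14)
print(pick_scheme(0.20, candidates))  # weak disks (20% AFR) → None
```

Under this toy model, a reliable disk group qualifies for the leanest scheme while a sufficiently weak one is safe under none of the candidates, so any single static choice over-protects some disks or under-protects others.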
By processing disk failure data over time, HeART identifies the useful-life boundaries and steady-state failure rate of each deployed disk group by make/model. Using this information, HeART suggests the most space-efficient redundancy option that achieves the specified target data reliability. HeART is evaluated using longitudinal disk failure logs from a large production cluster with over 100K disks. Guided by HeART, the cluster could meet target data reliability levels with far fewer disks than one-scheme-for-all approaches: 11-16% fewer compared to erasure codes such as 10-of-14 or 6-of-9, and up to 33% fewer compared to 3-way replication.

While HeART promises substantial space-savings, it proves unusable in real-world production clusters because the IO load of transitions between redundancy schemes overwhelms the storage infrastructure (termed transition overload). Analysis of Google cluster traces shows transition overload consuming 100% of the cluster's IO bandwidth for weeks at a time, making it a show-stopper for practical disk-adaptive redundancy.

Building on insights drawn from our data-driven analysis, the next contribution of this dissertation is Pacemaker, a low-overhead disk-adaptive redundancy orchestrator that realizes HeART's dream in practice. Pacemaker mitigates transition overload by (1) proactively organizing data layouts to make future transitions efficient, (2) initiating transitions proactively in a manner that avoids urgency without compromising space-savings, and (3) using more IO-efficient redundancy transitioning mechanisms. Evaluation of Pacemaker with traces from four large (110K-450K disks) production clusters (three from Google and one from Backblaze) shows that transitions never require more than 5% of cluster IO bandwidth (only 0.2-0.4% on average).
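Why redundancy transitions can overwhelm a cluster, and why layout-aware mechanisms help, can be seen from a back-of-the-envelope IO accounting (a sketch under simple assumptions: the function, the hypothetical 6-of-8 target scheme, and the two-case accounting are illustrative, not Pacemaker's actual mechanisms):

```python
def transition_io(user_bytes, k_old, n_old, k_new, n_new, parity_only=False):
    """Rough bytes of IO (reads + writes) to move user_bytes of data
    between erasure-coding schemes.

    Naive re-encoding reads every old chunk and writes every new chunk.
    A parity-only transition (possible when k is unchanged and data
    chunks can stay in place) reads the data once to compute the new
    parities and writes only those parities.
    """
    if parity_only:
        assert k_old == k_new, "parity-only transitions keep data chunks fixed"
        read = user_bytes                               # data chunks only
        write = user_bytes * (n_new - k_new) / k_new    # new parity chunks
    else:
        read = user_bytes * n_old / k_old               # all old chunks
        write = user_bytes * n_new / k_new              # all new chunks
    return read + write

# Moving 1 PB of user data from 6-of-9 to a hypothetical 6-of-8 scheme:
naive = transition_io(1.0, 6, 9, 6, 8)                    # ≈ 2.83 PB of IO
cheap = transition_io(1.0, 6, 9, 6, 8, parity_only=True)  # ≈ 1.33 PB of IO
print(naive, cheap)
```

Even this simplified accounting yields multi-petabyte IO bills for naive re-encoding and cuts them roughly in half when data chunks can stay put; transitions that only reorganize parity are the flavor of optimization that keeps Pacemaker's transition IO small.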
Pacemaker achieves this while providing overall space-savings of 14-20% (compared to using a static 6-of-9 scheme) and never leaving data under-protected. The final contribution of this dissertation is the design and implementation of Pacemaker's disk-adaptive redundancy techniques in the widely used Hadoop Distributed File System (HDFS). This prototype re-purposes HDFS's existing architectural components for disk-adaptive redundancy, successfully leveraging the robustness and maturity of the existing code. Moreover, because the re-purposed components are fundamental to any distributed storage system's architecture, the prototype also serves as a guideline for future systems that wish to support disk-adaptive redundancy.

Thesis Committee:
Gregory R. Ganger (Co-Chair)
K.V. Rashmi (Co-Chair)
Garth A. Gibson (CMU / Vector Institute)
Arif Merchant (Google Inc.)
Remzi Arpaci-Dusseau (University of Wisconsin-Madison)
Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

Keywords: reliability, durability, fault-tolerance, redundancy, distributed storage systems, cluster storage systems, disks, HDD, erasure code, replication, heterogeneity

CMU-CS-20-142.pdf (3.38 MB, 111 pages)