Yixin Luo
Architectural Techniques for Improving NAND Flash Memory Reliability
Degree Type: Ph.D. in Computer Science
Advisor(s): Onur Mutlu
Graduated: May 2018

Abstract:

Over the past decade, NAND flash memory has rapidly grown in popularity within modern computing systems, thanks to its short random access latency, high internal parallelism, low static power consumption, and small form factor. Today, NAND flash memory is widely used as the primary storage medium for smartphones, personal laptops, and data center servers. This growth in flash memory popularity has been sustained by the decreasing cost per bit of NAND flash memory devices over each technology generation, which is due to the increased storage density. The higher density, however, comes at the cost of reduced storage reliability. Unreliable primary data storage could lead to permanent loss of valuable data. Thus, we must improve NAND flash memory reliability to prevent data loss.

Existing techniques to keep NAND flash memory reliable are costly. For example, a strong error correcting code (ECC), such as a low-density parity-check (LDPC) code, is typically used to tolerate a relatively high raw bit error rate (e.g., between 10^-3 and 10^-2) from the flash memory. However, such ECC incurs significant data redundancy, high latency, and high area overhead in today's designs. To tolerate more errors in future generations of NAND flash memories, a stronger ECC is needed, which requires an even larger amount of data redundancy, latency, and area overhead than the ECC used in today's flash memory. Such ECC is not only very costly, but it also may not be as effective as other novel techniques that are specialized for different error types.

Our goal in this dissertation is to greatly improve flash memory reliability at low cost. We identify three opportunities to improve the cost-efficiency of flash reliability enhancement techniques.
First, we can adapt the flash controller to various NAND flash memory error characteristics, or even to the error characteristics of each individual flash chip. Second, we can adapt the flash controller to how the host uses the NAND flash memory, e.g., application access patterns and environmental temperature. Third, the flash chips are typically managed by a powerful controller within the solid-state drive (SSD). This powerful computing resource is underutilized when the SSD is idle or when the workload has low access intensity. We can use the flash controller to optimize flash reliability in the background without capacity or performance loss.

Third, we perform the first detailed, comprehensive characterization and analysis of 3D NAND flash memory errors. Through this analysis, we identify three new error characteristics in 3D NAND flash memory due to its unique structure and cell design. We develop models for two of the new error characteristics that are significant in current-generation 3D NAND flash chips. We develop four new mechanisms within the flash controller to mitigate the three new error types, and thus greatly reduce the error rate, at low cost.

Fourth, we perform the first experimental characterization of the self-recovery effect on 3D NAND flash memory and show that dwell time, i.e., the idle time between write cycles, and temperature significantly impact retention loss speed and program accuracy. We develop a new unified model of these effects, called the Unified self-Recovery and Temperature model (URT). Using this model, we propose a new technique called HeatWatch to mitigate errors due to early retention loss in 3D NAND flash memory. HeatWatch reduces the raw bit error rate by tuning the read reference voltages to the dwell time of the workload and the operating temperature of the flash memory.
We show that HeatWatch efficiently tracks the temperature and dwell time of NAND flash memory and greatly mitigates retention errors in 3D NAND flash memory using this information.

Overall, this dissertation (1) deepens the understanding of the error characteristics of both planar and 3D NAND flash memory through rigorous experimental characterization, and (2) develops new flash controller algorithms that improve NAND flash memory reliability (both lifetime and error rate) at low cost by taking advantage of the flash device and workload characteristics that we find based on our new understanding.

Thesis Committee:
Onur Mutlu (Chair)
Phillip B. Gibbons
James C. Hoe
Yu Cai (SK Hynix)
Erich F. Haratsch (Seagate Technology)

Frank Pfenning, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science

Keywords: Flash Memory, NAND Flash Memory, 3D NAND Flash Memory, Solid-state Drives (SSD), Nonvolatile Memory (NVM), Integrated Circuit Reliability, Error Characterization, Error Mitigation, Error Correction, Error Recovery, Data Recovery, Process Variation, Data Retention, Memory Systems, Memory Controllers, Data Storage Systems, Fault Tolerance, Computer Architecture

CMU-CS-18-101.pdf (13.18 MB) (255 pages)