Committed to revolutionizing SSD resiliency with a shift-left approach

Steven Wells | November 2023

Micron is and has been deeply committed to making world-class SSDs for the data center. We have shipped tens of millions of SSDs into data centers to date and are ramping new SSDs with our 232-layer NAND technology. A world-class SSD offers not only power efficiency and high performance but also design resiliency. Resiliency means the drive will have a long and useful life in its data center home.

Defining high resiliency has been a topic of the OCP Storage Workgroup in collaboration with device and host manufacturers. The OCP Storage Workgroup has refined and enhanced vertically integrated high resiliency over the three major releases of its Datacenter-NVMe-Specification (which I’ll call the “OCP SSD Spec” for the remainder of this article). Vertically integrated resiliency means that both the host and the device take on elements of making a highly resilient storage subsystem.

Our vision is a “shift-left” in the effort needed to create fleet-wide high resiliency: less time debugging and replacing failed drives, and more time proactively monitoring fleet health and improving the ability to recover without data loss. This article discusses the multiple elements of this solution, along with Micron’s view of what further enhancements might come next.

OCP Storage resiliency architecture: a shift-left approach to prevention, detection, recovery and reporting of SSD failures

The history of SSD resiliency

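Before the first version of the OCP specification, Micron worked toward seamless intrinsic recovery and self-healing. These measures included retiring bad blocks, implementing an internal XOR solution we call Redundant Array of Independent NAND (RAIN), and providing CRC detection and retransmission on the SATA or PCIe bus. We reported information about such events through SMART. We worked to collect and monitor this SMART data, not only to track overall fleet health and identify potential outliers but also to improve our solutions going forward.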
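
The XOR idea behind a RAIN-style scheme can be illustrated with a minimal sketch. This is not Micron's implementation: the stripe layout, page sizes, and the function names `xor_pages` and `rebuild_page` are assumptions made purely for illustration. It shows only the general principle that a parity page, computed as the XOR of the data pages in a stripe, allows any single lost page to be rebuilt from the survivors.

```python
from functools import reduce


def xor_pages(pages):
    """Return the byte-wise XOR of a list of equal-length pages."""
    if not pages:
        raise ValueError("at least one page is required")
    length = len(pages[0])
    if any(len(page) != length for page in pages):
        raise ValueError("all pages must have the same length")
    # zip(*pages) walks the pages column by column, one byte position at a time.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))


def rebuild_page(surviving_pages, parity):
    """Rebuild the single missing page of a stripe from the survivors and parity."""
    return xor_pages(list(surviving_pages) + [parity])


# Hypothetical stripe of three tiny "pages" (real NAND pages are far larger).
stripe = [b"NAND", b"page", b"data"]
parity = xor_pages(stripe)

# Simulate losing the second page, then rebuild it from the rest of the stripe.
lost_index = 1
survivors = [page for i, page in enumerate(stripe) if i != lost_index]
recovered = rebuild_page(survivors, parity)
```

A single XOR parity page covers the loss of one page per stripe; tolerating more simultaneous losses would need a different code, which this sketch does not attempt.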

The history of OCP Storage resiliency

The first effort at a vertically integrated solution for enhanced resiliency, in which both the host and the device take on elements of making a highly resilient storage subsystem, was championed by Microsoft and first contributed in OCP Spec V1, which introduced the concept of Error Recovery (log page C1h). This allowed the device to inform the host of an internal panic condition and instruct the host on how to fetch vendor-unique debug information as well as how to perform a recovery procedure. The V1 spec supported multiple recovery actions, but other parts of the spec (CRASH-4) suggested a FORMAT command, which erases all data on the device beyond recovery, as the only way to recover from an internal panic. Microsoft also offered leadership in OCP Spec V1 around the concept of Error Injection for robust vertical integration testing, with both host and device participating.

The V2 specification enhanced the recovery procedure by offering additional C1h fields. This specification was also the first to introduce the OCP Storage Latency Monitor feature, which allows the drive to self-report high-latency I/O events and even include vendor-unique debug information. These reports can be compared against host I/O latency logs to help root-cause the problem and, if it is a storage device issue, provide clues internally to support corrective action.
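
The comparison between host latency logs and drive-reported events can be sketched as a simple timestamp match. Everything here is an assumption for illustration: the record types `HostSlowIo` and `DeviceLatencyEvent`, their fields, the `correlate` function, and the one-second window are hypothetical and are not the Latency Monitor log format defined by the OCP SSD Spec. The sketch also assumes host and device timestamps have already been brought onto a common clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostSlowIo:
    """A slow I/O as seen by the host (hypothetical record layout)."""
    timestamp_s: float
    latency_ms: float


@dataclass(frozen=True)
class DeviceLatencyEvent:
    """A high-latency event self-reported by the drive (hypothetical layout)."""
    timestamp_s: float
    latency_ms: float


def correlate(host_ios, device_events, window_s=1.0):
    """Split host slow I/Os into those the drive also reported and those it did not.

    Returns (matched, unmatched). matched holds (host_io, device_event) pairs;
    unmatched holds the host I/Os with no device event inside the time window.
    """
    matched, unmatched = [], []
    for io in host_ios:
        hit = next(
            (ev for ev in device_events
             if abs(ev.timestamp_s - io.timestamp_s) <= window_s),
            None,
        )
        if hit is not None:
            matched.append((io, hit))
        else:
            unmatched.append(io)
    return matched, unmatched


host_ios = [HostSlowIo(100.0, 850.0), HostSlowIo(250.0, 400.0)]
device_events = [DeviceLatencyEvent(100.4, 820.0)]
matched, unmatched = correlate(host_ios, device_events)
```

In this toy data, the first slow I/O lines up with a drive-reported event, which points toward the storage device, while the second has no matching event and suggests looking elsewhere in the stack first.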

Some exciting features in the recently released V2.5 specification continue to offer better vertical resiliency integration. Standardized telemetry is the biggest element and accounts for a majority of the new capabilities in this revision. Prior specification revisions ultimately led each vendor to add unique and proprietary monitoring and debug information that required fetching either vendor-unique log pages or telemetry. The vendor would then request a binary file transfer or offer a vendor-unique decode tool to generate human-readable output. Standardized telemetry in the OCP SSD V2.5 spec resolves this by offering ways to both report and decode vendor-unique debug information with a standardized decode tool. This improves debug efficiency immediately, because the host no longer needs specialized data capture and decode functions.

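The Standardized Telemetry project has created a simple way to collect all the important health data from distributed systems. It uses a single I/O command that works with any compliant storage device. The host can then capture and decode the data from the first telemetry data area. This data contains all the details that the host and the vendor need to work together. They can identify devices that are failing or about to fail, and improve their health monitoring solutions going forward.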
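
Once health data has been decoded, spotting a drive that is drifting away from the rest of the fleet can be as simple as a robust outlier check. The sketch below is one possible approach, not something prescribed by the OCP SSD Spec: the drive names, the counter values, the `find_outliers` function, and the threshold of 5 are all hypothetical. It flags drives whose counter sits far above the fleet median, measured in units of the median absolute deviation (MAD).

```python
from statistics import median


def find_outliers(counter_by_drive, threshold=5.0):
    """Return drives whose counter is far above the fleet median.

    Distance is measured in median absolute deviations (MAD). If the MAD is
    zero (most drives report the same value), a scale of 1.0 is used instead
    so the division is always defined.
    """
    values = list(counter_by_drive.values())
    center = median(values)
    mad = median(abs(value - center) for value in values)
    scale = mad or 1.0
    return sorted(
        drive
        for drive, value in counter_by_drive.items()
        if (value - center) / scale > threshold
    )


# Hypothetical decoded error counters, one per drive in a small fleet.
fleet = {"drive-a": 2, "drive-b": 3, "drive-c": 2, "drive-d": 4, "drive-e": 41}
suspects = find_outliers(fleet)
```

The median and MAD are used instead of the mean and standard deviation because a single badly behaving drive would otherwise inflate the very statistics used to detect it.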

Heading forward

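At FMS 2023, Microsoft's Ayberk Ozturk presented their vision for the future of vertically integrated high resiliency. They expressed a strong desire to have data recovery as part of panic recovery, rather than the FORMAT command that the current specification calls for. They argued that as storage devices grow larger, more tenants might be using a single direct-attached drive, and that after a panic it would be desirable to recover with full (or even partial) data recovery rather than terminating multiple virtual machines. They believe this will drive discussion of concepts that leverage live migration. Exploring the details of such a solution is a good goal for 2024.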

A vision

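What used to be reported as asserts and panics has become recovery. What was recovery becomes detection, and what was detection becomes prevention. This is the classic shift-left. Micron is excited and committed to continuing to work with the industry and OCP Storage toward this future.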

Reach out to Micron with your ideas. This is a collaboration, so let's work together.

Fellow, Storage Systems Architect

Steven Wells

Steven Wells is a Fellow at Micron, focusing on next-generation SSD solutions, with more than 65 patents in the area of non-volatile storage. He has been involved in flash component and SSD design since 1987 and has published at multiple conferences including ISSCC, JSSC, Flash Memory Summit, Storage Developer Conference, and OCP Global Summit.