The fast-emerging NVMe interface rewrites the rules for connecting fast SSDs directly to servers.
Micron announced the first PCIe-based SSD product to use NVMe almost six months ago, and now, with NVMe gaining wider acceptance, we can expect a raft of followers. NVMe makes a lot of sense for flash accelerators, too, with the expectation that the performance speed-ups will be quite substantial.
NVMe is a communications protocol and device interface layer that has been optimized for PCIe-based SSDs. It lines up with SAS as a protocol and has similar command capabilities. Where it differs is in how data is transmitted, and how the driver interfaces with the OS on one side and the SSD on the other. The secret sauce is that the combination of NVMe and PCIe has been crafted to overcome most of the bottlenecks in a system that had stood in the way of performance with the very fastest flash memories.
New drive controller chips hitting the market from IDT now provide four or eight PCIe 3.0 lanes for each drive. At eight lanes, that's equivalent to more than two quad-lane SAS connections, based just on raw speed. Each PCIe 3.0 lane transfers data at 8Gbps, compared to 6Gbps per lane with SAS. Next-generation products will deliver as much as 16Gbps.
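The raw-speed comparison can be checked with a few lines of arithmetic. The sketch below uses the per-lane rates quoted above. The encoding efficiencies (128b/130b for PCIe 3.0, 8b/10b for 6Gbps SAS) are not from this article and are added as an assumption to show how the gap changes once line-coding overhead is counted. The function names are made up for illustration.

```python
# Raw line rates per lane, as quoted in the article (Gbps).
PCIE3_GBPS_PER_LANE = 8.0
SAS_GBPS_PER_LANE = 6.0

# Assumption (not from the article): line-encoding efficiency.
PCIE3_ENCODING = 128 / 130   # PCIe 3.0 uses 128b/130b encoding
SAS_ENCODING = 8 / 10        # 6Gbps SAS uses 8b/10b encoding


def link_gbps(lanes, gbps_per_lane, encoding=1.0):
    """Aggregate link rate in Gbps for a given lane count."""
    return lanes * gbps_per_lane * encoding


def sas_quad_equivalents(pcie_lanes, use_encoding=False):
    """How many quad-lane SAS links one PCIe 3.0 link is worth."""
    pcie_eff = PCIE3_ENCODING if use_encoding else 1.0
    sas_eff = SAS_ENCODING if use_encoding else 1.0
    pcie = link_gbps(pcie_lanes, PCIE3_GBPS_PER_LANE, pcie_eff)
    sas_quad = link_gbps(4, SAS_GBPS_PER_LANE, sas_eff)
    return pcie / sas_quad


if __name__ == "__main__":
    for lanes in (4, 8):
        raw = sas_quad_equivalents(lanes)
        coded = sas_quad_equivalents(lanes, use_encoding=True)
        print(f"x{lanes} PCIe 3.0 = {raw:.2f} SAS quads raw, "
              f"{coded:.2f} after encoding overhead")
```

On raw speed alone, an eight-lane link works out to 64Gbps against 24Gbps for a SAS quad, which is where the "more than two" figure comes from. The gap widens a little under the encoding assumption.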
But that's not the whole story, by any means. The problem with any fast protocol is how it handles interrupts and the resulting overhead. SAS has a traditional driver architecture. Each I/O generates its own interrupt, and the file system/driver stack has multiple layers. I/Os are SENT to the drive, which queues up a bunch and then works on them.
To match the speed of the new SSDs or flash accelerators, NVMe uses a completely novel method of interfacing. Instead of transmitting the I/O to the drive, the driver builds queues of I/Os for each drive, and the drive takes advantage of direct memory access (DMA) over PCIe to PULL a queue entry when it needs a new I/O to process. The drive enters completions in a corresponding queue, but doesn't interrupt the system for each one.
The host driver maintains queues for each processor core, and it has a mechanism for vectoring a single interrupt for a number of completions back to the originating core. These queue features speed up processing dramatically. Just avoiding context swapping on interrupts is a huge saving, and the concept of "affinitizing" jobs for efficiency in processing and caching has taken a big step forward.
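To make the queueing model concrete, here is a toy simulation in Python. It is a sketch of the idea only, not the NVMe driver or the specification's data structures: the class names, the `coalesce` parameter and the in-memory deques are all made up for illustration. Each core owns a submission/completion queue pair, the "drive" pulls entries rather than having them pushed to it, and one interrupt is raised per batch of completions and directed at the originating core.

```python
from collections import deque


class QueuePair:
    """One submission/completion queue pair, owned by a single CPU core."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.submission = deque()   # I/Os the driver has queued in host memory
        self.completion = deque()   # results posted back by the drive


class ToyDrive:
    """Stand-in for the SSD: it pulls work instead of having it pushed."""

    def __init__(self, coalesce=4):
        self.coalesce = coalesce    # completions per interrupt (hypothetical knob)
        self.interrupts = []        # core ids that were interrupted, in order

    def service(self, qp):
        pending = 0
        while qp.submission:
            io = qp.submission.popleft()        # models the DMA fetch of one entry
            qp.completion.append((io, "ok"))    # post the completion entry
            pending += 1
            if pending == self.coalesce:
                self.interrupts.append(qp.core_id)  # one interrupt for the batch
                pending = 0
        if pending:
            self.interrupts.append(qp.core_id)  # flush the final partial batch


def run(cores=2, ios_per_core=8, coalesce=4):
    """Queue I/Os on every core, let the drive drain them, return the state."""
    drive = ToyDrive(coalesce)
    pairs = [QueuePair(c) for c in range(cores)]
    for qp in pairs:
        for n in range(ios_per_core):
            qp.submission.append(f"core{qp.core_id}-io{n}")
    for qp in pairs:
        drive.service(qp)
    return drive, pairs


if __name__ == "__main__":
    batched, _ = run(coalesce=4)
    per_io, _ = run(coalesce=1)
    print("interrupts with batching:", len(batched.interrupts))
    print("interrupts, one per I/O: ", len(per_io.interrupts))
```

Setting `coalesce=1` approximates the interrupt-per-I/O behavior described for SAS. Raising it shows how the interrupt count, and with it the number of context switches, falls for the same amount of work.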
The bottom line is that NVMe is far faster than SAS.
It's worth noting here that a SAS quad offers three times the performance of Fibre Channel (FC), which raises the question: Is SAS doomed, just like FC seems to be? FC is getting squeezed out between SAS, for short-haul storage connectivity, and Ethernet, for longer distances. Short-term, SAS is well entrenched, and infrastructure such as switches and adapters exists, together with motherboard ports. NVMe won't have much effect on FC in 2013.
Starting in 2014, though, the situation grows more complex. PCIe has a short connection length, measured only in inches. The extension protocol, Thunderbolt, looks fast, at 10Gbps per lane, but the architecture doesn't seem capable of extending PCIe 3.0 connections at anything like native speed. It also has a limit of two lanes per configuration today. Unless Thunderbolt evolves substantially, from serving merely as a challenger to USB, it isn't likely to produce, say, a 5-meter capability with SAS-level performance.
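A quick calculation shows the size of the Thunderbolt shortfall, using only the figures quoted in this article (10Gbps per lane, two lanes, against 8Gbps per PCIe 3.0 lane). The function name is made up for illustration, and the comparison covers raw line rates only.

```python
# Figures as quoted in the article (Gbps).
THUNDERBOLT_GBPS_PER_LANE = 10
THUNDERBOLT_MAX_LANES = 2
PCIE3_GBPS_PER_LANE = 8


def thunderbolt_fraction_of_pcie(pcie_lanes):
    """Fraction of a native PCIe 3.0 link's raw rate that Thunderbolt can carry."""
    thunderbolt = THUNDERBOLT_GBPS_PER_LANE * THUNDERBOLT_MAX_LANES
    pcie = PCIE3_GBPS_PER_LANE * pcie_lanes
    return thunderbolt / pcie


if __name__ == "__main__":
    for lanes in (4, 8):
        share = thunderbolt_fraction_of_pcie(lanes)
        print(f"Thunderbolt carries {share:.0%} of a x{lanes} PCIe 3.0 link")
```

On these numbers, a two-lane Thunderbolt connection covers well under the raw rate of even a four-lane drive link, before any protocol overhead is considered.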
Another way to look at this is to consider how a system might be built. It's quite possible that any system needing NVMe-class speeds needs only two or four local drives, which are connectable by PCIe inside the server box. Any other connectivity to storage would still come from SAS or 10/40/100GE to external storage boxes. I think this is the model looking forward to the 2014-16 timeframe.
Challenges galore
In this scenario, NVMe doesn't challenge SAS, but as a protocol over Ethernet, it might challenge traditional NAS-type connectivity to no end. At 100GE rates, we have all the same problems that NVMe was created to resolve. NVMe over 100GE makes quite a bit of sense, especially if variable block sizes are allowed, to cater for storage data blocks.
One final question: Is NVMe fast enough? We are looking at persistent memory with single-word write and read capability. I've done some architectural work in that area recently, and the challenges (and incredible opportunities) of being able to change one persistent word require a further extension to how a system interfaces to its storage.
Whether this involves directly addressing a broader address range over the PCIe interface, thus allowing direct writes/reads using memory-register CPU operands, or a faster stack than NVMe to cater for near-synchronous IOs, promises to be a particularly interesting debate.
Your thoughts?