Non-deterministic performance of block storage

Since we are talking about storage as it is deployed today, we’ll restrict ourselves to solid-state media, that is, NAND flash. For hard disk drives it is well known that performance is dominated by rotational latency and seek time. If we break an I/O path down into layers, the following appear:

  1. Host interfaces and driver stack (e.g., SATA, SAS, NVMe)
  2. Controller (i.e., how many channels, and how many NAND dies per channel)
  3. NAND media (i.e., physical properties such as erase-before-rewrite and the difference between read and program times).

Let’s look at each layer a little closer.

Host Interfaces
The SATA interface tops out at 6 Gbps. SAS is currently at 12 Gbps. NVMe, which runs over PCIe, comes in x2, x4 and x8 lane options (currently gen 3 at 8 GT/s per lane, or roughly 1 GB/s of usable bandwidth per lane). This is really NOT an apples-to-apples comparison, especially when the NAND media is capable of saturating these interfaces! More about that when we compare solid-state disks (SSDs) by interface and controller design.
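To put rough numbers on these link speeds, here is a minimal sketch that converts line rate into usable payload bandwidth. It assumes 8b/10b encoding for SATA and SAS and, for PCIe gen 3, 8 GT/s per lane with 128b/130b encoding; protocol framing overhead is ignored, and the function name is mine.

```python
# Usable payload bandwidth per link after line-encoding overhead.
# Assumption: only the line encoding is accounted for, not protocol framing.

def usable_mb_per_s(line_rate_gbps, payload_bits, encoded_bits, lanes=1):
    """Convert a per-lane line rate (Gbit/s) to payload MB/s (1 MB = 10**6 bytes)."""
    payload_gbps = line_rate_gbps * payload_bits / encoded_bits * lanes
    return payload_gbps * 1000 / 8

links = {
    "SATA 6 Gbps": usable_mb_per_s(6, 8, 10),                 # 8b/10b
    "SAS 12 Gbps": usable_mb_per_s(12, 8, 10),                # 8b/10b
    "PCIe gen 3 x2": usable_mb_per_s(8, 128, 130, lanes=2),   # 128b/130b
    "PCIe gen 3 x4": usable_mb_per_s(8, 128, 130, lanes=4),
    "PCIe gen 3 x8": usable_mb_per_s(8, 128, 130, lanes=8),
}

for name, mb_per_s in links.items():
    print(f"{name:14s} ~{mb_per_s:6.0f} MB/s")
```

The spread between the slowest and fastest link here is more than a factor of ten, which is why comparing SSDs across interfaces says little about the media behind them.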

Question: Does the host interface add to the non-deterministic performance?
Answer: In my opinion, it does not! Yes, link speeds matter, but the protocol overheads are constant, so they add little to the variability of the performance.

Host Drivers
As shown in the diagram below (courtesy of an IDF 2011 presentation), the SATA/SAS host stack requires ~30k instructions to issue an I/O, while the NVMe stack requires ~9k. One way to think of this is in terms of CPU cost per I/O: on average, the CPU cost of issuing an I/O through the NVMe stack is about one-third that of the SCSI stack. In terms of time, on a 3.0 GHz processor retiring roughly one instruction per clock, 30k instructions take about 10 us per I/O, while the NVMe stack takes ~3 us.
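The arithmetic behind those two figures is simple enough to write down. This is only a back-of-the-envelope sketch: it assumes one instruction retired per clock, which real processors rarely hit exactly, and the function name is mine.

```python
def cpu_time_per_io_us(instructions_per_io, clock_hz=3.0e9, instructions_per_clock=1.0):
    """CPU time spent issuing one I/O, in microseconds.

    instructions_per_clock=1.0 is an assumption made for illustration.
    """
    cycles = instructions_per_io / instructions_per_clock
    return cycles / clock_hz * 1e6

scsi_us = cpu_time_per_io_us(30_000)   # SATA/SAS (SCSI) stack
nvme_us = cpu_time_per_io_us(9_000)    # NVMe stack

print(f"SCSI stack: {scsi_us:.1f} us per I/O")
print(f"NVMe stack: {nvme_us:.1f} us per I/O")
```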

 

If you look at the right side of the graph (Clocks/IO), you can see that the CPU cost per I/O increases from ~22k clocks to ~40k clocks as the number of cores increases. Most of the increase seems to be tied to how interrupts are processed; however, lock contention, memory allocation and other areas also contribute to this behavior.
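One way to read the Clocks/IO numbers is as a ceiling on the I/O rate a single core can issue. The sketch below reuses the 3.0 GHz clock from the earlier example as an assumption and ignores everything a core does besides issuing I/O.

```python
def iops_ceiling_per_core(clocks_per_io, clock_hz=3.0e9):
    """Upper bound on I/Os per second a core can issue if it does nothing else."""
    return clock_hz / clocks_per_io

few_cores = iops_ceiling_per_core(22_000)    # cost per I/O at low core counts
many_cores = iops_ceiling_per_core(40_000)   # cost per I/O at high core counts

# Fraction of the ideal (linear) per-core rate that survives at high core counts.
scaling_efficiency = many_cores / few_cores

print(f"~{few_cores:,.0f} IOPS per core at 22k clocks/IO")
print(f"~{many_cores:,.0f} IOPS per core at 40k clocks/IO")
print(f"scaling efficiency: {scaling_efficiency:.0%}")
```

Put differently, each added core contributes only a little over half the I/O rate that the first cores did.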

Question: Does the host driver stack affect the non-determinism in storage performance?
Answer: Yes. It is clear that the driver stack does not scale linearly with core count, so the CPU cost of an I/O depends on how many cores are issuing I/O. Servers today have 18 cores per socket, hence the need for a scalable I/O stack such as NVMe. Please note that some changes have been made to the SCSI stack, such as scsi-mq (SCSI multi-queue), which help address some of these scalability issues.

Controller
The figure below depicts a general SSD controller. The NAND chips are laid out in channels (each chip is individually programmable, and channels can be accessed individually or in parallel). The controller then exposes them as LUNs (namespaces in NVMe) over SATA, SAS or PCIe interface(s).

In our experience, 8 channels are more than enough to saturate a 6 Gbps SATA link (typically data is laid out in a 7+1 fashion). The diagram below would be for a 12 Gbps SAS SSD (as it shows 16 channels). Expect 32 channels for a typical NVMe SSD (PCIe x4 interface). It is therefore clear that comparing SATA to SAS to PCIe SSDs (all in the same 2.5” form factor) is NOT an apples-to-apples comparison!

The more channels (and therefore chips or dies) there are behind a controller, the more performance one should expect from the SSD. Moreover, as the number of chips increases, one should expect a linear improvement in performance (both I/O operations per second (IOPS) and latency) and predictable performance until one of the components reaches its saturation point.
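The "linear until something saturates" behaviour can be sketched as a simple minimum of two limits. The per-channel figure of 100 MB/s is a made-up number for illustration only, and the link ceilings are nominal payload bandwidths (600 MB/s for 6 Gbps SATA, 1200 MB/s for 12 Gbps SAS, about 3938 MB/s for PCIe gen 3 x4); neither comes from a vendor datasheet.

```python
def ssd_throughput_mb_s(channels, per_channel_mb_s, link_mb_s):
    """Throughput grows linearly with channels until the host link saturates."""
    return min(channels * per_channel_mb_s, link_mb_s)

PER_CHANNEL_MB_S = 100  # hypothetical per-channel NAND bandwidth

configs = [
    ("SATA 6 Gbps", 8, 600),
    ("SAS 12 Gbps", 16, 1200),
    ("PCIe gen 3 x4", 32, 3938),
]

for name, channels, link_mb_s in configs:
    raw = channels * PER_CHANNEL_MB_S
    usable = ssd_throughput_mb_s(channels, PER_CHANNEL_MB_S, link_mb_s)
    limit = "link" if raw > link_mb_s else "NAND channels"
    print(f"{name:14s} {channels:2d} channels: {usable:5d} MB/s (limited by {limit})")

# Sweep the channel count on a SATA link to see where the curve flattens.
sata_sweep = [ssd_throughput_mb_s(n, PER_CHANNEL_MB_S, 600) for n in (2, 4, 6, 8)]
print(sata_sweep)
```

Under these assumed numbers the SATA and SAS devices are limited by the link, while the 32-channel PCIe device is still limited by its NAND channels.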

[Figure: general SSD controller with 16 NAND channels]


NAND Media

A lot of material is available on the properties of NAND flash and its various types (e.g., SLC, MLC, e-MLC). The main cause of non-deterministic performance, in my view, is the fact that NAND is NOT an overwrite medium: existing data must be erased before it can be rewritten. This manifests itself in all SSDs as a discrepancy between read and write performance, in both IOPS and latency. SSDs generally reserve capacity (over-provisioning) in order to mitigate the effects of the garbage collection that handles overwrites. We have seen over-provisioning from various SSD vendors in the range of 7-28% of NAND capacity. As is amply documented, the efficiency of garbage collection is directly related to the amount of free (reserved) capacity. Therefore, all else being equal, an SSD with 28% over-provisioning will yield higher write IOPS than one with 7%.
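To illustrate why reserved capacity matters, here is a toy simulation of a page-mapped flash translation layer with greedy garbage collection under uniformly random overwrites. It is a sketch under stated assumptions, not any vendor's algorithm: the geometry (128 blocks of 32 pages), the greedy victim choice and all names are mine. It reports write amplification, that is, NAND page writes per host page write, measured over the second half of the run once the device has filled.

```python
import random

def simulate_write_amplification(over_provisioning, blocks=128, pages_per_block=32,
                                 host_writes=60_000, seed=1):
    """Return NAND page writes per host page write (toy model, greedy GC)."""
    rng = random.Random(seed)
    physical_pages = blocks * pages_per_block
    logical_pages = int(physical_pages / (1.0 + over_provisioning))

    location = [None] * logical_pages        # logical page -> block holding it
    valid = [set() for _ in range(blocks)]   # block -> logical pages still valid
    programmed = [0] * blocks                # pages programmed since last erase
    free_blocks = list(range(1, blocks))     # block 0 starts as the active block
    state = {"active": 0, "nand_writes": 0}

    def program(lpn):
        # Open a fresh block when the active one is full.
        if programmed[state["active"]] == pages_per_block:
            state["active"] = free_blocks.pop()
        old = location[lpn]
        if old is not None:
            valid[old].discard(lpn)          # the previous copy becomes stale
        block = state["active"]
        valid[block].add(lpn)
        location[lpn] = block
        programmed[block] += 1
        state["nand_writes"] += 1

    def collect_one_block():
        # Greedy policy: reclaim the full block with the fewest valid pages.
        candidates = [b for b in range(blocks)
                      if programmed[b] == pages_per_block and b != state["active"]]
        victim = min(candidates, key=lambda b: len(valid[b]))
        for lpn in list(valid[victim]):      # relocate surviving pages
            program(lpn)
        programmed[victim] = 0               # erase
        free_blocks.append(victim)

    half = host_writes // 2
    nand_at_half = 0
    for i in range(host_writes):
        if i == half:
            nand_at_half = state["nand_writes"]
        while len(free_blocks) < 2:          # keep one block spare for relocation
            collect_one_block()
        program(rng.randrange(logical_pages))
    return (state["nand_writes"] - nand_at_half) / (host_writes - half)

for op in (0.07, 0.28):
    wa = simulate_write_amplification(op)
    print(f"{op:.0%} over-provisioning: write amplification ~{wa:.2f}")
```

With more reserved capacity, the blocks picked for collection hold fewer valid pages, so less data is copied per reclaimed block and more of the NAND write bandwidth is left for the host.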

One other thing to point out is that block devices usually have a block size of 4k, which means that the operating system issues I/O in units of 4k. As NAND geometries have been shrinking, one effect has been an increase in NAND page size. To my understanding, the NAND page size we currently see across various SSD vendors is 16k (for MLC). This has an indirect effect on the efficiency of garbage collection, in the sense that the minimum size of a NAND write is now 16k.
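The mismatch between a 4k host write and a 16k NAND page can be quantified with a small sketch. It assumes the worst case, where every host write is programmed on its own with no coalescing; real controllers typically gather small writes in a buffer before programming a page, so treat this as an upper bound. The function names are mine.

```python
PAGE_SIZE = 16 * 1024  # NAND page size in bytes

def nand_bytes_programmed(offset, length, page_size=PAGE_SIZE):
    """Bytes of NAND programmed for one host write, assuming no write coalescing."""
    first_page = offset // page_size
    last_page = (offset + length - 1) // page_size
    return (last_page - first_page + 1) * page_size

def page_amplification(offset, length, page_size=PAGE_SIZE):
    """Ratio of NAND bytes programmed to host bytes written."""
    return nand_bytes_programmed(offset, length, page_size) / length

print(page_amplification(0, 4 * 1024))          # aligned 4k write
print(page_amplification(14 * 1024, 4 * 1024))  # 4k write straddling two pages
print(page_amplification(0, 16 * 1024))         # full-page write
```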

SSD vendors have considerable IP in the design of these algorithms, and hold these techniques close to their chest. These designs have improved over generations of SSDs, as is amply apparent from the IOPS and latency (not to mention density) improvements each generation has provided.

Conclusion
We have taken a myopic view of SSD design and tried to break it down into its parts. However, from a systems point of view, larger questions remain once one realizes that a typical system contains multiple SSDs, together capable of millions of IOPS:

(1) How can I share this performance among various workloads running on the system?

(2) Can I provide meaningful quality of service (in terms of latency and IOPS) among the various workloads on the system? Without interference? Without over-provisioning and hard-coding a single device to a workload?