The advent of all-flash Solid-State Drives (SSDs) has transformed storage in enterprise and cloud computing. SSDs offer greatly improved performance compared to traditional spinning hard disk drives (HDDs). It is therefore not surprising that 18% of businesses now use all-flash storage, with industry research predicting that a further 14% will have adopted it by 2022, and the all-flash storage market growing at roughly 11% per year. The combination of compute acceleration technologies, more modern applications, and faster storage media is forcing IT infrastructure architects to confront the networking challenge: getting data from the storage media to the application that needs it.
Most infrastructure now in place relies on the Internet Small Computer Systems Interface (iSCSI) standard, the networked version of the bus-based SCSI protocol, to link storage with servers. iSCSI carries SCSI commands over a TCP/IP network to deliver block-level access to storage devices. The difficulty is that iSCSI is a serial technology, far too slow to accommodate the massively parallel nature of solid-state drives. Non-Volatile Memory (NVM) Express, or NVMe, is a much better fit for this workload.
However, NVMe, too, has drawbacks. NVMe devices deliver excellent performance, but they can suffer from the local “puddling” of storage resources. The NVMe over Fabrics (NVMe-oF) protocol addresses this in theory, but in practice its various transports (Fibre Channel, RDMA over Converged Ethernet (RoCE), InfiniBand, iWARP, and TCP/IP) are not all created equal in terms of ease of use and cost, which makes choosing among them difficult. Of these, NVMe over TCP/IP stands out for performance, ease of use, and cost-efficiency because it utilizes standard TCP/IP networking.
iSCSI Overview
The IT hardware industry has been addressing the requirement to connect storage devices with computers for many decades. In the early 1980s, building on the earlier Shugart Associates System Interface (SASI), computer and storage vendors worked with the X3T9 technical committee of the American National Standards Institute (ANSI) to publish the original SCSI standard, X3.131-1986, known as “SCSI-1.” The acronym came to be pronounced “scuzzy,” an ironic borrowing of the American slang term for something sloppy and unattractive, a counterpoint to the technology’s slick capabilities.
Based on a shared bus transport model, SCSI comprises a set of standards for the physical connection and transfer of data between peripheral devices and computers. It defines protocols, commands, and interfaces for almost any kind of peripheral, but in practice, it was mostly used for hard disk drives.
Computers and storage technologies proceeded to grow larger and faster over time. With the resulting increases in bandwidth and distance requirements, the industry developed several new serial SCSI transports, including:
- Fibre Channel (FC)—a high-speed data transfer protocol that provides lossless delivery of raw block data. FC is mostly used to connect storage to servers in storage area networks (SANs). An FC network is a “switched fabric,” so named because its switches operate in unison, effectively as “one big switch.” FC networks usually run on fiber optic cables.
- Serial Attached SCSI (SAS)—a point-to-point serial protocol that moves data in and out of storage devices like HDDs or tape drives. It uses the standard SCSI command set.
- SCSI over TCP/IP (iSCSI)—a protocol that carries SCSI commands over the Transmission Control Protocol (TCP). iSCSI makes it possible to transport block-level SCSI data between an iSCSI “initiator” and a storage device using common TCP/IP network connections.
What is iSCSI?
SCSI stands, literally, for Small Computer Systems Interface. This protocol has been the de facto standard for computer and server access to storage for decades. The SCSI set of standards is used to carry blocks of data over short distances using a variety of serial and parallel buses internal to a computer. The widespread use of the Transmission Control Protocol/Internet Protocol (TCP/IP), which connects computer systems to TCP/IP networks such as the internet, brought SCSI and the internet together, creating iSCSI.
To understand iSCSI, it is first necessary to understand the basic SCSI protocol. SCSI is a block-based set of commands designed for Direct-Attached Storage (DAS) use cases. Through SCSI commands, a computer can send instructions to spin up storage media and execute data reads/writes. With SCSI, the client is known as the “initiator,” and the storage volume it accesses is called the “target.”
Communication between an initiator and a target using the SCSI protocol is organized as a queue of commands. The initiator places commands to be performed by the target in the queue and then receives data and acknowledgments for each queued command. There is a single queue for each connection between an initiator and a target.
SCSI vs. iSCSI
iSCSI is SCSI transported over TCP/IP, designed for accessing block storage over a network rather than through a direct attachment. With iSCSI, a connection between the initiator and the target consists of a single TCP/IP socket opened by a single thread. Because iSCSI is an update of SCSI, the long-established storage protocol, computer operating systems (OSs) implemented the same model in their storage stacks: access to a storage volume is always mediated by a single queue, corresponding to the single queue in the connection between the initiator and the target.
iSCSI Performance
iSCSI has become the common standard for block storage over standard networks because of the widespread adoption of TCP/IP in those networks. This is not optimal from a performance perspective, however. Applications that need low latency now rely on SSDs, and iSCSI was not designed to work well with SSDs; in iSCSI, TCP/IP simply transports the same original SCSI protocol developed for spinning hard drives.
This single-queue model was not a problem for many years. HDDs were in fact well suited to a single queue because a hard disk drive’s single read/write head assembly serializes access to the media. Flash memory behaves quite differently. There is no single head serializing data access; instead, flash memory is arranged as banks of storage elements, similar to how dynamic random access memory (DRAM) works.
Servers have also changed, which further undermines the single-queue iSCSI model. A typical server today may have dozens of CPU cores, so I/O issued from many cores in parallel must be mediated by locks on the single storage queue. This is not a big problem for spinning HDDs, which perform only about 100 input/output operations per second (IOPS); a typical enterprise storage system built with HDDs might deliver several thousand IOPS per volume. When storage migrates to flash, this all changes dramatically.
The single-queue model is problematic with flash SSDs. A single SSD is capable of hundreds of thousands of IOPS, so queue contention translates into a serious performance issue on a server with numerous cores. The standard iSCSI implementation compounds the problem by using a single thread and a single TCP socket per connection.
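To make the queueing difference concrete, the following minimal Python sketch (purely illustrative, with hypothetical names; it is not actual iSCSI or NVMe code) models many CPU cores submitting I/O commands either through one lock-protected shared queue, as the SCSI/iSCSI model requires, or through a private queue per core, as NVMe later allows. The shared queue needs a lock on every submission; the per-core queues need none.

```python
# Illustrative sketch only: it models the queueing difference described above,
# not the real iSCSI or NVMe wire protocols. All names are hypothetical.
import threading
from collections import deque

NUM_CORES = 8
IOS_PER_CORE = 1000

# --- Single shared queue (the SCSI/iSCSI model) --------------------------
shared_queue = deque()
shared_lock = threading.Lock()
contended = 0                 # times a core had to wait for the shared lock
count_lock = threading.Lock()

def submit_shared(core_id):
    global contended
    for i in range(IOS_PER_CORE):
        # Try to take the lock without blocking so we can detect contention.
        if not shared_lock.acquire(blocking=False):
            with count_lock:
                contended += 1
            shared_lock.acquire()          # now wait for our turn
        shared_queue.append((core_id, i))  # enqueue the "I/O command"
        shared_lock.release()

# --- One queue per core (the NVMe model) ---------------------------------
per_core_queues = [deque() for _ in range(NUM_CORES)]

def submit_per_core(core_id):
    q = per_core_queues[core_id]
    for i in range(IOS_PER_CORE):
        q.append((core_id, i))             # no lock needed: the queue is private

def run(worker):
    threads = [threading.Thread(target=worker, args=(c,)) for c in range(NUM_CORES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

run(submit_shared)
run(submit_per_core)
print(f"shared queue: {len(shared_queue)} commands, {contended} lock waits")
print(f"per-core queues: {sum(len(q) for q in per_core_queues)} commands, 0 lock waits")
```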
It is possible to perform multipath I/O by surfacing a logical unit via several target ports and load-balancing I/O over multiple threads and sockets. This strategy, however, is complicated to manage on the target side and still suffers from queueing issues on the initiator side, because the operating system treats the multiple paths as a single device. Performance inevitably lags.
The NVMe solution and its limitations
Getting better performance out of SSDs required a new communication protocol. More than ninety companies collaborated through the NVM Express Workgroup, with Intel among the leading firms, to develop an open, logical-device interface specification that lets a computer access non-volatile (i.e., flash) storage media, typically attached over the machine’s PCI Express (PCIe) bus.
NVM Express®, or NVMe®, is designed to take advantage of the internal parallelism and inherent low latency of SSDs. NVMe lets host hardware and software exploit that parallelism to a great degree, reducing I/O overhead and driving significant performance improvements compared to iSCSI.
NVMe has been transformative, emerging as the new standard protocol for accessing high-performance SSDs. Its initial limitation, however, was that NVMe was designed for direct-attached PCIe SSDs; the specification did not address networked storage. To correct this deficiency, the NVMe working group developed NVMe-oF®, which supports rack-scale remote pools of SSDs over network fabrics.
The IT industry has generally accepted that NVMe-oF will replace iSCSI as the preferred mode of communication between compute servers and storage servers. It will become the default protocol for disaggregating storage and compute servers.
There are some limitations to NVMe-oF, however. Initial deployment options for NVMe-oF were limited to Ethernet over Remote Direct Memory Access (RDMA) and Fibre Channel fabrics. These are only suitable for small-scale deployments. Indeed, most data centers do not currently support NVMe over Fabrics using technologies such as RDMA or Fibre Channel.
Issues with RDMA or Fibre Channel include:
- Requiring specialized equipment that is not commonly used in data centers
- Being generally out of the mainstream
- Requiring significant changes in the data center’s network infrastructure
- Locking the customer to a single vendor
Introducing NVMe/TCP
Given both the potential and the limitations of NVMe-oF, it made sense for the industry to find a way to take advantage of the performance characteristics of NVMe while leveraging the predominant TCP/IP network transport protocol. This was the genesis of NVMe/TCP. Lightbits Labs, working in collaboration with Facebook, Intel, and other industry leaders, is extending the NVMe-oF standard to support TCP/IP.
The approach is complementary to RDMA fabrics, and disaggregation with NVMe/TCP has the advantage of being simple and highly efficient. TCP is ubiquitous, scalable, and reliable, and well suited to short-lived connections and container-based applications. Additionally, migrating to shared flash storage with NVMe/TCP does not require changes to the data center network infrastructure, which should make deployment across the data center relatively easy. After all, nearly all data center networks are designed to carry TCP/IP.
Furthermore, the broad industry collaboration now taking place on the NVMe/TCP protocol makes it likely that NVMe/TCP will grow into a broad ecosystem supporting a wide range of operating systems and Network Interface Cards (NICs). For instance, the NVMe/TCP Linux drivers are a natural match for the Linux kernel: they use the standard Linux networking stack and NICs without any modifications. As a result, NVMe/TCP is showing great promise as a new protocol. It is suitable for hyperscale data centers, as one of many use cases, because it is easy to deploy with no changes to the underlying network infrastructure.
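Because NVMe/TCP rides on ordinary TCP/IP, the only network-level prerequisite for reaching a target is an ordinary TCP connection. The short Python sketch below (a hypothetical example with a placeholder address; 4420 is the port commonly used for NVMe/TCP I/O controllers) simply verifies that such a connection can be opened from a standard host, with no RDMA-capable NIC or lossless fabric involved. Actual discovery and connection would be handled by the standard Linux NVMe host driver.

```python
# Hypothetical reachability check: NVMe/TCP needs only an ordinary TCP connection,
# so a plain socket connect to the target's NVMe/TCP port is all the "fabric" requires.
# The address below is an example/documentation address, not a real target.
import socket

TARGET_ADDR = "192.0.2.10"
NVME_TCP_PORT = 4420          # port commonly used for NVMe/TCP I/O controllers

def can_reach_target(addr: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the target can be established."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    ok = can_reach_target(TARGET_ADDR, NVME_TCP_PORT)
    print(f"NVMe/TCP target {TARGET_ADDR}:{NVME_TCP_PORT} reachable: {ok}")
```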
The NVMe specification is built around highly parallelized access to solid-state storage, with no single bottleneck: I/O to an NVMe “namespace,” the equivalent of the SCSI logical unit, can be performed via up to 64,000 queues with up to 64,000 commands per queue. This parallel access is a critical architectural development because multiple queues eliminate the contention that is inherent in the SCSI model.
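As a rough, back-of-the-envelope illustration of those specification limits (in practice, a host driver typically creates one I/O queue pair per CPU core rather than anything close to the maximum), consider the following short calculation:

```python
# Illustrative arithmetic based on the specification limits quoted above.
scsi_queues = 1                 # one queue per initiator-target connection in iSCSI
nvme_queues = 64_000            # up to ~64K I/O queues per controller
nvme_queue_depth = 64_000       # up to ~64K commands per queue

max_outstanding = nvme_queues * nvme_queue_depth
print(f"NVMe theoretical outstanding commands: {max_outstanding:,}")  # 4,096,000,000

# In practice a host driver typically creates one queue pair per CPU core,
# so a 28-core server gets 28 independent queues instead of 1 shared queue.
cores = 28
print(f"Per-core queues on a {cores}-core server: {cores} (vs. {scsi_queues} for iSCSI)")
```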
iSCSI vs. NVMe/TCP
Comparing NVMe/TCP to iSCSI, the following benefits emerge:
- Extends NVMe across the entire data center using simple and efficient TCP/IP fabric
- Enables disaggregation across a data center’s availability zones and regions
- Leverages the TCP/IP transport and a highly parallel NVMe software stack to deliver low average and tail latencies
- Requires no changes to the networking infrastructure or application servers
- Delivers a high-performance NVMe-oF solution with the same performance and latency as direct-attached SSDs (DAS)
- Uses an efficient and streamlined block storage software stack optimized for NVMe and existing data centers
- Provides parallel access to storage optimized for today’s multi-core application/client servers
- Supports a standards-based solution that will have industry-wide support
Comparing performance of NVMe/TCP with iSCSI
How does NVMe/TCP stack up against iSCSI in terms of performance? Lightbits conducted a side-by-side comparison of the two technologies to arrive at benchmark results. The test used the industry-standard flexible input/output (FIO) benchmark tool with identical setups for iSCSI and NVMe/TCP. Specifically, the initiator was a dual-socket server with Intel® Xeon® E5-2648L v4 processors running at 1.8 GHz, with 14 cores per socket and two hyperthreads per core. The target machine had the same socket and core count but ran Xeon E5-2650L v4 processors at 1.7 GHz. The SSDs in the target were Intel DC P3520, 450 GB.
All I/O was performed via Mellanox ConnectX-4 100GbE NICs on both machines, and both the initiator and target ran the Linux 4.13.0-rc1 kernel. For iSCSI, the stock iSCSI initiator in the kernel was used; for NVMe/TCP, the test used an open-source initiator contributed by Lightbits. For iSCSI storage, a 12-drive RAID-0 group was configured, and for Lightbits NVMe/TCP, the exported namespace used the same hardware and the same 12 SSDs.
The benchmark modeled a typical multithreaded, I/O-bound application. Such applications often open a thread per task, with each thread performing a single I/O at a time (i.e., multiple threads, each with an I/O queue depth of 1). The number of threads was then varied to measure I/O scalability as the thread count increased, a critical test for applications that run on today’s servers with their dozens of CPU cores. The I/O mix was 70% reads and 30% writes.
Before running the benchmarks, all of the SSDs were overwritten entirely to precondition the drives. For iSCSI, the maximum commands per session was set to 2048 and the logical unit queue depth to 1024, both the maximums allowed by the implementation. With these settings, the test was a reasonable comparison of the two TCP-based protocols on identical hardware and software configurations.
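For reference, the workload described above might be expressed with the fio tool roughly as sketched below. The 70/30 read/write mix, queue depth of 1 per thread, and thread-count sweep come from the description above; the device path, block size, run time, and I/O engine are assumptions, since the article does not specify them.

```python
# Sketch of an fio invocation approximating the workload described above.
# Assumptions (not stated in the article): 4 KiB blocks, 60 s time-based runs,
# the libaio engine, and a placeholder device path. Only the 70/30 mix,
# iodepth=1, and the thread-count sweep come from the text.
import subprocess

DEVICE = "/dev/nvme0n1"   # placeholder block device exported by the target

def fio_cmd(num_threads: int) -> list[str]:
    return [
        "fio",
        "--name=iscsi-vs-nvmetcp",
        f"--filename={DEVICE}",
        "--rw=randrw",            # mixed random read/write
        "--rwmixread=70",         # 70% reads / 30% writes, as in the benchmark
        "--iodepth=1",            # one outstanding I/O per thread
        f"--numjobs={num_threads}",
        "--ioengine=libaio",
        "--direct=1",             # bypass the page cache
        "--bs=4k",
        "--runtime=60", "--time_based",
        "--group_reporting",
    ]

if __name__ == "__main__":
    for threads in (1, 4, 16, 64, 128, 256):   # sweep thread counts as in the test
        print(" ".join(fio_cmd(threads)))
        # subprocess.run(fio_cmd(threads), check=True)  # uncomment to actually run
```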
The results: iSCSI reached a peak of 89K IOPS when running 128 threads over two multipathed target connections in active-active mode. This is the maximum I/O performance such an application can obtain with iSCSI. It is worth noting that multipathing over two target connections provided minimal benefit compared to a single connection.
It is also troubling that above 64 threads, iSCSI incurred average latencies of over a millisecond. This is unacceptable I/O performance for flash drives serving today’s massively multi-core servers in hyperscale cloud services. These findings illustrate the inherent performance problem in iSCSI: the protocol’s single-queue, single-threaded model limits scalability.
The NVMe/TCP results showed the benefits of the natively multithreaded architecture. I/O to a single volume (NVMe namespace) easily rose to 725K IOPS for this application scenario, roughly an order of magnitude more than iSCSI. Moreover, at 256 threads, the initiator showed no sign of performance saturation. Importantly, NVMe/TCP I/O latencies were much lower than the iSCSI latencies at similar thread counts, even though iSCSI was delivering far fewer IOPS. Even at 256 threads and 725K IOPS, NVMe/TCP latency remained under 500μs.
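As a sanity check, Little’s Law (outstanding I/Os = IOPS × average latency) ties the reported thread counts, IOPS, and latencies together. The short calculation below uses only the figures quoted above and shows that the reported numbers are self-consistent: iSCSI’s implied average latency is well above a millisecond at 128 threads, while NVMe/TCP’s stays near 350μs at 256 threads.

```python
# Little's Law: outstanding I/Os = IOPS x average latency,
# so average latency = outstanding I/Os / IOPS.
# Each benchmark thread keeps one I/O outstanding (queue depth 1).

def implied_latency_us(threads: int, iops: float) -> float:
    """Average latency in microseconds implied by Little's Law."""
    return threads / iops * 1_000_000

# Figures reported above.
iscsi_lat = implied_latency_us(threads=128, iops=89_000)
nvme_tcp_lat = implied_latency_us(threads=256, iops=725_000)

print(f"iSCSI at 128 threads, 89K IOPS     -> ~{iscsi_lat:.0f} us average latency")    # ~1438 us
print(f"NVMe/TCP at 256 threads, 725K IOPS -> ~{nvme_tcp_lat:.0f} us average latency")  # ~353 us
```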
Why NVMe/TCP is the Right Choice for Cloud Application Acceleration
The growth of SSDs has exposed a performance problem: the iSCSI single-threaded, single-queue model is not optimal for connecting compute servers with network-attached SSDs. NVMe delivers far superior performance characteristics, but some NVMe-oF solutions have distinct limitations, such as the need to run on specialized network components. NVMe/TCP offers a solution: it enables NVMe over the standard, common TCP/IP network transport protocol. In side-by-side tests, the question of NVMe/TCP vs. iSCSI is clearly resolved, with NVMe/TCP delivering IOPS performance an order of magnitude greater than what is possible with iSCSI.
Additional Resources
Direct Attached Storage (DAS) Disadvantages & Alternatives
Cloud-Native Storage for Kubernetes
Disaggregated Storage
Ceph Storage
Persistent Storage
Kubernetes Storage
Edge Cloud Storage
NVMe® over TCP