Understanding NVMe-over-TCP, the simplest form of NVMe SAN storage

Among the various transports of the NVMe protocol across SAN storage infrastructures, this one runs directly over well-known, inexpensive TCP/IP networks.

NVMe-over-TCP, or NVMe/TCP, is the most recent variant of the NVMe-over-Fabrics protocol, aka NVMe-oF. Its distinguishing feature is that it encapsulates NVMe commands in entirely standard TCP/IP packets.

All versions of NVMe-over-Fabrics allow servers and SAN arrays to communicate using NVMe commands, which are far better suited to flash storage devices than the SCSI commands designed for hard drives. Like NVMe-over-RoCE, NVMe-over-TCP carries these commands over a physical Ethernet network. But unlike the latter, administrators do not have to deploy special switches or server network cards.

Advantages and disadvantages

NVMe-over-TCP is therefore one of the most economical solutions, since Ethernet infrastructure is cheaper than the Fibre Channel used in NVMe-over-FC, and one of the simplest to implement, all the more so because TCP/IP networks are already well understood.

Incidentally, because NVMe-over-TCP is natively routable, servers and their storage arrays can communicate over an existing corporate network, without the need for dedicated switches. This even works across the Internet, to write or read data on storage located elsewhere, typically at a backup site, or at company headquarters when the server sits in a branch office.

Despite these advantages, NVMe-over-TCP also has drawbacks. The most important is that it consumes the server's computing power, which is then no longer fully available to run applications. Among the processor-intensive TCP operations is the calculation of a checksum for each packet.
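The per-packet cost mentioned here is the 16-bit ones'-complement Internet checksum that TCP computes over every segment, as defined in RFC 1071. A minimal Python sketch of the calculation illustrates why doing this for each packet adds up on the host CPU (the software loop below is for illustration; real stacks use optimized or offloaded versions):

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement Internet checksum (RFC 1071 algorithm)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]   # add the next 16-bit word
        total = (total & 0xFFFF) + (total >> 16)  # fold any carry back in
    return ~total & 0xFFFF  # ones' complement of the folded sum

# Example from RFC 1071: the bytes 00 01 f2 03 f4 f5 f6 f7
# yield the checksum 0x220D.
print(hex(internet_checksum(bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7]))))
```

Network cards with checksum offload relieve the CPU of exactly this loop, which is one reason hardware-assisted NVMe-over-TCP adapters exist.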

Another disadvantage is that it introduces more latency than other NVMe-over-Fabrics transports. This is due in particular to the need to keep multiple copies of data in the streams to recover from packet loss during routing. According to several performance tests, an NVMe-over-TCP connection shows 10 to 80 microseconds more latency than an NVMe-over-RoCE connection over the same link.

These latency issues depend heavily on the implementation of the protocol and the type of data being transferred. Analysts expect latency to decrease as vendors devise new optimizations in their implementations.

It should be noted that the NVMe-over-TCP specification describes a software implementation that plugs into the TCP/IP stack of the server's operating system or of the array controller. That said, it is technically possible to implement NVMe-over-TCP in hardware accelerators.

How NVMe-over-TCP works

Technically, TCP defines how data is encapsulated in packets so that it can be transferred over Ethernet links between a server and the controller of a disk array.

The protocol then delivers these packets to the recipient, which reassembles the information. This mechanism ensures that all data is transmitted correctly.

TCP works in conjunction with IP, the Internet Protocol, which defines how packets are addressed and routed. As a result, NVMe-over-TCP is compatible with virtually all TCP/IP networks, including the Internet.

By convention, a TCP/IP network is divided into four layers: application, transport, network, and physical. The application layer corresponds to the NVMe commands, the transport layer to TCP, the network layer to IP, and the physical layer to Ethernet.

In practice, the server opens a connection to the array's controller by sending a request. When the controller responds, communication is established. The server then sends a PDU, a Protocol Data Unit, which defines a structure for the data transfer (packets of a given size, composed in a given format), a way to control the progress of that transfer (sequential numbering of the packets sent), and a way to report on its status (control flags in each packet). The controller in turn sends a PDU to tell the server how it will send data back.
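Every NVMe/TCP PDU begins with an 8-byte common header carrying its type, flags, and lengths (field layout per the NVMe/TCP transport specification). As a sketch, the Python snippet below packs such a header for an ICReq, the "Initialize Connection Request" PDU a host sends right after the TCP connection is established; the field values follow the specification, but this is an illustration, not a working initiator:

```python
import struct

def pack_common_header(pdu_type: int, flags: int, hlen: int, pdo: int, plen: int) -> bytes:
    """Pack the 8-byte NVMe/TCP common PDU header:
    PDU type (1 B), flags (1 B), header length (1 B),
    PDU data offset (1 B), total PDU length (4 B, little-endian)."""
    return struct.pack("<BBBBI", pdu_type, flags, hlen, pdo, plen)

# ICReq: PDU type 0x00, fixed 128-byte total length, no data section.
icreq_header = pack_common_header(pdu_type=0x00, flags=0, hlen=128, pdo=0, plen=128)
```

In a real exchange this header would be followed by the rest of the 128-byte ICReq body, and the controller would answer with an ICResp PDU (type 0x01) before any command or data PDUs flow.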

In NVMe-over-TCP communications, there are several types of PDUs, carrying information specific to the protocol, its message format, and its packet sequencing. In addition, each communication always has two data flows: one to send the data, the other to confirm that the exchange went well.

The main vendors: Lightbits Labs and Solarflare Communications
The first two vendors to implement NVMe-over-TCP in their solutions are Lightbits Labs and Solarflare Communications.

Lightbits Labs chose this protocol because it enables an elastic storage solution, in which capacity can be increased at will by adding disk shelves, without losing performance with each new block. Most importantly, these storage blocks can sit anywhere in the data center without impacting network traffic.

Solarflare’s solution offers the same benefits as Lightbits’, but the vendor produces its own accelerator card to offset the latency issues and works with SuperMicro to offer turnkey appliances.

Article originally published on TechTarget France, LeMagIT.
