[OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)

Doug Hughes doug at will.to
Tue Apr 16 16:50:37 UTC 2013


Some of these points are a bit dated, so allow me to make some updates. I'm sure you are aware that most 10-gig switches these days are cut-through, not store-and-forward. That's Arista, HP, Dell Force10, Mellanox, and IBM/Blade. Cisco has a mix of things, but they aren't really in the low-latency space. Port-to-port forwarding at 10G and 40G is measured in nanoseconds. Buffering is now mostly confined to carrier operations, and even there it is becoming less common because of the toll it takes on things like IP video and VoIP. Buffers are still good for web farms, and to a certain extent for storage servers or WAN links where there is a high degree of contention from disparate traffic.
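To put rough numbers on that difference (purely illustrative; the single full-size frame and the 64-byte header look-ahead are my assumptions, not a measurement of any particular switch):

/* Back-of-the-envelope comparison of store-and-forward vs cut-through.
 * A store-and-forward switch has to clock in the whole frame before it
 * can start sending it on; a cut-through switch only needs enough of
 * the header to pick an output port.  Figures are illustrative only. */
#include <stdio.h>

int main(void)
{
    const double line_rate_bps = 10e9;  /* 10 Gbit/s port */
    const double frame_bytes   = 1500;  /* full-size Ethernet frame (assumption) */
    const double header_bytes  = 64;    /* roughly what cut-through must see (assumption) */

    double store_fwd_ns = frame_bytes  * 8 / line_rate_bps * 1e9;
    double cut_thru_ns  = header_bytes * 8 / line_rate_bps * 1e9;

    printf("store-and-forward serialization per hop: %.0f ns\n", store_fwd_ns); /* ~1200 ns */
    printf("cut-through header time per hop:         %.0f ns\n", cut_thru_ns);  /* ~51 ns  */
    return 0;
}

A real cut-through switch adds its own fixed forwarding delay on top of that, but it is on the order of a few hundred nanoseconds, so the per-hop gap still dominates.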
  
At the physical level, the signalling of IB and Ethernet (10G+) is very similar, which is why Mellanox can make a single chip that does 10 Gbit and 40 Gbit Ethernet, plus QDR and FDR InfiniBand, on any port.
There are also a fair number of vendors that support RDMA in Ethernet NICs now, like SolarFlare with its OpenOnload technology.

The main driver of the lowest achievable latency is higher link speed: the wire-time component of latency is roughly the inverse of bandwidth. But the protocol layers you stack on top contribute much more than the hardware's theoretical minimums or maximums. TCP/IP is a killer in terms of added overhead; that's why there are protocols like iSER, SRP, and friends. RDMA avoids the kernel overhead of TCP session setup and the other host-side user/kernel crossings and buffering. PCI latency alone is higher than the port-to-port latency of a good 10G switch, never mind 40G Ethernet or FDR InfiniBand.
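Here is that inverse relationship as a quick worked calculation. It only models the time to serialize a message onto the wire; the protocol and kernel overhead discussed above sits on top of it and is not modelled, and the 4 KB message size is just an example.

/* Wire time scales as 1/bandwidth: the same message takes proportionally
 * less time to serialize onto a faster link.  Protocol and kernel
 * overhead sit on top of this and are not modelled here. */
#include <stdio.h>

int main(void)
{
    const double msg_bits = 4096 * 8;            /* 4 KB message (example size) */
    const double rates_gbps[] = { 10, 40, 56 };  /* 10GbE, 40GbE/QDR, FDR */

    for (int i = 0; i < 3; i++) {
        double us = msg_bits / (rates_gbps[i] * 1e9) * 1e6;
        printf("%2.0f Gbit/s -> %.2f us on the wire\n", rates_gbps[i], us);
    }
    return 0;
}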

There is even a dedicated layer on InfiniBand, called Verbs, that you can write custom protocols against to lower latency further.
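For a flavour of what that layer looks like, here is a minimal libibverbs sketch. It only enumerates the HCAs and prints a couple of the limits they report; it doesn't touch the data path, and which attribute fields get printed is an arbitrary choice on my part.

/* Minimal libibverbs example: list the IB devices on the host and a
 * couple of the limits they report.  No data path, no QPs; just the
 * entry points of the Verbs API.  Build with: gcc ibquery.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        if (!ctx)
            continue;

        struct ibv_device_attr attr;
        if (ibv_query_device(ctx, &attr) == 0)
            printf("%s: %d physical port(s), max_qp=%d\n",
                   ibv_get_device_name(list[i]),
                   attr.phys_port_cnt, attr.max_qp);

        ibv_close_device(ctx);
    }

    ibv_free_device_list(list);
    return 0;
}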

InfiniBand is inherently a layer 1 and layer 2 protocol, and the subnet manager (software) is responsible for setting up all virtual circuits (routes between hosts on the fabric) and rerouting when a path goes bad. Also, the link aggregation, as you mention, is rock solid and amazingly good. Auto-rerouting is fabulous and super fast. But you don't get layer 3. TCP over IB works out of the box, but adds a lot of overhead. Still, it does make it possible to run native IB and IP over IB, with gateways to a TCP network, over a single cable. That's pretty cool.


Sent from my android device.

-----Original Message-----
From: "Edward Ned Harvey (openindiana)" <openindiana at nedharvey.com>
To: Discussion list for OpenIndiana <openindiana-discuss at openindiana.org>
Sent: Tue, 16 Apr 2013 10:49 AM
Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)

> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
> 
> It would be difficult to believe that 10Gbit Ethernet offers better
> bandwidth than 56Gbit Infiniband (the current offering).  The switching
> model is quite similar.  The main reason why IB offers better latency
> is a better HBA hardware interface and a specialized stack.  5X is 5X.

Put another way, the reason InfiniBand has so much higher throughput and lower latency than Ethernet is that the switching (at the physical layer) is completely different from Ethernet, and messages are passed directly from user level to user level into remote system RAM via RDMA, bypassing the OSI layer model and other kernel overhead.  I read a paper from VMware where they implemented RDMA over Ethernet and doubled the speed of vMotion (but it was still not as fast as InfiniBand, by something like 4x).

Besides bypassing the OSI layers and kernel latency, IB latency is lower because Ethernet switches use store-and-forward buffering managed by the backplane in the switch: a sender sends a packet to a buffer on the switch, which pushes it through the backplane and finally into another buffer on the way to the destination.  IB uses crossbar, or cut-through, switching, in which the sending host channel adapter signals the destination address to the switch and then waits for the channel to be opened.  Once the channel is opened it stays open, and the switch in between does little more than signal amplification (plus additional virtual lanes for congestion management and other functions).  The sender writes directly to RAM on the destination via RDMA, with no buffering in between and no OSI layer model in the way, hence much lower latency.
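To make "the sender writes directly to RAM on the destination" concrete, here is a rough Verbs-level sketch of posting a one-sided RDMA write.  Everything around it (protection domain, memory registration, queue pair setup and connection, exchanging the remote address and rkey) is assumed to have been done already and is omitted; the helper name and parameters are mine, purely for illustration.

/* Sketch of posting a one-sided RDMA write with libibverbs.  The queue
 * pair is assumed to be connected already, the local buffer registered,
 * and the peer's virtual address and rkey exchanged out of band (all of
 * that setup is omitted).  Helper name and parameters are illustrative. */
#include <stdint.h>
#include <infiniband/verbs.h>

static int post_rdma_write(struct ibv_qp *qp,
                           void *local_buf, uint32_t len, uint32_t lkey,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* registered local buffer */
        .length = len,
        .lkey   = lkey,                  /* key from ibv_reg_mr() */
    };

    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE, /* write straight into remote memory */
        .send_flags = IBV_SEND_SIGNALED, /* generate a completion entry */
    };
    wr.wr.rdma.remote_addr = remote_addr; /* peer's buffer address */
    wr.wr.rdma.rkey        = rkey;        /* peer's memory region key */

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr); /* 0 on success */
}

In a real program you would then poll the completion queue with ibv_poll_cq() to learn when the write has finished; the remote CPU never gets involved.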

IB also has native link aggregation into data-striped lanes, hence the 1x, 4x, and 12x designations and the 40 Gbit (4x QDR) specification.  Something similar is quasi-possible in Ethernet via LACP, but it is not as good and not the same thing.  IB guarantees packets are delivered in order, with native congestion control, whereas Ethernet may drop packets and leave TCP to detect and retransmit...
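For reference, the headline IB numbers fall out of lane count times per-lane signalling rate, minus the line-coding overhead; the little calculation below uses the standard published figures (QDR at 10 Gbit/s per lane with 8b/10b coding, FDR at 14.0625 Gbit/s per lane with 64b/66b).

/* Where the headline IB numbers come from: lanes x per-lane signalling
 * rate, minus line-coding overhead.  QDR uses 8b/10b coding, FDR uses
 * 64b/66b; the per-lane rates are the standard published figures. */
#include <stdio.h>

int main(void)
{
    /* 4x QDR: 4 lanes x 10 Gbit/s signalling, 8b/10b coding */
    double qdr_signal = 4 * 10.0;
    double qdr_data   = qdr_signal * 8.0 / 10.0;

    /* 4x FDR: 4 lanes x 14.0625 Gbit/s signalling, 64b/66b coding */
    double fdr_signal = 4 * 14.0625;
    double fdr_data   = fdr_signal * 64.0 / 66.0;

    printf("4x QDR: %.2f Gbit/s signalling, %.2f Gbit/s data\n", qdr_signal, qdr_data);
    printf("4x FDR: %.2f Gbit/s signalling, %.2f Gbit/s data\n", fdr_signal, fdr_data);
    return 0;
}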

Ethernet includes a lot of support for IP addressing and for mixed link speeds (some 10 Gbit, some 10/100, some 1G, etc.), all of it asynchronous.  For these reasons, IB is not a suitable replacement for the IP communication done on Ethernet, with its heavily variable peer-to-peer and broadcast traffic.  IB is designed for networks where systems establish connections to other systems and those connections remain mostly static: primarily clustering and storage networks, not primarily TCP/IP.


_______________________________________________
OpenIndiana-discuss mailing list
OpenIndiana-discuss at openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss

