[OpenIndiana-discuss] Sudden ZFS performance issue

Saso Kiselkov skiselkov.ml at gmail.com
Fri Jul 5 18:33:44 UTC 2013


On 05/07/2013 19:09, Irek Szczesniak wrote:
> On Fri, Jul 5, 2013 at 8:00 PM, Saso Kiselkov <skiselkov.ml at gmail.com> wrote:
>> On 05/07/2013 17:08, wim at vandenberge.us wrote:
>>> Good morning,
>>>
>>> I have a weird problem with two of the 15+ OpenSolaris storage servers in our
>>> environment. All the Nearline servers are essentially the same: Supermicro
>>> X9DR3-F based servers, dual E5-2609s, 64GB memory, dual 10Gb SFP+ NICs, LSI
>>> 9200-8e HBA, Supermicro CSE-826E26-R1200LPB storage arrays and Seagate
>>> enterprise 2TB SATA or SAS drives (not mixed within a server). Root, L2ARC and
>>> ZIL are all on Intel SSDs (SLC series 313 for ZIL, MLC 520 for L2ARC and MLC
>>> 330 for boot).
>>>
>>> The volumes are built out of 9-drive Z1 groups, and ashift is set to 9 (which is
>>> supposed to be appropriate for the enterprise Seagates). The pools are large
>>> (120-130TB) but are only between 27 and 32% full. Each server serves an iSCSI
>>> (Comstar) and a CIFS (in-kernel server) volume from the same pool. I realize this
>>> is not optimal from a recovery/resilver/rebuild standpoint, but the servers are
>>> replicated and the data is easily rebuildable.
>>>
>>> Initially these servers did great for several months; while certainly no speed
>>> demons, 300+ MB/sec for sequential reads/writes was not a problem. Several weeks
>>> ago, literally overnight, replication times went through the roof for one
>>> server. Simple testing showed that reading from the pool would no longer go over
>>> 25MB/s. Even a scrub that used to run at 400+ MB/sec is now crawling along at
>>> below 40MB/s.
>>>
>>> Sometime yesterday the second server started to exhibit the exact same
>>> behaviour. This one is used even less (it's our D2D2T server) and data is
>>> written to it at night and read during the day to be written to tape.
>>>
>>> I've exhausted all I know and I'm at a loss. Does anyone have any ideas of what
>>> to look at, or do any obvious reasons for this behaviour jump out from the
>>> configuration above?
>>
>> Is iostat -Exn reporting any transport errors? Smells like a drive
>> that has gone bad and is forcing retries, which would cause roughly a
>> 10x decrease in performance. Just a guess, though.
> 
> Why should a retry cause a 10x decrease in performance? A proper
> design would surely do retries in parallel with other operations
> (Reiser4 and btrfs do it), up to a certain number of
> failures in flight.

Going off on a hypothetical tangent (since I don't know whether this is
really the case here), transport errors *do* cause significant
performance degradation on the affected drives, and there is relatively
little any higher-level application can do about it. Filesystems never
talk to drives directly; instead, they go through a driver subsystem
and an interface bus (such as SAS). That layer can transparently
re-issue failed commands if the condition is recoverable (e.g. a bad
CRC in a SAS command response). Because of queueing, pipelining and
caching, command retries introduce a lot of latency higher up the
stack. Higher layers in the OS (e.g. the filesystem) never see the
retries themselves; all they see is a particular read() or write() call
taking a long time to complete. These effects aren't confined to disks:
they affect every system that relies heavily on operation pipelining
and is therefore exposed to pipeline stalls.
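
If you want to actually see those latency bubbles, a rough DTrace
sketch along these lines (using the generic illumos "io" provider;
treat it as a starting point rather than a polished script) prints a
latency histogram per device, and a single retrying drive tends to
stand out with a long tail:

  dtrace -n '
    /* timestamp each I/O as it is handed to the device */
    io:::start { ts[arg0] = timestamp; }

    /* on completion, aggregate latency (in ns) per device name */
    io:::done /ts[arg0]/ {
      @[args[1]->dev_statname] = quantize(timestamp - ts[arg0]);
      ts[arg0] = 0;
    }'

Run it for a minute while the pool is slow and compare the
distributions across the members of a single raidz group.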

iostat talks to the driver layer to interrogate the low-level transport
infrastructure. If a particular device takes a long time to respond to
commands but still completes them eventually, that in itself isn't
reason enough to take it offline automatically. The transport failures
can be transient (a partially inserted cable, interference) or can
affect many other devices on a shared bus (e.g. a failing SAS
expander). Taking a whole portion of the device tree offline too early
could have catastrophic consequences for service availability.

So, as I hope you can see, the situation isn't so cut and dried.
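
That said, to find out whether this is actually what's happening on
the affected servers, the usual first stops on illumos (a hedged
checklist rather than a definitive recipe) are the per-device error
counters, the per-device service times and the FMA telemetry:

  # cumulative soft/hard/transport error counters per device
  iostat -En | grep -i errors

  # live per-device service times; a sick drive typically shows asvc_t
  # and %b far above its peers in the same vdev
  iostat -xn 10

  # recent FMA error reports, and anything already diagnosed as faulty
  fmdump -e | tail -20
  fmadm faulty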

Cheers,
-- 
Saso


