[OpenIndiana-discuss] Sudden ZFS performance issue

wim at vandenberge.us
Fri Jul 5 16:08:44 UTC 2013


Good morning,

I have a weird problem with two of the 15+ OpenSolaris storage servers in our
environment. All the Nearline servers are essentially the same: Supermicro
X9DR3-F based servers, dual E5-2609s, 64GB memory, dual 10Gb SFP+ NICs, an LSI
9200-8e HBA, Supermicro CSE-826E26-R1200LPB storage arrays, and Seagate
enterprise 2TB SATA or SAS drives (not mixed within a server). Root, L2ARC and
ZIL are all on Intel SSDs (SLC series 313 for ZIL, MLC 520 for L2ARC and MLC
330 for boot).

The volumes are built out of 9-drive RAID-Z1 groups, and ashift is set to 9
(which is supposed to be appropriate for these enterprise Seagates). The pools
are large (120-130TB) but only 27-32% full. Each server serves an iSCSI
(COMSTAR) volume and a CIFS (in-kernel server) share off the same pool. I
realize this is not optimal from a recovery/resilver/rebuild standpoint, but
the servers are replicated and the data is easily rebuildable.
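
(If it helps anyone, the ashift can be double-checked straight from the
cached pool configuration; a minimal sketch, with "tank" standing in for
the actual pool name:

    # dump the cached pool configuration and pull out the per-vdev ashift
    zdb -C tank | grep ashift

ashift: 9 there means ZFS is issuing 512-byte-aligned I/O to these drives.)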

Initially these servers did great for several months; while certainly no speed
demons, 300+ MB/s for sequential reads/writes was not a problem. Several weeks
ago, literally overnight, replication times went through the roof for one
server. Simple testing showed that reads from the pool would no longer go over
25MB/s. Even a scrub that used to run at 400+ MB/s is now crawling along at
below 40MB/s.
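
For anyone who wants to suggest specific checks: since a raidz group runs at
roughly the speed of its slowest member, one obvious suspect is a single
misbehaving drive. This is the kind of monitoring I can run (again, "tank" is
a placeholder for the real pool name):

    # per-vdev bandwidth and IOPS every 5 seconds;
    # one disk reading far slower than its peers stands out here
    zpool iostat -v tank 5

    # per-device service times; look for one drive whose asvc_t
    # sits far above the others
    iostat -xn 5

    # any device error reports FMA has logged
    fmdump -eV | less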

Sometime yesterday the second server started to exhibit the exact same
behaviour. This one is used even less (it's our D2D2T server): data is written
to it at night and read back during the day to be written to tape.

I've exhausted all I know and I'm at a loss. Does anyone have ideas on what to
look at, or do any obvious causes for this behaviour jump out from the
configuration above?

Thanks

W
