[OpenIndiana-discuss] ZFS read speed(iSCSI)

Jim Klimov jimklimov at cos.ru
Fri Jun 7 19:06:59 UTC 2013


Comment below

On 2013-06-07 20:42, Heinrich van Riel wrote:
> One sec apart cloning 150GB vm from a datastore on EMC to OI.
>
>    capacity        operations       bandwidth
> alloc  free      read  write      read  write
> -----  -----     -----  -----     -----  -----
> 309G 54.2T 81 48 452K 1.34M
> 309G 54.2T 0 8.17K 0 258M
> 310G 54.2T 0 16.3K 0 510M
> 310G 54.2T 0 0 0 0
> 310G 54.2T 0 0 0 0
> 310G 54.2T 0 0 0 0
> 310G 54.2T 0 10.1K 0 320M
> 311G 54.2T 0 26.1K 0 820M
> 311G 54.2T 0 0 0 0
> 311G 54.2T 0 0 0 0
> 311G 54.2T 0 0 0 0
> 311G 54.2T 0 10.6K 0 333M
> 313G 54.2T 0 27.4K 0 860M
> 313G 54.2T 0 0 0 0
> 313G 54.2T 0 0 0 0
> 313G 54.2T 0 0 0 0
> 313G 54.2T 0 9.69K 0 305M
> 314G 54.2T 0 10.8K 0 337M
...
Were it not for your complaints about link resets and "unusable"
connections, I'd say this looks like normal behavior for async
writes: they get cached up in RAM, and every 5 seconds a transaction
group (TXG) sync flushes the cached writes out to the disks.

In fact, the picture still looks like that, and the TXG syncs may
well be the reason for your hiccups.
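
If you want to check how long each sync actually takes and whether
your stalls line up with it, a DTrace one-liner along these lines
should do (just a sketch, assuming the fbt provider is available;
spa_sync() is the routine that writes out a TXG on illumos). Let it
run over a few of the write bursts and press Ctrl-C to see the
distribution:

  # dtrace -n '
      fbt::spa_sync:entry  { self->ts = timestamp; }
      fbt::spa_sync:return /self->ts/ {
          /* histogram of TXG sync durations, in milliseconds */
          @["spa_sync duration (ms)"] =
              quantize((timestamp - self->ts) / 1000000);
          self->ts = 0;
      }'

If those durations get anywhere near your HBA/NIC driver timeouts
during the bursts, that would back up the preemption theory.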

The TXG sync can be an IO-intensive process which may block or
delay many other system tasks. Previously, when the interval
defaulted to 30 sec, we got unusable SSH connections and temporarily
"hung" disk requests on the storage server every half a minute
whenever it was really busy (e.g. during the initial fill with data
from older boxes). It would cache up about 10 seconds' worth of
writes, then spew them all out and could do nothing else in the
meantime. I don't think I ever saw network connections timing out or
NICs reporting resets due to this, but I wouldn't be surprised if it
were the cause in your case (i.e. disk IO threads preempting the
HBA/NIC threads for so long that the driver becomes confused about
the staleness state of its card).

At the very least, TXG syncs can be tuned with two knobs: the time
limit (5 sec by default) and the size limit (when the write cache is
"this full", begin syncing to disk). The latter is the more practical
figure: it lets you sync in shorter bursts, with fewer interruptions
to smooth IO and other processing.

A somewhat related tunable is the number of requests that ZFS will
queue up per disk. Depending on the device's NCQ/TCQ support and its
random-IO capabilities (HDD vs. SSD), longer or shorter queues may be
preferable. See also:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29
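
Before touching that knob it may help to see how deep the per-disk
queues actually get on your box; the stock iostat already shows this
(an illustrative invocation, adjust the interval to taste):

  # iostat -xn 1
      # actv   - commands currently active on the device
      # asvc_t - average service time of active commands, in ms
      # %b     - percent of time the device was busy

If actv hovers near the current zfs_vdev_max_pending while asvc_t
climbs into hundreds of milliseconds on rotating disks, a shorter
queue is probably the right direction.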

These tunables can be changed at runtime with "mdb -kw", as well as
set in the /etc/system file to survive reboots. One of our storage
boxes has these example values in /etc/system:

*# default: flush the TXG every 5 sec (maximum is 30 sec; this
*# optimizes for about 5 sec worth of writing)
set zfs:zfs_txg_synctime = 5

*# Spool to disk when the ZFS write cache is 0x18000000 (384 MB) full
set zfs:zfs_write_limit_override = 0x18000000
*# ...for realtime changes use mdb.
*# Example sets 0x18000000 (384 MB = 402653184 bytes):
*# echo zfs_write_limit_override/W0t402653184 | mdb -kw

*# ZFS queue depth per disk
set zfs:zfs_vdev_max_pending = 3
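
For reference, the current in-kernel values can be read back before
and after tuning (a sketch; these variable names match this vintage
of illumos/OI, but some have been renamed in later builds, e.g.
zfs_txg_synctime vs. zfs_txg_synctime_ms, so verify the symbols on
your system first):

  echo zfs_txg_synctime/D | mdb -k
  echo zfs_write_limit_override/J | mdb -k
  echo zfs_vdev_max_pending/D | mdb -k

  # a live change of the queue depth, analogous to the
  # write-limit example above:
  echo zfs_vdev_max_pending/W0t3 | mdb -kw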

HTH,
//Jim Klimov


