[Illumos-team] IO Stalls

Grant Albitz Grant at schultztechnology.com
Fri Mar 8 17:05:01 UTC 2013


Garrett,

I will look into your suggestions. In the meantime I tried all of the same hardware on Windows 2012 running StarWind. I have not seen any symptoms so far, though it has only been a few hours (performance is actually better as well). I may revisit illumos in a few days, since the free version of StarWind has a few limitations. I am currently running the server with hardware RAID under Windows 2012. I do not know whether that somehow masks my problems, but at this point I think I can conclude with reasonable certainty that it is not a hardware issue. That makes me wonder whether the backplane/expander issues are specific to illumos/Solaris. Under Windows I am able to run Dell RAID drivers that are likely written specifically to work with the Dell backplane, or at least to be aware of it. Dell has "certified drives", but I am not using them; they wanted $6500 for one 400 GB SSD. I really wanted to use illumos, but reliability is a major factor. It took 12 tries to storage vMotion my last guest off the system yesterday because the IO would freeze and crash the server that frequently.



Grant Albitz | Vice President of Information Technology
MCITP: Enterprise Administrator | MCITP: Server Administrator
Schultz Technology Solutions, LLC | 3117 West Ridge Pike
Pottstown, PA 19464 | Office 610-495-6204 | Fax 610-495-6205
grant at schultztechnology.com | www.schultztechnology.com




From: Garrett D'Amore [mailto:garrett.damore at gmail.com] On Behalf Of Garrett D'Amore
Sent: Friday, March 8, 2013 11:37 AM
To: Grant Albitz
Cc: illumos-team at openindiana.org
Subject: Re: [Illumos-team] IO Stalls


On Mar 7, 2013, at 7:28 AM, Grant Albitz <Grant at schultztechnology.com> wrote:



My symptoms are similar to https://www.illumos.org/issues/1069, except that I do not feel the issue is caused by a single faulty drive. That bug appears to be very old, with no resolution.



I have been chasing an issue with my OpenIndiana host for some time. It is stable for a few weeks, but then I find it has rebooted with no kernel errors.

I am using it as an iSCSI target for a VMware environment. Today it failed repeatedly while I was trying to perform a storage vMotion. Since I was able to look at the issue while it was occurring, I made a few discoveries:

iostat showed a device at 100 %b. During this time no IO was being performed on any of the other disks (it appears the server stopped all IO while waiting for this device). The first time this occurred it was disk 15. I went down to the server, pulled drive 15, and re-inserted it. All IO resumed, including on disk 15. No resilver took place and there was no data loss. I have since witnessed this on many other drives, so although a bad drive is an easy answer, I feel it is bigger than that. If I do not pull the drive, the server eventually reboots on its own after about 5 minutes of no disk activity. It usually then reboots to the PERC H310 BIOS screen, where it hangs reporting that no disks were found. It suggests a cold start, and that does resolve the issue. I have seen similar behavior with a PERC H710, so I do not believe it is the card itself. The symptom does not crop up under normal IO, but intense IO such as a vMotion triggers it.
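The symptom described above (one device pinned at 100 %b while every other disk shows no IO) can be checked for mechanically. The following is a minimal Python sketch, not a tested monitoring tool. It assumes the input is one sample of `iostat -xn` style output with the column order `r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device`; the function names, the threshold parameter, and the device names in the usage note are hypothetical.

```python
def parse_iostat_xn(lines):
    """Parse device rows from one sample of `iostat -xn` style output.

    Assumed column order (an assumption, check your iostat version):
    r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
    """
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 11:
            continue  # banner lines and blank lines
        try:
            nums = [float(f) for f in fields[:10]]
        except ValueError:
            continue  # the column header line is not numeric
        rows.append({
            "device": fields[10],
            "reads_per_s": nums[0],
            "writes_per_s": nums[1],
            "busy_pct": nums[9],
        })
    return rows


def find_stall(rows, busy_threshold=100.0):
    """Return the devices at or above busy_threshold, but only when every
    other device shows zero reads and writes; otherwise return []."""
    stuck = [r["device"] for r in rows if r["busy_pct"] >= busy_threshold]
    others = [r for r in rows if r["busy_pct"] < busy_threshold]
    idle = all(r["reads_per_s"] == 0 and r["writes_per_s"] == 0 for r in others)
    return stuck if stuck and idle else []
```

Feeding it a sample in which a hypothetical device `c1t15d0` is at 100 %b and the remaining devices show 0.0 for r/s and w/s would return `["c1t15d0"]`; a sample where other disks are still doing IO returns an empty list.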

It is a Dell 720xd server with 24 drives in the front and 2 in the back. The Dell system reports the following BEFORE I physically pull the drive:

Log Sequence Number: 1748
Detailed Description:
The physical device was reset. This is a normal part of operations and is not a cause for concern.

I have replaced the backplane in the server and the problem is still happening. It does not always happen to disk 15; just today I have seen it on disks 5, 11, 15, 23, and 25. Disk 25 is on a different backplane than 0-24.

Since I have swapped the backplane, I am down to two possible causes.

1. It may just be the firmware of the SSDs (Samsung 840 Pro). They are not "approved" by Dell, so there may be a compatibility issue; the same could be said for the backplane firmware.

2. The real reason I am sending this: is there any OS configuration related to the drives that could cause this? I am not sure under what circumstances the drive is reset, but I believe the OS could be doing it because of the 100 %b. Physically reseating the drive does resolve the issue. If the problem goes unnoticed, the system eventually restarts itself abruptly. The drives are presented to the PERC H310 as JBOD, so I am not sure what special "Dell-specific" instructions it would try to issue that would make non-Dell-certified drives the real cause. If I can find a way to power the drives in the chassis without the backplane, I may try that. I may also try to get a few Dell SSDs for testing.



SATA drives have a certain pathological behavior when they generate errors.  I'm not familiar with the backplane topology in the 720, but if there is an expander, errors followed by SCSI resets can cause a cascade failure scenario.

It may help to set sd_io_time in /etc/system to a lower value (e.g. "set sd:sd_io_time = 10").  It defaults to 60, which means that "sd" takes up to a minute to time out an IO.   This may actually make it worse in some respects as well, because of the following:

That said, SSDs on fast busses should never see these I/O timeouts; if they occur it's a sign of a firmware or hardware bug.  But errors on SCSI busses that cause bus-wide resets can account for this sort of a situation.  You can monitor for calls to scsi_reset() using DTrace to check for this.  In particular, we know that if the 2nd argument to scsi_reset is RESET_ALL, i.e. 0, then the behavior is devastating to all other devices on the shared bus.  For RESET_TARGET = 1 and RESET_LUN = 3, whether these are tragic really depends on the HBA in question -- some HBAs translate these to bus-wide resets as well.
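As a rough illustration of how the reset levels above might be interpreted once collected, here is a minimal Python sketch. It assumes the second argument of each scsi_reset() call has already been captured as a list of integers (for example from DTrace output; the capture step itself is not shown). The function names are hypothetical, and the level values are only the ones given above.

```python
from collections import Counter

# Level values as stated in the text above. Any other value is reported
# as unknown rather than guessed at.
RESET_LEVELS = {0: "RESET_ALL", 1: "RESET_TARGET", 3: "RESET_LUN"}


def summarize_resets(levels):
    """Count observed scsi_reset() level arguments by symbolic name."""
    names = (RESET_LEVELS.get(lvl, "unknown(%d)" % lvl) for lvl in levels)
    return dict(Counter(names))


def saw_reset_all(levels):
    """True if any captured call requested RESET_ALL (level 0)."""
    return any(lvl == 0 for lvl in levels)
```

A False result from `saw_reset_all` does not rule out bus-wide resets, since some HBAs may translate the narrower levels into bus-wide resets as well.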

Lowering sd_io_time will shorten the time it takes (in theory, if the HBA honors it) to discover that an IO is stuck.  It *may* also increase the likelihood of hitting the pathology of cascading reset/errors.  But I *think* that it will probably still help by reducing the time between retry attempts, improving the likelihood of "muddling on" in a shorter time.

If you have the ability to ensure that you have a configuration where each SATA drive is cabled directly to a SATA initiator port (without an expander in the way) then it will likely have an immediately positive impact for you, particularly if you have flaky devices/firmware that are causing the timeouts to occur.  It will avoid the entire pathology that I'm describing above.

Note that as far as we have seen, pure SAS configurations seem free from this pathology, as well as SATA configurations that do not use expanders.  (SATA drives are great for single chassis direct-to-SATA-port-on-motherboard or HBA configurations.)

            - Garrett