<html><head><meta http-equiv="Content-Type" content="text/html charset=windows-1252"><base href="x-msg://416/"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><br><div><div>On Mar 7, 2013, at 7:28 AM, Grant Albitz <<a href="mailto:Grant@schultztechnology.com">Grant@schultztechnology.com</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div lang="EN-US" link="#0563C1" vlink="#954F72" style="font-family: Helvetica; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: -webkit-auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; "><div class="WordSection1" style="page: WordSection1; "><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; "><o:p> </o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">My symptoms are similar to<span class="Apple-converted-space"> </span><a href="https://www.illumos.org/issues/1069" style="color: rgb(149, 79, 114); text-decoration: underline; ">https://www.illumos.org/issues/1069</a><span class="Apple-converted-space"> </span>except i do not feel the issue is caused by a single faulty drive. 
It appears that bug is very old with no resolution.<o:p></o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; "><o:p> </o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; "><span style="color: rgb(31, 73, 125); "> </span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; "><o:p> </o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; ">I have been chasing an issue with my openindiana host for some time. It is stable for a few weeks but then I find it rebooted with no kernel errors.<o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; "> <o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; ">I am using it as an iscsi target for a vmware environment. Today it failed repeatedly when I was trying to perform a storage vmotion. Since I was able to look at the issue when it was occurring I did make a few discoveries. I found the following:<o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; "> <o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; ">iostat showed a device with 100 %b. During this time no io was being performed on any of the other disks (the server stopped all IO waiting for this device it appears). 
The first time this occurred it was disk 15. I went down to the server and pulled drive 15 and re-inserted it. All IO resumed, including disk 15. No resilver takes place and there is no data loss. I have since witnessed this on many other drives so although a bad drive is an easy answer I feel it’s bigger than that. If I do not pull the drives the server eventually reboots on its own after about 5 minutes of no disk activity. It usually then reboots to the perc h310 bios screen where it hangs reporting that no disks were found. It suggests a cold start, and that does resolve the issue. I have seen similar behavior with a perc h710 so I do not believe it is the card itself. The symptom does not crop up under normal io, but intense io such as a vMotion causes this.<o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; "> <o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; ">It is a dell server 720xd with 24 drives in the front and 2 in the back. The dell system reports the following BEFORE I physically pull the drive:<o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; "> <o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; ">Log Sequence Number: 1748<o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; ">Detailed 
Description:<o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; ">The physical device was reset. This is a normal part of operations and is not a cause for concern.<o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; "> <o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; ">I have replaced the backplane in the server and the problem is still happening. It does not always happen to disk 15; I have seen it happen on disks 5, 11, 15, 23, and 25 just today. Disk 25 is on a different backplane than disks 0-24.<o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; "> <o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; ">Since I have swapped the backplane I am down to two different issues.<o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; "> <o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; position: static; z-index: auto; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; ">1. It may just be the firmware of the ssd (samsung 840 pro). 
They are not "approved" by dell so there may be a compatibility issue; the same could be said for the backplane firmware.<o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; "> <o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; background-color: white; position: static; z-index: auto; "><span style="font-size: 10pt; font-family: Tahoma, sans-serif; ">2. The real reason I am sending this: I am wondering, is there any OS config related to the drive that could cause this? I am not sure under what circumstance the drive is reset, but I believe the OS could be doing it due to the 100 %b. Physically reseating the drive does resolve the issue. If the problem goes unnoticed the system eventually restarts itself abruptly. The drives are presented to the perc h310 as jbod so I am not sure what special “dell specific” instructions it would try to issue that would cause non dell certified drives to really be the cause. If I can find a way to power the drives in the chassis without the backplane I may try that. I may also try and get a few dell ssds for testing.<o:p></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; "><o:p> </o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; "><o:p> </o:p></div></div></div></blockquote><div><br></div>SATA drives have a certain pathological behavior when they generate errors. I'm not familiar with the backplane topology in the 720, but if there is an expander, errors followed by SCSI resets can cause a cascade failure scenario.</div><div><br></div><div>It may help to set sd_io_time in /etc/system to a lower value (e.g. "set sd:sd_io_time = 10"). It defaults to 60, which means that "sd" takes up to a minute to time out an I/O. 
This may actually make things worse in some respects as well, for the reasons given below.</div><div><br></div><div>That said, SSDs on fast busses should never see these I/O timeouts; if they occur, it's a sign of a firmware or hardware bug. But errors on SCSI busses that cause bus-wide resets can account for this sort of situation. You can monitor for calls to scsi_reset() using DTrace to check for this. In particular, we know that if the second argument to scsi_reset() is RESET_ALL, i.e. 0, then the behavior is devastating to all other devices on the shared bus. For RESET_TARGET = 1 and RESET_LUN = 3, whether these are tragic really depends on the HBA in question -- some HBAs translate these to bus-wide resets as well.</div><div><br></div><div>Lowering sd_io_time will shorten the time it takes (in theory, if the HBA honors it) to discover that an I/O is stuck. It *may* also increase the likelihood of hitting the pathology of cascading resets and errors. But I *think* that it will probably still help by reducing the time between retry attempts, improving the likelihood of "muddling on" in a shorter time.</div><div><br></div><div>If you can arrange a configuration where each SATA drive is cabled directly to a SATA initiator port (without an expander in the way), then it will likely have an immediate positive impact for you, particularly if you have flaky devices/firmware that are causing the timeouts to occur. It will avoid the entire pathology that I'm describing above.</div><div><br></div><div>Note that as far as we have seen, pure SAS configurations seem free from this pathology, as do SATA configurations that do not use expanders. (SATA drives are great for single-chassis configurations that connect directly to a SATA port on the motherboard or an HBA.)</div><div><br></div><div><span class="Apple-tab-span" style="white-space:pre">        </span>- Garrett</div></body></html>