[OpenIndiana-discuss] FW: IO Stalls

Grant Albitz Grant at schultztechnology.com
Thu Mar 7 15:41:32 UTC 2013


Sorry, I have sent this message to a few different lists, as I wasn't sure which was the most relevant.

My symptoms are similar to https://www.illumos.org/issues/1069 except I do not feel the issue is caused by a single faulty drive. That bug appears to be very old, with no resolution. I am using an updated version of mr_sas, provided by Dan McDonald, that adds support for the Dell H310: https://www.illumos.org/issues/3500#change-9324




I have been chasing an issue with my OpenIndiana host for some time. It is stable for a few weeks, but then I find it has rebooted with no kernel errors logged.

I am using it as an iSCSI target for a VMware environment. Today it failed repeatedly while I was trying to perform a Storage vMotion. Since I was able to look at the issue while it was occurring, I made a few discoveries:

iostat showed a device at 100 %b. During this time no IO was being performed on any of the other disks (it appears the server stopped all IO while waiting for this device). The first time this occurred it was disk 15. I went down to the server, pulled drive 15, and re-inserted it. All IO resumed, including on disk 15. No resilver took place and there was no data loss. I have since witnessed this on many other drives, so although a bad drive is an easy answer, I feel it's bigger than that. If I do not pull the drive, the server eventually reboots on its own after about 5 minutes of no disk activity. It usually then reboots to the PERC H310 BIOS screen, where it hangs reporting that no disks were found. It suggests a cold start, and that does resolve the issue. I have seen similar behavior with a PERC H710, so I do not believe it is the card itself. The symptom does not crop up under normal IO, but intense IO such as a vMotion triggers it.
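
For anyone who wants to catch the stalled device without watching the console, here is a minimal sketch of a filter that flags devices at 100 %b. It assumes output in the column layout of `iostat -xn` (a header line containing `%b` and `device`), and it reads the column positions from that header instead of hard-coding them. The function name, the sample text, and the device names in it are made up for illustration and are not taken from the affected server.

```python
# Hypothetical sample in the layout assumed for `iostat -xn`; values are invented.
SAMPLE = """\
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t14d0
    0.0    0.0    0.0    0.0  0.0  1.0    0.0    0.0   0 100 c1t15d0
"""


def find_saturated_devices(iostat_text, threshold=100.0):
    """Return (device, busy_percent) pairs whose %b is at or above threshold."""
    busy_idx = None   # column index of %b, learned from the header line
    dev_idx = None    # column index of the device name
    expected = 0      # number of columns in a data row
    hits = []
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The header line repeats in interval mode, so re-read it every time.
        if "%b" in fields and "device" in fields:
            busy_idx = fields.index("%b")
            dev_idx = fields.index("device")
            expected = len(fields)
            continue
        # Skip banner lines and anything seen before the first header.
        if busy_idx is None or len(fields) != expected:
            continue
        try:
            busy = float(fields[busy_idx])
        except ValueError:
            continue
        if busy >= threshold:
            hits.append((fields[dev_idx], busy))
    return hits
```

Calling `find_saturated_devices(SAMPLE)` picks out only the `c1t15d0` row. Lowering `threshold` (for example to 95) would also catch a device that is nearly, but not fully, saturated.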

The server is a Dell 720xd with 24 drives in the front and 2 in the back. The Dell system reports the following BEFORE I physically pull the drive:

Log Sequence Number: 1748
Detailed Description:
The physical device was reset. This is a normal part of operations and is not a cause for concern.

I have replaced the backplane in the server and the problem is still happening. It does not always happen to disk 15; I have seen it happen on disks 5, 11, 15, 23, and 25 just today. Disk 25 is on a different backplane than 0-24.

Since I have swapped the backplane, I am down to two possibilities.

1. It may just be the firmware of the SSDs (Samsung 840 Pro). They are not "approved" by Dell, so there may be a compatibility issue. The same could be said for the backplane firmware.

2. The real reason I am sending this: is there any OS configuration related to the drives that could cause this? I am not sure under what circumstances the drive is reset, but I believe the OS could be doing it due to the 100 %b. Physically reseating the drive does resolve the issue. If the problem goes unnoticed, the system eventually restarts itself abruptly. The drives are presented to the PERC H310 as JBOD, so I am not sure what special "Dell-specific" instructions it would try to issue that would make non-Dell-certified drives the real cause. If I can find a way to power the drives in the chassis without the backplane, I may try that. I may also try to get a few Dell SSDs for testing.




