[Bugs-team] OI HBA bug or hardware bug?
Þór Sigurðsson
thor at belgingur.is
Sun Oct 21 13:35:45 UTC 2012
Hi all,
We have spent the past 8 months debugging a problem with one of our setups, which we haven't been able to use in all that time because of very odd HBA problems. We can't determine whether this is a hardware or a software bug, so I'd like to throw the ball to anyone who wishes to give us some constructive input.
Here's the situation:
One SuperMicro Server (SuperServer 5016i-URF)
One Dual-Port Host Controller LSI SAS 9200-8e
One 45-drive SuperMicro JBOD storage unit SC847J
This setup has been extensively tested with:
36 2TB Seagate "consumer level" hard drives
and
16 500GB Western Digital "RED" Enterprise level hard drives
(not together, only as separately tested units)
The history starts with the purchase of the three units and the 36 consumer disks. These were installed in our cabinet, and OpenIndiana was installed on the server (an early version - we started with oi148).
The 36 disks made up one pool of raidz2 that we used without problems for about 6 months.
Then we went through some cleanup in our computer room, so I moved the unit from one cabinet to another. When the unit came back up, we started to see errors on the disks. Not immediately, but progressively. Within a couple of days the zpool was degraded, and within about a week it faulted.
We immediately started moving data off the box. With all access denied except for one local instance, it took almost 2 weeks to copy the 33.5T of data over to another set of boxes (old Promise boxes on Linux).
Since then, we:
Replaced the Host Controller
Replaced the Host
Replaced the 8087 cable (twice) - small change in disturbance each time
Identified the problem as being on the front backplane and replaced that.
Saw the problem move to the rear backplane, so we replaced that as well.
Saw the problem resurface on the rear backplane.
Had the JBOD completely replaced - and saw the same problem resurface with the new JBOD
When we look at the zpool, we may or may not see errors (with zpool status -v poolname), but when we look at the disks (with iostat -En | grep ^c2 ) we see a different picture, like this:
root@uni:/bigvol# iostat -En | grep ^c6
c6t5000C5003FE63911d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t5000C5003FE93E52d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t5000C5003FA4CE93d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t5000C5003FE670DAd0 Soft Errors: 0 Hard Errors: 39 Transport Errors: 149
c6t5000C5003FE6672Fd0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t5000C5003FE72690d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t5000C5003FEAF5F0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t5000C5003FE7D200d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t5000C5003FE7C230d0 Soft Errors: 0 Hard Errors: 8 Transport Errors: 12
c6t5000C5003FD2F2B1d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t5000C5003F9D5891d0 Soft Errors: 0 Hard Errors: 13 Transport Errors: 25
c6t5000C5003FE77AC6d0 Soft Errors: 0 Hard Errors: 35 Transport Errors: 136
c6t5000C5003FE7CB48d0 Soft Errors: 0 Hard Errors: 25 Transport Errors: 84
c6t5000C5003FEAA349d0 Soft Errors: 0 Hard Errors: 20 Transport Errors: 81
c6t5000C5003FE7C8DAd0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t5000C5003F5FDAAAd0 Soft Errors: 0 Hard Errors: 11 Transport Errors: 15
c6t5000C5003FB7EFCEd0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t5000C5003FEDB811d0 Soft Errors: 0 Hard Errors: 33 Transport Errors: 104
c6t5000C5003FE80692d0 Soft Errors: 0 Hard Errors: 36 Transport Errors: 144
c6t5000C5003F954492d0 Soft Errors: 0 Hard Errors: 12 Transport Errors: 29
c6t5000C5003FEDBBA5d0 Soft Errors: 0 Hard Errors: 26 Transport Errors: 94
c6t5000C5003FEDC3B6d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t5000C5003FEDC336d0 Soft Errors: 0 Hard Errors: 13 Transport Errors: 25
c6t5000C5003FE77AB7d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t5000C5003FE77C67d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t5000C5003FECA938d0 Soft Errors: 0 Hard Errors: 40 Transport Errors: 141
c6t5000C5003FEDC339d0 Soft Errors: 0 Hard Errors: 17 Transport Errors: 30
c6t5000C5003FBB0729d0 Soft Errors: 0 Hard Errors: 17 Transport Errors: 86
c6t5000C5003FE7D39Ad0 Soft Errors: 0 Hard Errors: 35 Transport Errors: 127
c6t5000C5003FE9E1BCd0 Soft Errors: 0 Hard Errors: 29 Transport Errors: 140
c6t5000C5003F95936Fd0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t5000C5003FE63F3Fd0 Soft Errors: 0 Hard Errors: 7 Transport Errors: 17
c6t5000C5003F8C8B43d0 Soft Errors: 0 Hard Errors: 34 Transport Errors: 108
c6t5000C5003FEDB058d0 Soft Errors: 0 Hard Errors: 41 Transport Errors: 154
These errors appeared after doing nothing more than creating the zpool.
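For anyone who wants to compare counters between runs (for example before and after swapping a cable or backplane), here is a minimal sketch that tallies a listing like the one above. It assumes the input consists of the one-line-per-device summaries exactly as quoted; the function names and the script name in the usage note are made up for illustration and are not part of any OpenIndiana tool.

```python
import re
import sys

# Matches the per-device summary line as quoted in the listing above, e.g.
# "c6t5000C5003FE670DAd0 Soft Errors: 0 Hard Errors: 39 Transport Errors: 149"
LINE_RE = re.compile(
    r"^(?P<dev>\S+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+Transport Errors:\s*(?P<transport>\d+)"
)


def parse_iostat_errors(text):
    """Return a list of (device, soft, hard, transport) tuples.

    Lines that do not look like a device summary are silently skipped.
    """
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append(
                (m["dev"], int(m["soft"]), int(m["hard"]), int(m["transport"]))
            )
    return rows


def failing_devices(rows):
    """Return devices with any non-zero counter, highest transport count first."""
    bad = [r for r in rows if any(r[1:])]
    return sorted(bad, key=lambda r: r[3], reverse=True)


if __name__ == "__main__":
    rows = parse_iostat_errors(sys.stdin.read())
    bad = failing_devices(rows)
    print(f"{len(bad)} of {len(rows)} devices report errors")
    for dev, soft, hard, transport in bad:
        print(f"{dev}  soft={soft}  hard={hard}  transport={transport}")
```

Saved under a hypothetical name such as tally_errors.py, it could be fed with: iostat -En | grep ^c6 | python3 tally_errors.py. Sorting by transport errors is only one possible choice; mapping the affected devices to their physical slots might say more about whether the errors cluster on one backplane or expander.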
The odd thing is - we have an identical, only one year older, system running and it does not fail.
We have already thrown the ball to SuperMicro, since the time invested in this has become ridiculous - the savings from buying cheaper hardware are long gone.
If anyone has any idea, I'd love to hear it. However small it may be, it may still prove significant in solving this.
Bestu kveðjur / Best regards
Þór Sigurðsson Thor Sigurdsson
Tölvunarfræðingur M.Sc. Computer Scientist M.Sc.
Reiknistofa í veðurfræði - Belgingur Institute of Meteorological Research
http://www.belgingur.is =*= http://www.riv.is