[OpenIndiana-discuss] disconnected drives, how to avoid in the future?

Tue Jan 10 16:48:02 UTC 2012

 > From: Jason Matthews <jason at broken.net>
 > Date: Tue, 10 Jan 2012 08:26:08 -0800
 > 
 > 
 > you can adjust the disk timeouts in solaris. 

Here's an article on how to do that, although it ends with the author
adding this comment "However in testing with failing harddrives (on
mpt_sas anyway), we see that the sd timeouts are completely ignored so
my entire post above is moot!"

  http://blogs.everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/

I haven't tested this, so does it work or not (in OpenIndiana)?
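
For a rough sense of why the timeout value matters, here is a
back-of-the-envelope sketch in Python. It is simple arithmetic, not a
measurement: it assumes the driver waits one full timeout on the first
attempt and on every retry. The function name and the example numbers
(60 s with 5 retries, 5 s with 1 retry) are illustrative assumptions,
not values read from any system, and per the article's closing comment
the sd timeout may not be honoured at all on mpt_sas.

```python
def worst_case_stall(io_timeout_s, retries):
    """Upper bound, in seconds, on how long one command can block.

    Assumes (hypothetically) that the first attempt and each retry
    all wait for the full timeout before giving up.
    """
    return io_timeout_s * (1 + retries)


# Illustrative numbers only; check the real settings on your system.
long_timeouts = worst_case_stall(io_timeout_s=60, retries=5)
short_timeouts = worst_case_stall(io_timeout_s=5, retries=1)

print("long timeouts: ", long_timeouts, "seconds per stuck command")
print("short timeouts:", short_timeouts, "seconds per stuck command")
```

Under those assumptions a single stuck command can hold up I/O for
minutes with long timeouts, but only seconds with narrow ones, which is
the trade-off behind the two approaches quoted below.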

Martin

 > there are two schools of thought here:
 > 
 > 1) accommodate the extremely long timeouts of consumer drives and
 > let the drive decide whether to report an error back (fail itself
 > out)
 > 
 > 2) set the timeouts very narrowly and be aggressive in letting zfs
 > fail out disks.
 > 
 > i generally go with option 2. 
 > 
 > Sent from Jasons' hand held
 > 
 > On Jan 10, 2012, at 7:13 AM, Maurilio Longo <maurilio.longo at libero.it> wrote:
 > 
 > > Geoff,
 > > 
 > > I've hit this problem several times in the past, with OpenSolaris
 > > and then with OpenIndiana.
 > > 
 > > There are, to my knowledge, no available solutions; it is so by
 > > design!
 > > 
 > > If a disk stops responding, the pool waits until it responds
 > > again (sometimes pulling it out of its slot and reinserting it
 > > resets the link and it starts working again).
 > > 
 > > I was not able to assess what happens if I set failmode to continue.
 > > 
 > > I suspect it would be no better, since you still cannot write to the pool.
 > > 
 > > This is IMHO the biggest problem of ZFS, in that I cannot
 > > instruct it to stop using a failed device if it has some level of
 > > redundancy still available.
 > > 
 > > Waiting is OK only if an entire vdev stops responding, not if a
 > > disk in a vdev with redundancy has problems, whether fatal or
 > > transitory.
 > > 
 > > Best regards.
 > > 
 > > Maurilio.
 > > 
 > > 
 > > PS. Using server grade disks (those with TLER) makes it possible
 > > to overcome this problem for transitory errors.
 > > 
 > > 
 > > Geoff Nordli wrote:
 > > 
 > >> Part of my concern is why one disk would have completely brought
 > >> down the system.  I have seen this come up on the list before,
 > >> but I don't remember any resolutions to fixing it.
 > >> 
 > >> Anyone have any clues to try to prevent this from happening in
 > >> the future?
 > >> 
 > >> thanks,
 > >> 
 > >> Geoff