[OpenIndiana-discuss] ZFS & SMART error

Richard L. Hamilton rlhamil at smart.net
Fri Jul 6 07:59:09 UTC 2012


On Jul 6, 2012, at 3:12 AM, Richard Elling wrote:

> On Jul 5, 2012, at 10:13 AM, Reginald Beardsley wrote:
> 
>> I had a power failure last night.  The UPS alarms woke me up and I powered down the systems. (some day I really will automate shutdowns)  It's also been quite hot (90 F) in the room where the computer is.
>> 
>> At boot the BIOS on the HP Z400 running Solaris 10 reported:
>> 
>> 1720 SMART drive detects imminent failure
>> failing drive SATA 1 (black)
>> failing attribute # 05
>> 
>> The drive is a year-old 3 TB Hitachi that I use as a scratch drive.  "zpool scrub" showed no errors.
>> 
>> Nothing critical on the disk; I just use it to hold intermediate files when working on large datasets.
>> 
>> Unfortunately, "apropos smart" produced nothing useful.  Where would I find more information, particularly about the interaction between ZFS & SMART?
> 
> In current OpenIndiana (and illumos), predictive failure notices are sent to FMA as
> error reports. However, there is no consumer of those reports -- they are blissfully
> ignored :-(  Eventually, when the disk really dies, other FMA agents will react, including
> ZFS sparing, as appropriate.
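
For what it's worth, those reports can still be inspected by hand.
Something along these lines (nothing here is specific to this disk or
machine) shows what FMA has seen and what, if anything, it has diagnosed:

    fmdump -e       # one line per error report in the error log
    fmdump -eV      # the same reports in full detail
    fmadm faulty    # resources FMA has actually diagnosed as faulty
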
> 
>> There have apparently been some firmware issues that produce spurious warnings, but HP is so Windows-centric that it's hard to make any sense of their support site.
>> 
>> Is there a Solaris/OI utility for reading the SMART information or doing other diagnosis?
>> 
>> The message relates to automatic bad block reallocation.  Would running a scan with format(1M) and mapping out the bad blocks be likely to correct this?  That's what we did way back when, though I hate to think how long a scan of 3 TB would take.
> 
> No. On modern disks, bad-block remapping is done by the disk itself. What you
> see in format is for very old disks, where sector management was done by the OS.
> -- richard
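
As for a utility that reads the SMART data itself: smartmontools is the
usual answer, if it's packaged or builds on your release.  I haven't
tried it on that box, so treat the device path and options below as
guesses to adapt:

    # full attribute table; attribute 5 is the reallocated-sector
    # count the BIOS is complaining about
    smartctl -a /dev/rdsk/c1t1d0s0

    # some SATA controllers need the SAT pass-through spelled out
    smartctl -d sat,12 -a /dev/rdsk/c1t1d0s0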


It's been my impression on a few occasions that a disk with very limited damage can have its bad areas discovered and effectively repaired by a scan.  Even on a semi-modern disk (e.g. an older Fibre Channel drive), the manufacturer's and grown defect lists can be extracted; and even when the disk itself handles sparing, a scan can force that to happen wherever it's needed.

However, if a disk's surface has taken damage, there's always the chance that its internals were polluted with the debris, or that the damaged area is irregular enough to lead to further damage.  So I certainly wouldn't trust a disk that gave ANY significant indication of failure for any purpose where it was critical; whether the cause is mechanical damage, wonky electronics, or firmware doesn't much matter in the long run.  (A glitch or two after a power failure is acceptable, but not a count that keeps increasing afterward.  I would think most modern disks can manage a soft landing, as it were, on power failure, minimizing the risk of _mechanical_ damage, although noise might scramble any data being written at the time.)
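
If you do want to force the drive over every sector, the read test in
format's analyze menu is one way to do it; it's non-destructive, though
on 3 TB it will take a good while.  Roughly (menu names from memory, so
double-check before doing anything):

    # format                 (then pick the suspect disk)
    format> analyze
    analyze> read            (read-only surface scan of the whole disk)
    analyze> quit
    format> quit

On SCSI/FC disks, the defect menu in the same utility will print the
primary (manufacturer's) and grown defect lists.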

It's not 100% clear to me from the man page (no time to dig through the code just now) whether zpool scrub examines free blocks as well as blocks that are in use; if not, then a scan with format could potentially spot additional problems.
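
A cruder way to touch every sector, allocated or not, is simply to read
the whole raw device and discard the data (device name is only an
example; p0 is the x86 whole-disk convention):

    dd if=/dev/rdsk/c1t1d0p0 of=/dev/null bs=1048576

Any sector the drive has to retry or remap along the way should then
show up in its SMART counters.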

-- 
eMail:				mailto:rlhamil at smart.net
Home page:			http://www.smart.net/~rlhamil/
Facebook, MySpace,
AIM, Yahoo, etc:		ask