[OpenIndiana-discuss] zpool and HDD problems

Rainer Heilke rheilke at dragonhearth.com
Tue Oct 6 01:14:07 UTC 2015



On 04/10/2015 9:17 PM, Jim Klimov wrote:
>
> On 5 October 2015 at 0:27:38 CEST, Rainer Heilke
> <rheilke at dragonhearth.com> wrote:
>> Greetings. I've recently had three hard drives fail in my server.
>> One was the OS disk, so I just reinstalled. The other two, however,
>> were each one-half of zpool mirrors. They are the problem disks.
>>
>> Both have been replaced, but now I cannot seem to work with them.
>> In format -e, they are giving errors, specifically:
>>
>>        1. c3d1 <drive type unknown>
>>           /pci@0,0/pci-ide@11/ide@0/cmdk@1,0
>> and
>>        7. c7d1 <drive type unknown>
>>           /pci@0,0/pci-ide@14,1/ide@0/cmdk@1,0
>>
>> There is also a third disk erroring out:
>>
>>        3. c5t9d1 <SS330055-99JJXXK-0001 cyl 60797 alt 2 hd 255 sec 63>
>>           /pci@0,0/pci1002,5a17@3/pci1000,9240@0/sd@9,1
>>
>> I am suspecting c3d1 to be an old OS mirror, due to the low
>> controller number.
>>
>> When I select 1 or 7, I get a Segmentation fault and get booted out
>> of the format utility. (If I select 3, the format utility never comes
>> back, freezing.) A zpool status shows:
>>
>>   pool: Pool1
>>  state: ONLINE
>> status: Some supported features are not enabled on the pool. The pool
>>         can still be used, but some features are unavailable.
>> action: Enable all features using 'zpool upgrade'. Once this is done,
>>         the pool may no longer be accessible by software that does not
>>         support the features. See zpool-features(5) for details.
>>   scan: resilvered 2.78M in 0h0m with 0 errors on Tue Sep 16 14:11:00 2014
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         Pool1       ONLINE       0     0     0
>>           c5t8d1    ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>>   pool: data
>>  state: DEGRADED
>> status: One or more devices has experienced an error resulting in data
>>         corruption.  Applications may be affected.
>> action: Restore the file in question if possible. Otherwise restore the
>>         entire pool from backup.
>>    see: http://illumos.org/msg/ZFS-8000-8A
>>   scan: resilvered 36.1M in 0h17m with 738 errors on Thu Oct  1 18:17:43 2015
>> config:
>>
>>         NAME                     STATE     READ WRITE CKSUM
>>         data                     DEGRADED 20.6K     0     0
>>           mirror-0               DEGRADED 81.8K     0     0
>>             7152018192933189428  FAULTED      0     0     0  was /dev/dsk/c11t8d1s0
>>             c6d0                 ONLINE       0     0 81.8K
>>
>> errors: 737 data errors, use '-v' for a list
>>
>> (Doing a zpool status -v freezes the terminal.)
>>
>> The system has three disks connected to an LSI MegaRAID SAS
>> 9240-8i controller.
>>
>> I am suspecting that disk 3 (c5t9d1) might be the detached mirror
>> of Pool1 ( c5t8d1), but being unable to work with it, I cannot
>> verify this. I have no idea on how to deal with the data mirror.
>> Should I just detach /dev/dsk/c11t8d1s0 ( 7152018192933189428) and
>> hope that c6d0 will be clean enough for a decent scrub? Or is
>> /dev/dsk/c11t8d1s0 ( 7152018192933189428) the disk with the less
>> corrupted data? Not being able to even get a listing (ls) of the
>> data pool leaves me very hesitant.
>>
>> Does anyone have any ideas on how to clean this up?
>>
>> Thanks in advance, Rainer
>
> As already noted - part of the problem may be IDE access mode: e.g.
> are your disks modern and large (over 2TB IIRC)?
>
> Did you rescan OS devices (devfsadm -Cv)?
>
> Did you try other partitioning programs (parted, fdisk) to see if you
> can access the new disks at all, and in particular to verify that zfs
> managed to create its partitioning? In the worst case you might have to
> define an mbr/efi 'solaris' partition yourself and use it (as
> cXtYdZpN) directly as a pool vdev, or use format afterwards to define
> slices inside that partition and use cXtYdZs0 on the disk. I wrote
> some howtos about 'advanced setup' on the OI wiki that can help you get
> started.
>
> But first I'd verify that all components work, including hardware. Maybe
> the cabling needs to be re-plugged, the box may need a vacuum cleaner
> (or rather a blow-out), or the power supply has nearly died (aged
> capacitors, etc.).
>
> Jim -- Typos courtesy of K-9 Mail on my Samsung Android
>
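
Just so I'm reading the "advanced setup" suggestion right: once the new
disks can be opened at all, I think the sequence would be roughly the
following (using c3d1 as the example; completely untested on my side,
since format still segfaults on that disk):

    # write a single Solaris fdisk partition spanning the whole disk
    fdisk -B /dev/rdsk/c3d1p0

    # then either use the partition (c3d1p1) directly as the vdev, or lay
    # out slices inside it with format and use s0 to heal the mirror, e.g.
    zpool replace data 7152018192933189428 c3d1s0

For now, though, I can't even get that far.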

I rescanned using devfsadm (again; I had done it before, but there's no
harm in running it again). It doesn't seem to have found anything new.

parted and fdisk either say the device doesn't exist, or they can't open 
it. ls -al /dev/rdsk | grep <target> does show full entries for all of 
the disks.
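
(As far as I can tell, the /dev/rdsk entries are just symlinks into
/devices, so ls only shows that the links exist; it never opens the
device the way format, parted, and fdisk do. A quick way to separate the
two, if I'm thinking about this right, with c3d1 as the example:

    # dereference the link -- does the underlying /devices node exist?
    ls -lL /dev/rdsk/c3d1p0

    # try a raw open and read of the first sector
    dd if=/dev/rdsk/c3d1p0 of=/dev/null bs=512 count=1

That should tell "the link is there" apart from "the device actually
answers".)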

All of the drives are spinning, the cabling has been checked (yet again),
the box is clean inside (it got a good blowout when the CMOS battery issue
was found), and the power supply was load tested with 9 drives as part of
the troubleshooting that finally found the battery problem. Those load
tests ran overnight. Finally, each HDD power lead has both functional and
problematic drives on it (though I may want to triple-check that).
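
The next thing I plan to look at is the error telemetry, to see whether
the kernel is logging transport errors against those targets (which would
point back at cabling or the controller). Something along the lines of:

    # per-device soft/hard/transport error counters
    iostat -En

    # error reports and any diagnosed faults from the fault manager
    fmdump -eV | less
    fmadm faulty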

I'm at a bit of a loss. Bad SATA cables? And why does the listing (ls -al)
see the drives when parted, fdisk, and format have trouble with them? (I'm
just trying to work out the logic of that in my head; it wasn't a real
question to the list.)
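
I'll also try cfgadm to see whether the controller ports report anything
at all for the problem targets, though I don't know offhand whether the
legacy pci-ide or the MegaRAID targets even show up there:

    cfgadm -al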

Rainer

-- 
Put your makeup on and fix your hair up pretty,
And meet me tonight in Atlantic  City
			Bruce Springsteen



