[OpenIndiana-discuss] Defect zpool after replacing hard discs
thorsten.heit at vkb.de
Fri Jul 19 07:13:16 UTC 2013
Hi,
I hope one of you can shed some light on a problem I now have with a
zpool...
I have an HP ProLiant server with 3x 160GB discs configured as a hardware
RAID5 on the built-in Smart Array E200i controller. Another three 500GB
discs are attached to the same controller, each exposed as a single-disc
(controller-side) RAID 0 and configured in the OS as a raidz1 pool.
Recently one of the 500GB discs was marked offline in the zpool, as I
could see on the console with "zpool status -v"; obviously because it had
repeatedly thrown errors.
To make sure the disc really was faulty, I rebooted into a Linux live CD
image I had at hand (Ubuntu 9.10, to be precise) and tried a simple "dd
if=/dev/... of=/dev/null", which resulted in a read error after a short
time. Afterwards the hardware RAID controller took the disc offline and
marked it as faulty - a red warning light on the disc cage was turned on.
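For reference, the read test was essentially the following sketch; /dev/sdX is only a placeholder for whatever device node the live CD assigned to the suspect disc, and the snippet skips itself if that node isn't there:

```shell
# Read the whole disc once and discard the data; a failing disc
# typically aborts dd with an I/O error partway through.
# /dev/sdX is a placeholder -- substitute the real device node.
DISK="${DISK:-/dev/sdX}"
if [ -r "$DISK" ]; then
    if dd if="$DISK" of=/dev/null bs=1M; then
        echo "read test passed"
    else
        echo "read test FAILED"
    fi
else
    echo "device $DISK not present; skipping"
fi
```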
Because spare parts from HP are exorbitantly expensive and come with only
6 months of warranty, I thought I'd better buy three new, bigger discs to
replace the old 500GB ones, and ended up buying 3x WD1003FBYX from Western
Digital (enterprise discs, 1TB) that come with a 5-year warranty.
I took the faulty disc offline ("zpool offline daten c3t1d0"), replaced it
with the newer, bigger one, booted the server, and started the resilvering
via "zpool replace daten c3t1d0". After some hours everything seemed to be
fine and the zpool was healthy again. Then the same procedure with the
second disc: offline, replace, reboot, resilver, wait => finished, the
pool was healthy ("zpool status -v").
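In condensed form, the cycle per disc was roughly the following (a sketch using my pool and device names; guarded so it does nothing on a machine without ZFS):

```shell
# One replacement cycle for pool "daten", disc c3t1d0.
if command -v zpool >/dev/null 2>&1; then
    zpool offline daten c3t1d0   # take the failing disc out of service
    # ...power off, physically swap the disc, boot again...
    zpool replace daten c3t1d0   # resilver onto the new disc in the same slot
    zpool status -v daten        # repeat until the resilver has finished
else
    echo "zpool not available on this system"
fi
```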
The problems began when I repeated the procedure with the last disc:
A few minutes after the resilvering had started, messages appeared on the
console (also available in the logs under /var/adm) that made me feel
something bad was about to happen:
NOTICE: Smart Array E200i Controller
Fatal drive error, Port: 2I Box: 1 Bay: 5
Because it was time to leave work, I let the system continue. When I came
back the next day to check the results, it was quite plain that something
had really gone wrong:
"zpool status -v" listed several million checksum errors on c3t1d0 and
c3t2d0, the two already (successfully) replaced discs. IIRC one of them
also had lots of write errors and was taken offline, and zpool told me the
pool was damaged :-(
I rebooted into the Linux CD again to check whether there really was a
hardware error on the replaced discs - I couldn't imagine that a new disc
(manufactured at the end of May this year, according to a label on it)
that had been working fine until a reboot was suddenly faulty. I ran the
already mentioned dd if=/dev/... of=/dev/null, but that seemed to work
for c3t1d0. Strange.
One reboot later the hardware RAID controller immediately marked that disc
as faulty for an unknown reason and took it offline. I powered down the
machine, removed the disc, attached it to another port, but the result
stayed the same: the controller won't use the disc anymore.
Now I have the following problem:
Theoretically the two working 500GB discs still contain the original pool
data, but I cannot import the pool anymore after putting those discs back
into the server:
root@7iv05-server-1:~# zpool import
   pool: daten
     id: 6681551211158838832
  state: UNAVAIL
 status: One or more devices are unavailable.
 action: The pool cannot be imported due to unavailable devices or data.
 config:

        daten       UNAVAIL  insufficient replicas
          raidz1-0  UNAVAIL  insufficient replicas
            c3t1d0  UNAVAIL  cannot open
            c3t2d0  UNAVAIL  corrupted data
            c3t3d0  ONLINE

device details:

        c3t1d0  UNAVAIL  cannot open
        status: ZFS detected errors on this device.
                The device was missing.

        c3t2d0  UNAVAIL  corrupted data
        status: ZFS detected errors on this device.
                The device has bad label or disk contents.

root@7iv05-server-1:~#
c3t1d0 is the defective one; the others should still be working, and
c3t3d0 is the one that was to be replaced last. But why does zpool report
corrupted data?
Similar problems arise when I replace c3t1d0 and c3t2d0 with the new
discs, boot, and try "zpool import". The first disc isn't activated by the
controller (see above), I don't know why, and the cpqary3 driver reports
the following message while the system is booting:
Physical drive failure, Port: 1I Box: 1 Bay: 4
Failure Reason............ Not Ready - Bad Sense Code
I.e. zpool doesn't see the disc because it is obviously offline; the
second disc is marked as online, and the third is claimed to contain
corrupted data. In the meantime I have tried several combinations of old
and new discs, but after booting at most one of them is marked as online;
the others are unavailable or otherwise offline.
Do you know a trick to reactivate the old pool, at least temporarily, so
I can back up its data?
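One thing I haven't tried yet - assuming the options behave here the way the man page describes them - is a read-only and/or recovery-mode import, just to copy the data off (guarded so the sketch is a no-op without ZFS):

```shell
# Untested idea: import the pool read-only so nothing gets written to it.
if command -v zpool >/dev/null 2>&1; then
    zpool import -o readonly=on daten
    # If that still fails, -F tries to rewind to the last importable
    # transaction group, discarding the most recent writes:
    # zpool import -F -o readonly=on daten
else
    echo "zpool not available on this system"
fi
```

Would that be safe to attempt on this pool, or could it make things worse?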
Regards
Thorsten