[OpenIndiana-discuss] Defect zpool after replacing hard discs
thorsten.heit at vkb.de
Fri Jul 19 07:13:16 UTC 2013
Hi,
I hope one of you can shed some light on a problem I now have with a
zpool...
I have an HP ProLiant server with 3x 160GB discs configured as a hardware
RAID5 on the built-in Smart Array E200i controller. Another three 500GB
discs are attached to the same controller, each exposed as a single-disc
(controller-side) RAID 0 and configured in the OS as a raidz1 pool.
Recently one of the 500GB discs was marked offline in the zpool, as I
could see on the console with "zpool status -v"; obviously because it had
repeatedly thrown errors.
To make sure the disc really was faulty, I rebooted into a Linux live CD
image I had at hand (Ubuntu 9.10, to be precise) and tried a simple "dd
if=/dev/... of=/dev/null", which resulted in a read error after a short
time. Afterwards the hardware RAID controller took the disc offline and
marked it as faulty - a red warning light on the disc cage was turned on.
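For reference, the read test was essentially the following sketch; /dev/sdX is only a placeholder for whatever device node the live CD assigned to the suspect disc, and the snippet skips itself if that node isn't there:

```shell
# Read the whole disc once and discard the data; a failing disc
# typically aborts dd with an I/O error partway through.
# /dev/sdX is a placeholder -- substitute the real device node.
DISK="${DISK:-/dev/sdX}"
if [ -r "$DISK" ]; then
    if dd if="$DISK" of=/dev/null bs=1M; then
        echo "read test passed"
    else
        echo "read test FAILED"
    fi
else
    echo "device $DISK not present; skipping"
fi
```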
Because spare parts from HP are exorbitantly expensive and come with only
6 months of warranty, I thought I'd better buy three new, bigger discs to
replace the old 500GB ones, and ended up buying 3x WD1003FBYX from Western
Digital (enterprise discs, 1TB) that come with a 5-year warranty.
I took the faulty disc offline ("zpool offline daten c3t1d0"), replaced it
with the newer, bigger one, booted the server, and started the resilvering
via "zpool replace daten c3t1d0". After some hours everything seemed to be
fine and the zpool was healthy again. Then the same procedure with the
second disc: offline, replace, reboot, resilver, wait => finished, the
pool was healthy ("zpool status -v").
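In condensed form, the cycle per disc was roughly the following (a sketch using my pool and device names; guarded so it does nothing on a machine without ZFS):

```shell
# One replacement cycle for pool "daten", disc c3t1d0.
if command -v zpool >/dev/null 2>&1; then
    zpool offline daten c3t1d0   # take the failing disc out of service
    # ...power off, physically swap the disc, boot again...
    zpool replace daten c3t1d0   # resilver onto the new disc in the same slot
    zpool status -v daten        # repeat until the resilver has finished
else
    echo "zpool not available on this system"
fi
```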
The problems began when I repeated the procedure with the last disc:
A few minutes after the resilvering had started, messages appeared on the
console (also available in the logs under /var/adm) that made me feel
something bad was about to happen:
NOTICE: Smart Array E200i Controller
Fatal drive error, Port: 2I Box: 1 Bay: 5
Because it was time to leave work, I let the system continue. When I came
back the next day to check the results, it was quite plain that something
had really gone wrong:
"zpool status -v" listed several million checksum errors on c3t1d0 and
c3t2d0, the two already (successfully) replaced discs. IIRC one of them
also had lots of write errors and was taken offline, and zpool told me the
pool was damaged :-(
I rebooted into the Linux CD again to check whether there really was a
hardware error on the replaced discs - I couldn't imagine that a new disc
(manufactured at the end of May this year, according to a label on it)
that had been working fine until a reboot was suddenly faulty. I ran the
already mentioned dd if=/dev/... of=/dev/null, but that seemed to work
for c3t1d0. Strange.
One reboot later the hardware RAID controller immediately marked that disc
as faulty for an unknown reason and took it offline. I powered down the
machine, removed the disc, attached it to another port, but the result
stayed the same: the controller won't use the disc anymore.
Now I have the following problem:
Theoretically the two working 500GB discs still contain the original pool
data, but I cannot import the pool anymore after putting those discs back
into the server:
root@7iv05-server-1:~# zpool import
   pool: daten
     id: 6681551211158838832
  state: UNAVAIL
 status: One or more devices are unavailable.
 action: The pool cannot be imported due to unavailable devices or data.
 config:

        daten       UNAVAIL  insufficient replicas
          raidz1-0  UNAVAIL  insufficient replicas
            c3t1d0  UNAVAIL  cannot open
            c3t2d0  UNAVAIL  corrupted data
            c3t3d0  ONLINE

device details:

        c3t1d0  UNAVAIL  cannot open
        status: ZFS detected errors on this device.
                The device was missing.

        c3t2d0  UNAVAIL  corrupted data
        status: ZFS detected errors on this device.
                The device has bad label or disk contents.

root@7iv05-server-1:~#
c3t1d0 is the defective one; the others should still be working, and
c3t3d0 is the one that was to be replaced last. But why does zpool report
corrupted data?
Similar problems arise when I replace c3t1d0 and c3t2d0 with the new
discs, boot, and try "zpool import". The first disc isn't activated by the
controller (see above), I don't know why, and the cpqary3 driver reports
the following message while the system is booting:
Physical drive failure, Port: 1I Box: 1 Bay: 4
Failure Reason............ Not Ready - Bad Sense Code
I.e. zpool doesn't see the disc because it is obviously offline; the
second disc is marked as online, and the third is claimed to contain
corrupted data. In the meantime I have tried several combinations of old
and new discs, but after booting at most one of them is marked as online;
the others are unavailable or otherwise offline.
Do you know a trick to reactivate the old pool, at least temporarily, so
I can back up its data?
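One thing I haven't tried yet - assuming the options behave here the way the man page describes them - is a read-only and/or recovery-mode import, just to copy the data off (guarded so the sketch is a no-op without ZFS):

```shell
# Untested idea: import the pool read-only so nothing gets written to it.
if command -v zpool >/dev/null 2>&1; then
    zpool import -o readonly=on daten
    # If that still fails, -F tries to rewind to the last importable
    # transaction group, discarding the most recent writes:
    # zpool import -F -o readonly=on daten
else
    echo "zpool not available on this system"
fi
```

Would that be safe to attempt on this pool, or could it make things worse?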
Regards
Thorsten