[OpenIndiana-discuss] Broken zpool
Rainer Heilke
rheilke at dragonhearth.com
Thu Oct 29 04:55:47 UTC 2015
On 28/10/2015 1:47 PM, jason matthews wrote:
>
>
> Let me apologize in advance for intermixing comments.
Ditto.
>>> I am not trying to be a dick (it happens naturally), but if you can't
>>> afford to back up terabytes of data, then you can't afford to have
>>> terabytes of data.
>>
>> That is a meaningless statement that reflects nothing in real-world
>> terms.
> The true cost of a byte of data that you care about is the money you pay
> for the initial storage, and then the money you pay to back it up. For
> work, my front line databases have 64TB of mirrored net storage.
When you said "you," it implied (to me, at least) a home system, since
we've been talking about a home system from the start. Certainly, if it is a
system that a company uses for its data, all of what you say is correct.
But a company, regardless of size, can write these expenses off.
Individuals cannot do that with their home systems. For them, this
paradigm is much vaguer, if it exists at all.
So, while I was talking apples, you were talking parsnips. My apologies
for not making that clearer. (All of that said, the DVD drive has been
acting up. Perhaps a writable Blu-Ray is in the wind. Since their price
has dropped further than the price of oil, that may make backups of
the more important data possible.)
> The
> costs don't stop there. There is another 200TB of net storage dedicated
> to holding enough log data to rebuild the last 18 months from scratch. I
> also have two sets of slaves that snapshot themselves frequently. One
> set is a single disk, the other is raidz. These are not just backups.
> One set runs batch jobs, one runs the front end portal, and the masters
> are in charge of data ingestions.
Don't forget the costs added on by off-site storage, etc. I don't care
how many times the data is backed up, if it's all in the same building
that just burned to the ground... That is, unless your zfs sends are
going to a different site...
> If you don't back up, you set yourself up for unrecoverable problems. In
<snip>
I believe this may be the first time (for me) that simply replacing a
failed drive resulted in data corruption in a zpool. I've certainly
never seen this level of mess before.
> That said, instead of running mirrors, run loose disks and back up to the
> second pool at a frequency you are comfortable with. You need to
> prioritize your resources against your risk tolerance. It is tempting to
> do mirrors because it is sexy, but that might not be the best strategy.
That is something for me to think about. (I don't do *anything* on
computers because it's "sexy." I did mirrors for security; remember,
they hadn't failed for me at such a monumental level previously.)
>> That's an arrogant statement, presuming that if a person doesn't have
>> gobs of money, they shouldn't bother with computers at all.
> I didn't write anything like that. What I am saying is that you need to get
> more creative about how to protect your data. Yes, money makes it easier,
> but you have options.
My apologies; on its own, it came across that way.
>> I am not complaining about the time it takes; I know full well how
>> long it can take. I am complaining that the "resilvering" stops dead.
>> (More on this below.)
> When the scrub is stopped dead, what does "iostat -nMxC 1" look like?
> Are there drives indicating 100% busy? High wait or asvc_t times?
sudo iostat -nMxC 1
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c4
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c4t0d0
23.3 55.2 0.6 0.3 0.2 0.3 2.1 4.5 5 27 c3d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0 0 c3d1
0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.5 0 0 c6d1
360.1 13.1 29.0 0.1 1.3 1.5 3.4 4.0 48 82 c6d0
9.7 330.9 0.0 29.1 0.1 0.6 0.3 1.6 9 52 c7d1
359.9 354.6 28.3 28.5 30.2 3.4 42.2 4.7 85 85 data
23.2 34.9 0.6 0.3 6.2 0.3 106.9 5.6 6 12 rpool
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c4
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c4t0d0
0.0 112.1 0.0 0.3 0.0 0.4 0.0 4.0 0 45 c3d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c3d1
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c6d1
71.0 10.0 2.3 0.0 1.6 1.1 19.8 14.0 54 60 c6d0
40.0 44.0 0.1 2.2 0.2 1.1 1.8 12.8 12 83 c7d1
111.1 58.0 2.4 2.2 18.9 3.5 112.0 20.6 54 54 data
0.0 58.0 0.0 0.3 0.0 0.0 0.0 0.6 0 3 rpool
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c4
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c4t0d0
0.0 187.1 0.0 0.7 0.0 0.8 0.0 4.1 0 74 c3d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c3d1
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c6d1
403.1 0.0 32.9 0.0 1.2 1.8 2.9 4.6 53 97 c6d0
12.0 386.1 0.0 32.9 0.2 0.5 0.4 1.3 12 44 c7d1
415.1 386.1 33.0 32.9 27.6 3.9 34.5 4.9 100 100 data
0.0 98.0 0.0 0.7 0.0 0.1 0.0 1.1 0 8 rpool
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c4
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c4t0d0
0.0 60.0 0.0 0.7 0.0 0.1 0.1 1.8 0 11 c3d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c3d1
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c6d1
399.9 0.0 33.4 0.0 0.7 1.8 1.8 4.6 39 97 c6d0
0.0 401.9 0.0 33.2 0.1 0.4 0.2 1.0 7 40 c7d1
399.9 401.9 33.4 33.2 27.3 3.2 34.0 4.0 100 100 data
0.0 58.0 0.0 0.7 0.4 0.0 7.1 0.6 3 3 rpool
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c4
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c4t0d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c3d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c3d1
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c6d1
381.0 0.0 32.3 0.0 0.9 1.8 2.3 4.8 44 96 c6d0
0.0 384.0 0.0 31.8 0.1 0.4 0.2 1.1 6 42 c7d1
381.0 384.0 32.3 31.8 26.6 3.3 34.8 4.4 100 100 data
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 rpool
So, something IS actually happening, it would seem.
> Do you have any controller errors? Does iostat -En report any errors?
sudo iostat -En
c3d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST3500514NS Revision: Serial No: 9WJ Size: 500.10GB <500101152768 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c3d1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST3000NC000 Revision: Serial No: Z1F Size: 3000.59GB <3000590401536 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c6d1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST4000DM000-1F2 Revision: Serial No: S30 Size: 4000.74GB <4000743161856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c6d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST32000542AS Revision: Serial No: 5XW Size: 2000.37GB <2000371580928 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c7d1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST3000DM001-1ER Revision: Serial No: Z7P Size: 3000.59GB <3000590401536 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c4t0d0 Soft Errors: 0 Hard Errors: 6 Transport Errors: 0
Vendor: TSSTcorp Product: CDDVDW SH-S222A Revision: SB02 Serial No: Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 6 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
> Have you tried mounting the pool ro, stopping the scrub, and then
> copying data off?
It *seems* to be ro now, but I can't be sure. I did:
sudo zfs set readonly=on data
It paused for a few seconds, then gave me a prompt back. It didn't spit
out any errors. But any of my further commands, like trying a copy, froze.
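(As I understand it, readonly=on there only applies to the dataset.
Making the whole pool read-only would mean exporting it and importing it
again read-only, something like the following -- this is just my guess at
what you mean, and I haven't tried it yet:

sudo zpool export data
sudo zpool import -o readonly=on data

If the export hangs the same way the copy did, that would at least tell
me the pool itself is wedged and not just the dataset.)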
> Here are some hail-mary settings that probably won't help. I offer them
> (in no particular order) to try to improve scrub performance,
> minimize the number of enqueued I/Os in case that is exacerbating the
> problem somehow, and limit the amount of time spent on a
> failing I/O. Your scrubs may be stopping because you have a disk that is
> exhibiting a poor failure mode. Namely, some sort of internal error where
> it just keeps retrying, which makes the pool wedge. WD is not the brand I
> go to for enterprise failure modes.
Trust me, I haven't let WD drives anywhere near my computers for quite
some time. (Ironically enough, it was WD drives that, in the early days
of this system, showed me how resilient ZFS mirrors were.) They were
replaced by Seagate or Hitachi drives, or whatever they had that wasn't
WD. I'd rather have trolls hand-chiseling the data into rocks.
> * don't spend more than 8ms on any single i/o
> set sd:sd_io_time=8
> * resilver in 5 second intervals minimum
> set zfs:zfs_resilver_min_time_ms = 5000
> set zfs:zfs_resilver_delay = 0
> * enqueue only 5 I/Os
> set zfs:zfs_top_maxinflight = 2
Thanks, but these all fail with the "pool I/O is currently suspended" error.
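(For the record, my understanding is that those tunables normally go
into /etc/system and only take effect after a reboot -- a sketch,
assuming the stock illumos names quoted above:

set sd:sd_io_time=8
set zfs:zfs_resilver_min_time_ms=5000
set zfs:zfs_resilver_delay=0
set zfs:zfs_top_maxinflight=2

A live change would presumably go through mdb, e.g.
"echo zfs_resilver_delay/W0t0 | sudo mdb -kw", but with pool I/O
suspended I doubt it would get anywhere.)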
> Apply these settings and try to resilver again. If this doesn't work, dd
> the drives to new ones. Using dd will likely identify which drive is
> wedging ZFS, as it will either not complete or it will error out.
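Noted. If it comes to that, I assume the clone would look something like
this, where c6d0 is one of the disks in the mirror and cXdY stands in
for a fresh disk of at least the same size:

sudo dd if=/dev/rdsk/c6d0p0 of=/dev/rdsk/cXdYp0 bs=1024k conv=noerror,sync

The conv=noerror,sync should let it limp past bad sectors, and whichever
source drive stalls or starts throwing errors is presumably the one
wedging the pool.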
>>
> I am not sure who the tech is, but at least two people on this list told
> you to check the CMOS battery. I think Bob and I both recommended changing
> the battery. Others might have as well.
The tech works for the company that sold the system to me, and holds the
warranty for it. I no longer have the ability or hardware/tools to
change the CMOS battery myself. That, plus the system being under
warranty... I did mention this to the guy I was dealing with, but I got
the distinct feeling that either a) he never told the tech, or b) the
tech didn't believe it. IIRC, when I _finally_ talked to the tech, he
sounded quite surprised that the battery was dying, and that it
affected the system so badly. This is one of the joys of doing this
stuff over the phone.
I may be wrong, though. There has been so much crappola happening in the
last year... :-(
> I reviewed your output again. You have two disks in a mirror. Each disk
> is resilvering. This means the two disks are resilvering from each
> other. There are no other options as there are no other vdevs in the pool.
I can see the new disk resilvering from the old one, but why did ZFS
start resilvering the old disk from the new one? Shouldn't it have spat
out a "corrupt data" error and forced me to scrub? This odd state is how
the system booted up. I would obviously have lost some data, but after
the scrub, the pool would at least have been functional enough to do a
replace (and then a resilver).
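What I was expecting to be able to do, roughly (a sketch only, with the
device names as placeholders), was:

sudo zpool scrub data
sudo zpool status -v data                       # see what, if anything, was lost
sudo zpool replace data <old-disk> <new-disk>   # then let the new disk resilver

Instead, it went straight into resilvering both sides of the mirror on
its own.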
> ZFS suspects both disks could be dirty,
I can go along with that. Some data on the original drive is corrupt,
and the new disk doesn't have any data (a special form of "corrupt").
> that's why it is resilvering
> both drives. This puts the drives under heavy load. This load is
> probably surfacing an internal error on one or more drives, but because
> WD has crappy failure modes it is not sending the error to the OS.
> Internally, the drive keeps retrying with errors and the OS keeps
> waiting for the drive to return from the write flush. This is likely what is
> wedging the pool. The problem is likely on the drive -- but I can't say
> that with certainty. Certainty is a big word.
Again, no WD cow patties.
> There is another option, which has the potential to make your data
> diverge across the two disks if you don't mount them read-only.
>
> Basically, reboot the system with just one of the vdevs installed and
> mounted read-only.
I'm going to try this route and see what I can get it to do. So far, it
hasn't locked up on a command. There's something curious, though. When I
try a zpool status -v, it tells me:
errors: List of errors unavailable (insufficient privileges)
It gives me that when running under my user ID, doing it via sudo, and
even when I su to root. The *first* time I ran it (using sudo), it
told me there were 50056 data errors. Every run after that, the error
count is not given; it only appears the very first time after boot-up.
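(Next time the box comes up and the full list does print, I think I'll
capture it right away, something along the lines of:

sudo zpool status -v data > /tmp/zpool-errors-$(date +%Y%m%d).txt

since it apparently only shows the detailed error list on that first run.)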
> If you are in the SF bay area and want to bring it by my office, I am
> happy to take a stab at it, after you back up your original disks (for
> liability reasons). I can provide a clean working "text book" system if
> needed, bring your system and drives and we can likely salvage it one
> way or another.
Thank you very much for the offer, but I'm a couple thousand miles or so
north of you.
I have noticed one thing, though: the resilvering numbers _are_ actually
increasing now. Since the original disk (all others disconnected) is
actually showing a change since yesterday, I'm going to pack it in for
the night and see where the count has gotten to tomorrow evening. It
says "5h11m to go", but I strongly suspect it will be longer. I'll make
a note of where it's at right now.
Rainer
--
Put your makeup on and fix your hair up pretty,
And meet me tonight in Atlantic City
Bruce Springsteen