[OpenIndiana-discuss] Broken zpool

Rainer Heilke rheilke at dragonhearth.com
Thu Oct 29 04:55:47 UTC 2015


On 28/10/2015 1:47 PM, jason matthews wrote:
>
>
> Let me apologize in advance for inter-mixing comments.
Ditto.

>>> I am not trying to be a dick (it happens naturally), but if you can't
>>> afford to back up terabytes of data, then you can't afford to have
>>> terabytes of data.
>>
>> That is a meaningless statement that reflects nothing in real-world
>> terms.
> The true cost of a byte of data that you care about is the money you pay
> for the initial storage, and then the money you pay to back it up. For
> work, my front line databases have 64TB of mirrored net storage.

When you said "you," it implied (to me, at least) a home system, since 
we've been talking about a home system from the start. Certainly, if it is a 
system that a company uses for its data, everything you say is correct. 
But a company, regardless of size, can write these expenses off. 
Individuals cannot do that with their home systems. For them, this 
paradigm is much vaguer, if it exists at all.

So, while I was talking apples, you were talking parsnips. My apologies 
for not making that clearer. (All of that said, the DVD drive has been 
acting up. Perhaps a writable Blu-Ray is in the wind. Since the price of 
them has dropped further than the price of oil, that may make backups of 
the more important data possible.)

> The
> costs don't stop there. There is another 200TB of net storage dedicated
> to holding enough log data to rebuild the last 18 months from scratch. I
> also have two sets of slaves that snapshot themselves frequently. One
> set is a single disk, the other is raidz. These are not just backups.
> One set runs batch jobs, one runs the front end portal, and the masters
> are in charge of data ingestion.

Don't forget the costs added on by off-site storage, etc. I don't care 
how many times the data is backed up, if it's all in the same building 
that just burned to the ground... That is, unless your zfs sends are 
going to a different site...

> If you don't back up, you set yourself up for unrecoverable problems. In
<snip>
I believe this may be the first time (for me) that simply replacing a 
failed drive resulted in data corruption in a zpool. I've certainly 
never seen this level of mess before.

> That said, instead of running mirrors run loose disks and backup to the
> second pool at a frequency you are comfortable with. You need to
> prioritize your resources against your risk tolerance. It is tempting to
> do mirrors because it is sexy but that might not be the best strategy.

That is something for me to think about. (I don't do *anything* on 
computers because it's "sexy." I did mirrors for security; remember, 
they hadn't failed for me at such a monumental level previously.)

>> That's an arrogant statement, presuming that if a person doesn't have
>> gobs of money, they shouldn't bother with computers at all.
> I didn't write anything like that. What I am saying is you need to get
> more creative on how to protect your data. Yes, money makes it easier,
> but you have options.

My apologies; on its own, it came across that way.

>> I am not complaining about the time it takes; I know full well how
>> long it can take. I am complaining that the "resilvering" stops dead.
>> (More on this below.)

> When the scrub is stopped dead, what does "iostat -nMxC  1" look like?
> Are there drives indicating 100% busy? high wait or asvc_t times?

  sudo iostat -nMxC  1
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
    23.3   55.2    0.6    0.3  0.2  0.3    2.1    4.5   5  27 c3d0
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.2   0   0 c3d1
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    5.5   0   0 c6d1
   360.1   13.1   29.0    0.1  1.3  1.5    3.4    4.0  48  82 c6d0
     9.7  330.9    0.0   29.1  0.1  0.6    0.3    1.6   9  52 c7d1
   359.9  354.6   28.3   28.5 30.2  3.4   42.2    4.7  85  85 data
    23.2   34.9    0.6    0.3  6.2  0.3  106.9    5.6   6  12 rpool
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
     0.0  112.1    0.0    0.3  0.0  0.4    0.0    4.0   0  45 c3d0
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3d1
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c6d1
    71.0   10.0    2.3    0.0  1.6  1.1   19.8   14.0  54  60 c6d0
    40.0   44.0    0.1    2.2  0.2  1.1    1.8   12.8  12  83 c7d1
   111.1   58.0    2.4    2.2 18.9  3.5  112.0   20.6  54  54 data
     0.0   58.0    0.0    0.3  0.0  0.0    0.0    0.6   0   3 rpool
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
     0.0  187.1    0.0    0.7  0.0  0.8    0.0    4.1   0  74 c3d0
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3d1
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c6d1
   403.1    0.0   32.9    0.0  1.2  1.8    2.9    4.6  53  97 c6d0
    12.0  386.1    0.0   32.9  0.2  0.5    0.4    1.3  12  44 c7d1
   415.1  386.1   33.0   32.9 27.6  3.9   34.5    4.9 100 100 data
     0.0   98.0    0.0    0.7  0.0  0.1    0.0    1.1   0   8 rpool
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
     0.0   60.0    0.0    0.7  0.0  0.1    0.1    1.8   0  11 c3d0
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3d1
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c6d1
   399.9    0.0   33.4    0.0  0.7  1.8    1.8    4.6  39  97 c6d0
     0.0  401.9    0.0   33.2  0.1  0.4    0.2    1.0   7  40 c7d1
   399.9  401.9   33.4   33.2 27.3  3.2   34.0    4.0 100 100 data
     0.0   58.0    0.0    0.7  0.4  0.0    7.1    0.6   3   3 rpool
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3d0
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3d1
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c6d1
   381.0    0.0   32.3    0.0  0.9  1.8    2.3    4.8  44  96 c6d0
     0.0  384.0    0.0   31.8  0.1  0.4    0.2    1.1   6  42 c7d1
   381.0  384.0   32.3   31.8 26.6  3.3   34.8    4.4 100 100 data
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 rpool

So, something IS actually happening, it would seem.

> Do you have any controller errors? Does iostat -En report any errors?

sudo iostat -En
c3d0             Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST3500514NS Revision: Serial No: 9WJ Size: 500.10GB <500101152768 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c3d1             Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST3000NC000 Revision: Serial No: Z1F Size: 3000.59GB <3000590401536 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c6d1             Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST4000DM000-1F2 Revision: Serial No: S30 Size: 4000.74GB <4000743161856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c6d0             Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST32000542AS Revision: Serial No: 5XW Size: 2000.37GB <2000371580928 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c7d1             Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST3000DM001-1ER Revision: Serial No: Z7P Size: 3000.59GB <3000590401536 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c4t0d0           Soft Errors: 0 Hard Errors: 6 Transport Errors: 0
Vendor: TSSTcorp Product: CDDVDW SH-S222A  Revision: SB02 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 6 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0


> Have you tried mounting the pool ro, stopping the scrub, and then
> copying data off?

It *seems* to be ro now, but I can't be sure. I did:
sudo zfs set readonly=on data
It paused for a few seconds, then gave me a prompt back. It didn't spit 
out any errors. But any of my further commands, like trying a copy, froze.
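
(If that's as far as the readonly property gets me, the other route I can 
think of is exporting the pool and re-importing it read-only, something 
along the lines of:

sudo zpool export data
sudo zpool import -o readonly=on data

though with the pool I/O suspended I wouldn't be surprised if the export 
hangs too. That's just my reading of the zpool man page, not something 
I've actually tried here yet.)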

> Here are some hail mary settings that probably won't help. I offer them
> (in no particular order) to try to improve the scrub time performance,
> minimize the number of enqueued I/Os in case that is exacerbating the
> problem somehow, and attempt to limit the amount of time spent on a
> failing I/O. Your scrubs may be stopping because you have a disk that is
> exhibiting a poor failure mode. Namely, some sort of internal error where
> it just keeps retrying, which makes the pool wedge. WD is not the brand I
> go to for enterprise failure modes.

Trust me, I haven't let WD drives anywhere near my computers for quite 
some time. (Ironically enough, it was WD drives that, in the early days 
of this system, showed me how resilient ZFS mirrors were.) They were 
replaced by Seagate and Hitachi drives, or whatever they had that wasn't 
WD. I'd rather have trolls hand-chiseling the data into rocks.

> * don't spend more than 8 seconds on any single I/O
> set sd:sd_io_time=8
> * resilver in 5 second intervals minimum
> set zfs:zfs_resilver_min_time_ms = 5000
> set zfs:zfs_resilver_delay = 0
> * enqueue only 2 I/Os per top-level vdev
> set zfs:zfs_top_maxinflight = 2

Thanks, but these all fail with the "pool I/O is currently suspended" error.
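
(For the record, my understanding is that these would normally go into 
/etc/system and take effect on the next reboot, or get poked into the 
live kernel with mdb, e.g. something like:

echo "zfs_resilver_delay/W0t0" | sudo mdb -kw
echo "zfs_top_maxinflight/W0t2" | sudo mdb -kw

That's just the usual illumos tuning recipe as I understand it; I haven't 
been able to verify any of it here with the pool wedged.)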

> Apply these settings and try to resilver again. If this doesn't work, dd
> the drives to new ones. Using dd will likely identify which drive is
> wedging ZFS, as it will either not complete or it will error out.
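
(If it comes to that, I assume the copy would look something like

dd if=/dev/rdsk/c6d0p0 of=/dev/rdsk/cXdYp0 bs=1024k conv=noerror,sync

using c6d0 just as an example source and cXdY as a stand-in for whatever 
the new disk shows up as -- purely hypothetical device names -- and with 
the pool exported first so nothing else touches the disks.)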
> I am not sure who the tech is, but at least two people on this list told
> you to check the CMOS battery. I think Bob and I both recommended changing
> the battery. Others might have as well.

The tech works for the company that sold the system to me and holds the 
warranty for it. I no longer have the ability or the hardware/tools to 
change the CMOS battery myself. That, plus the system being under 
warranty... I did mention this to the guy I was dealing with, but I got 
the distinct feeling that either a) he never told the tech, or b) the 
tech didn't believe it. IIRC, when I _finally_ talked to the tech, he 
sounded quite surprised that the battery was dying and that it had 
affected the system so badly. This is one of the joys of doing this 
stuff over the phone.

I may be wrong, though. There has been so much crappola happening in the 
last year... :-(

> I reviewed your output again. You have two disks in a mirror. Each disk
> is resilvering. This means the two disks are resilvering from each
> other. There are no other options as there are no other vdevs in the pool.

I can see the new disk resilvering from the old one, but why did ZFS 
start resilvering the old disk from the new one? Shouldn't it have spat 
out a "corrupt data" error and forced me to scrub? This odd state is how 
the system booted up. I would obviously have lost some data, but after 
the scrub, the pool would at least have been functional enough to do a 
replace (and then a resilver).

> ZFS suspects both disks could be dirty,

I can go along with that. Some data on the original drive is corrupt, 
and the new disk doesn't have any data (a special form of "corrupt").

> that's why it is resilvering
> both drives. This puts the drives under heavy load. This load is
> probably surfacing an internal error on one or more drives, but because
> WD has crappy failure modes it is not sending the error to the OS.
> Internally, the drive keeps retrying with errors and the OS keeps
> waiting for the drive to return from the write flush. This is likely what
> is wedging the pool. The problem is likely on the drive -- but I can't say
> that with certainty. Certainty is a big word.

Again, no WD cow patties here.

> There is another option, which has the potential to make your data
> diverge across the two disks if you don't mount them read-only.
>
> Basically, reboot the system with just one of the vdevs installed and
> mounted read-only.

I'm going to try this route and see what I can get it to do. So far, it 
hasn't locked up on a command. There's something curious, though. When I 
try a zpool status -v, it tells me:
errors: List of errors unavailable (insufficient privileges)

It gives me that when running under my user ID, doing it via sudo, and 
even when I su to root. The *first* time I ran it (using sudo), it told 
me there were 50056 data errors. Every time I run it again, that message 
doesn't appear; it only showed up the very first time after boot-up.

> If you are in the SF bay area and want to bring it by my office I am
> happy to take a stab at it, after you back up your original disks (for
> liability reasons). I can provide a clean working "text book" system if
> needed, bring your system and drives and we can likely salvage it one
> way or another.

Thank you very much for the offer, but I'm a couple thousand miles or so 
north of you.

I have noticed one thing, though: the resilvering numbers _are_ actually 
increasing now. Since the original disk (all others disconnected) is 
actually showing a change since yesterday, I'm going to pack it in for 
the night and see where the count has gotten to by tomorrow evening. It 
says "5h11m to go", but I strongly suspect it will be longer. I'll make a 
note of where it's at right now.

Rainer

-- 
Put your makeup on and fix your hair up pretty,
And meet me tonight in Atlantic  City
			Bruce Springsteen
			


