[OpenIndiana-discuss] Broken zpool

Rainer Heilke rheilke at dragonhearth.com
Thu Oct 29 19:17:47 UTC 2015


So, last night, the resilvering was still running on the only (original) 
drive in the zpool. When I checked in at 5:00 AM, it was finished. 
(There were "too many errors" on the disk, and the status said to run a 
zpool clear. I'm thinking now that I should have done a dd right away, 
but, foggy with sleep, I started the zpool clear instead.) It is still 
running, and the HDD activity blinkenlight is on steady. I will take a 
look when it finishes (an ls in another terminal window doesn't return).
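
For the record, the dd I have in mind would be something along these 
lines, assuming c6d0 is the original data disk (as the iostat output 
further down suggests), with a purely hypothetical output path:

    # image the whole raw disk; keep going past bad sectors,
    # zero-padding anything unreadable
    dd if=/dev/rdsk/c6d0p0 of=/backup/c6d0.img bs=1024k conv=noerror,sync

conv=noerror,sync is the important part: without it, dd aborts at the 
first unreadable block instead of salvaging the rest.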

Fingers crossed, sacrifices to the computer gods, and several "Hail 
Cthulhus."

Rainer

On 28/10/2015 9:55 PM, Rainer Heilke wrote:
>
> On 28/10/2015 1:47 PM, jason matthews wrote:
>>
>>
>> Let me apologize in advance for inter-mixing comments.
> Ditto.
>
>>>> I am not trying to be a dick (it happens naturally), but if you can't
>>>> afford to back up terabytes of data, then you can't afford to have
>>>> terabytes of data.
>>>
>>> That is a meaningless statement that reflects nothing in real-world
>>> terms.
>> The true cost of a byte of data that you care about is the money you pay
>> for the initial storage, and then the money you pay to back it up. For
>> work, my front line databases have 64TB of mirrored net storage.
>
> When you said "you," it implied (to me, at least) a home system, since
> we're talking about a home system from the start. Certainly, if it is a
> system that a company uses for its data, all of what you say is correct.
> But a company, regardless of size, can write these expenses off.
> Individuals cannot do that with their home systems. For them, this
> paradigm is much more vague if it exists at all.
>
> So, while I was talking apples, you were talking parsnips. My apologies
> for not making that clearer. (All of that said, the DVD drive has been
> acting up. Perhaps a writable Blu-Ray is in the wind. Since the price of
> them has dropped further than the price of oil, that may make backups of
> the more important data possible.)
>
>> The
>> costs don't stop there. There is another 200TB of net storage dedicated
>> to holding enough log data to rebuild the last 18 months from scratch. I
>> also have two sets of slaves that snapshot themselves frequently. One
>> set is a single disk, the other is raidz. These are not just backups.
>> One set runs batch jobs, one runs the front end portal, and the masters
>> are in charge of data ingestion.
>
> Don't forget the costs added on by off-site storage, etc. I don't care
> how many times the data is backed up, if it's all in the same building
> that just burned to the ground... That is, unless your zfs sends are
> going to a different site...
>
>> If you don't back up, you set yourself up for unrecoverable problems. In
> <snip>
> I believe this may be the first time (for me) that simply replacing a
> failed drive resulted in data corruption in a zpool. I've certainly
> never seen this level of mess before.
>
>> That said, instead of running mirrors, run loose disks and back up to the
>> second pool at a frequency you are comfortable with. You need to
>> prioritize your resources against your risk tolerance. It is tempting to
>> do mirrors because it is sexy, but that might not be the best strategy.
>
> That is something for me to think about. (I don't do *anything* on
> computers because it's "sexy"; I did mirrors for security. Remember,
> they hadn't failed for me at such a monumental level previously.)
>
>>> That's an arrogant statement, presuming that if a person doesn't have
>>> gobs of money, they shouldn't bother with computers at all.
>> I didn't write anything like that. What I am saying is you need to get
>> more creative about how to protect your data. Yes, money makes it easier,
>> but you have options.
>
> My apologies; on its own, it came across that way.
>
>>> I am not complaining about the time it takes; I know full well how
>>> long it can take. I am complaining that the "resilvering" stops dead.
>>> (More on this below.)
>
>> When the scrub is stopped dead, what does "iostat -nMxC 1" look like?
>> Are there drives indicating 100% busy? High wait or asvc_t times?
>
>   sudo iostat -nMxC  1
>                      extended device statistics
>      r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
>     23.3   55.2    0.6    0.3  0.2  0.3    2.1    4.5   5  27 c3d0
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.2   0   0 c3d1
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    5.5   0   0 c6d1
>    360.1   13.1   29.0    0.1  1.3  1.5    3.4    4.0  48  82 c6d0
>      9.7  330.9    0.0   29.1  0.1  0.6    0.3    1.6   9  52 c7d1
>    359.9  354.6   28.3   28.5 30.2  3.4   42.2    4.7  85  85 data
>     23.2   34.9    0.6    0.3  6.2  0.3  106.9    5.6   6  12 rpool
>                      extended device statistics
>      r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
>      0.0  112.1    0.0    0.3  0.0  0.4    0.0    4.0   0  45 c3d0
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3d1
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c6d1
>     71.0   10.0    2.3    0.0  1.6  1.1   19.8   14.0  54  60 c6d0
>     40.0   44.0    0.1    2.2  0.2  1.1    1.8   12.8  12  83 c7d1
>    111.1   58.0    2.4    2.2 18.9  3.5  112.0   20.6  54  54 data
>      0.0   58.0    0.0    0.3  0.0  0.0    0.0    0.6   0   3 rpool
>                      extended device statistics
>      r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
>      0.0  187.1    0.0    0.7  0.0  0.8    0.0    4.1   0  74 c3d0
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3d1
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c6d1
>    403.1    0.0   32.9    0.0  1.2  1.8    2.9    4.6  53  97 c6d0
>     12.0  386.1    0.0   32.9  0.2  0.5    0.4    1.3  12  44 c7d1
>    415.1  386.1   33.0   32.9 27.6  3.9   34.5    4.9 100 100 data
>      0.0   98.0    0.0    0.7  0.0  0.1    0.0    1.1   0   8 rpool
>                      extended device statistics
>      r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
>      0.0   60.0    0.0    0.7  0.0  0.1    0.1    1.8   0  11 c3d0
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3d1
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c6d1
>    399.9    0.0   33.4    0.0  0.7  1.8    1.8    4.6  39  97 c6d0
>      0.0  401.9    0.0   33.2  0.1  0.4    0.2    1.0   7  40 c7d1
>    399.9  401.9   33.4   33.2 27.3  3.2   34.0    4.0 100 100 data
>      0.0   58.0    0.0    0.7  0.4  0.0    7.1    0.6   3   3 rpool
>                      extended device statistics
>      r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3d0
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3d1
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c6d1
>    381.0    0.0   32.3    0.0  0.9  1.8    2.3    4.8  44  96 c6d0
>      0.0  384.0    0.0   31.8  0.1  0.4    0.2    1.1   6  42 c7d1
>    381.0  384.0   32.3   31.8 26.6  3.3   34.8    4.4 100 100 data
>      0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 rpool
>
> So, something IS actually happening, it would seem.
>
>> Do you have any controller errors? Does iostat -En report any errors?
>
> sudo iostat -En
> c3d0             Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
> Model: ST3500514NS     Revision:  Serial No:             9WJ Size:
> 500.10GB <500101152768 bytes>
> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> Illegal Request: 0
> c3d1             Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
> Model: ST3000NC000     Revision:  Serial No:             Z1F Size:
> 3000.59GB <3000590401536 bytes>
> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> Illegal Request: 0
> c6d1             Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
> Model: ST4000DM000-1F2 Revision:  Serial No:             S30 Size:
> 4000.74GB <4000743161856 bytes>
> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> Illegal Request: 0
> c6d0             Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
> Model: ST32000542AS    Revision:  Serial No:             5XW Size:
> 2000.37GB <2000371580928 bytes>
> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> Illegal Request: 0
> c7d1             Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
> Model: ST3000DM001-1ER Revision:  Serial No:             Z7P Size:
> 3000.59GB <3000590401536 bytes>
> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> Illegal Request: 0
> c4t0d0           Soft Errors: 0 Hard Errors: 6 Transport Errors: 0
> Vendor: TSSTcorp Product: CDDVDW SH-S222A  Revision: SB02 Serial No:
> Size: 0.00GB <0 bytes>
> Media Error: 0 Device Not Ready: 6 No Device: 0 Recoverable: 0
> Illegal Request: 0 Predictive Failure Analysis: 0
>
>
>> Have you tried mounting the pool ro, stopping the scrub, and then
>> copying data off?
>
> It *seems* to be ro now, but I can't be sure. I did:
> sudo zfs set readonly=on data
> It paused for a few seconds, then gave me a prompt back. It didn't spit
> out any errors, but any of my further commands, like trying a copy, froze.
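>
> If I can ever get the pool exported, I gather a read-only import at the
> pool level is the more thorough route than flipping the dataset property
> after the fact; a sketch, untested here while the pool is wedged:
>
>    zpool export data
>    zpool import -o readonly=on data
>
> That should keep ZFS from writing anything at all while I copy data off.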
>
>> Here are some hail mary settings that probably won't help. I offer them
>> (in no particular order) to try to improve scrub time performance,
>> minimize the number of enqueued I/Os in case that is exacerbating the
>> problem somehow, and limit the amount of time spent on a failing I/O.
>> Your scrubs may be stopping because you have a disk that is exhibiting
>> a poor failure mode. Namely, some sort of internal error where it just
>> keeps retrying, which makes the pool wedge. WD is not the brand I
>> go to for enterprise failure modes.
>
> Trust me, I haven't let WD drives anywhere near my computers for quite
> some time. (Ironically enough, it was WD drives that, in the early days
> of this system, showed me how resilient zfs mirrors were.) They were
> replaced by Seagate drives, Hitachi drives, or whatever they had that
> wasn't WD. I'd rather have trolls hand-chiseling the data into rocks.
>
>> * don't spend more than 8 seconds on any single I/O
>> set sd:sd_io_time=8
>> * spend at least 5 seconds per interval resilvering, with no delay
>> set zfs:zfs_resilver_min_time_ms = 5000
>> set zfs:zfs_resilver_delay = 0
>> * allow only 2 scrub/resilver I/Os in flight per top-level vdev
>> set zfs:zfs_top_maxinflight = 2
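>>
>> (These are /etc/system settings, picked up at boot; to experiment on
>> the live kernel instead, the zfs ones can be poked in with mdb -kw,
>> along the lines of:
>>
>>    echo "zfs_resilver_delay/W 0" | mdb -kw
>>
>> assuming the variable names match your build.)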
>
> Thanks, but these all fail with the "pool I/O is currently suspended"
> error.
>
>> Apply these settings and try to resilver again. If this doesn't work, dd
>> the drives to new ones. Using dd will likely identify which drive is
>> wedging ZFS, as it will either not complete or it will error out.
>>>
>> I am not sure who the tech is, but at least two people on this list told
>> you to check the CMOS battery. I think Bob and I both recommended changing
>> the battery. Others might have as well.
>
> The tech works for the company that sold the system to me, and holds the
> warranty for it. I no longer have the ability or hardware/tools to
> change the CMOS battery myself. That, plus the system being under
> warranty... I did mention this to the guy I was dealing with, but I got
> the distinct feeling that either: a) he never told the tech, or b) the
> tech didn't believe it. IIRC, when I _finally_ talked to the tech, he
> sounded quite surprised that the battery was dying, and that it
> affected the system so badly. This is one of the joys of doing this
> stuff over the phone.
>
> I may be wrong, though. There has been so much crappola happening in the
> last year... :-(
>
>> I reviewed your output again. You have two disks in a mirror. Each disk
>> is resilvering. This means the two disks are resilvering from each
>> other. There are no other options as there are no other vdevs in the
>> pool.
>
> I can see the new disk resilvering from the old one, but why did ZFS
> start resilvering the old disk from the new one? Shouldn't it have spat
> out a "corrupt data" error and forced me to scrub? This odd state is how
> the system booted up. I would obviously have lost some data, but after
> the scrub, the pool would at least have been functional enough to do a
> replace (and then a resilver).
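>
> (Roughly the sequence I expected, with placeholder device names since I
> can't swear which pair ZFS had in this mirror:
>
>    zpool scrub data
>    zpool replace data <failed-disk> <new-disk>
>
> with the resilver then running one way only, from the surviving disk to
> the new one.)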
>
>> ZFS suspects both disks could be dirty,
>
> I can go along with that. Some data on the original drive is corrupt,
> and the new disk doesn't have any data (a special form of "corrupt").
>
>> that's why it is resilvering
>> both drives. This puts the drives under heavy load. This load is
>> probably surfacing an internal error on one or more drives, but because
>> WD has crappy failure modes it is not sending the error to the OS.
>> Internally, the drive keeps retrying with errors and the OS keeps
>> waiting for the drive to return from the write flush. This is likely
>> what is wedging the pool. The problem is likely on the drive -- but I
>> can't say that with certainty. Certainty is a big word.
>
> Again, no WD cow patties.
>
>> There is another option, which has the potential to make your data
>> diverge across the two disks if you don't mount them read-only.
>>
>> Basically, reboot the system with just one of the vdevs installed and
>> mounted read-only.
>
> I'm going to try this route, and see what I can get it to do. So far, it
> hasn't locked up on a command. There's something curious, though. When I
> try a zpool status -v, it tells me:
> errors: List of errors unavailable (insufficient privileges)
>
> It gives me that when running under my user ID, doing it via sudo, and
> even when I su to root. The *first* time I ran it was using sudo, and it
> told me there were 50056 data errors. Every time I run it again, that
> count is not given; it appears only the very first time after the boot-up.
>
>> If you are in the SF bay area and want to bring it by my office I am
>> happy to take a stab at it, after you back up your original disks (for
>> liability reasons). I can provide a clean working "text book" system if
>> needed; bring your system and drives and we can likely salvage it one
>> way or another.
>
> Thank you very much for the offer, but I'm a couple thousand miles or so
> north of you.
>
> I have noticed one thing, though: the resilvering numbers _are_ actually
> increasing now. Since the original disk (all others disconnected) is
> actually showing a change since yesterday, I'm going to pack it in for
> the night and see where the count has gotten to tomorrow evening. It
> says "5h11m to go", but I strongly suspect it will be longer. I'll make
> a note of where it's at right now.
>
> Rainer
>

-- 
Put your makeup on and fix your hair up pretty,
And meet me tonight in Atlantic City
			Bruce Springsteen
			


