[OpenIndiana-discuss] Broken zpool
jason matthews
jason at broken.net
Wed Oct 28 20:47:02 UTC 2015
Let me apologize in advance for inter-mixing comments.
On 10/27/15 7:44 PM, Rainer Heilke wrote:
>> I am not trying to be a dick (it happens naturally), but if you cant
>> afford to backup terabytes of data, then you cant afford to have
>> terabytes of data.
>
> That is a meaningless statement, that reflects nothing in real-world
> terms.
The true cost of a byte of data that you care about is the money you pay
for the initial storage, and then the money you pay to back it up. For
work, my front-line databases have 64TB of mirrored net storage. The
costs don't stop there. There is another 200TB of net storage dedicated
to holding enough log data to rebuild the last 18 months from scratch. I
also have two sets of slaves that snapshot themselves frequently. One
set is a single disk, the other is raidz. These are not just backups:
one set runs batch jobs, one runs the front-end portal, and the masters
are in charge of data ingestion.
The slaves are useful backups for zpool corruption on the front end but
not necessarily for human error. For human error, say where someone
destroys a table that replicates across all the slaves and somehow isn't
noticed until all the snapshots are deleted, we still have the logs. I
have different kinds of backups taken at different intervals to handle
different kinds of failures. Some are live, some are snapshots, and
some are source data. You need to determine your level of risk
tolerance. That might mean using zfs send/recv to two different zpools
with the same or different protection levels.
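The send/recv approach can be sketched as a small shell function. This
is an illustrative sketch, not a tested backup script; the dataset names
("tank/home", "backup/home") are hypothetical examples:

```shell
# Sketch of a snapshot-then-replicate backup using zfs send/recv.
# Dataset names are examples; substitute your own pools.
backup_dataset() {
    src=$1    # e.g. tank/home
    dst=$2    # e.g. backup/home, on the second zpool
    snap="$src@backup-$(date +%Y%m%d-%H%M%S)"
    zfs snapshot "$snap"
    # A full send the first time; once a common snapshot exists on both
    # sides, an incremental send (zfs send -i) is far cheaper.
    zfs send "$snap" | zfs recv -Fu "$dst"
}
# Usage: backup_dataset tank/home backup/home
```

Run it from cron at whatever interval matches your risk tolerance.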
If you don't back up, you set yourself up for unrecoverable problems. In
four years of running high-transaction, high-throughput databases on ZFS
I have had to rebuild pools from time to time for different reasons,
though never for corruption. I have had other problems, like unbalanced
write load across vdevs and metaslab fragmentation. My point is, don't
underestimate the cost of maintaining a byte of data. You might need
the backup one day, even with the protections that ZFS provides.
That said, instead of running mirrors, run loose disks and back up to the
second pool at a frequency you are comfortable with. You need to
prioritize your resources against your risk tolerance. It is tempting to
do mirrors because it is sexy, but that might not be the best strategy.
>> This is just good stewardship of data you want to keep.
>
> That's an arrogant statement, presuming that if a person doesn't have
> gobs of money, they shouldn't bother with computers at all.
I didn't write anything like that. What I am saying is that you need to
get more creative about how you protect your data. Yes, money makes it
easier, but you have options.
>
>> People who buy giant ass disks and then complain about how long it takes
>> to resilver a giant ass disk are out of their minds.
>
> I am not complaining about the time it takes; I know full well how
> long it can take. I am complaining that the "resilvering" stops dead.
> (More on this below.)
>
This is trickier. I don't recall you saying it stops dead. I thought it
was just "slow."
When the scrub is stopped dead, what does "iostat -nMxC 1" look like?
Are any drives showing 100% busy, or high wait or asvc_t times?
Do you have any controller errors? Does "iostat -En" report any errors?
Have you tried mounting the pool read-only, stopping the scrub, and then
copying data off?
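A compact way to gather those numbers in one pass (illumos iostat flags;
this is just the checks above bundled into one function):

```shell
# Collect the device-level evidence discussed above.
disk_health_check() {
    # Five one-second samples: watch for devices pinned near 100 %b, or
    # with wait/asvc_t far above their siblings.
    iostat -nMxC 1 5
    # Per-device soft/hard/transport error counters and model info.
    iostat -En
}
```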
Here are some Hail Mary settings that probably won't help. I offer them
(in no particular order) to try to improve scrub performance, minimize
the number of enqueued I/Os in case that is exacerbating the problem
somehow, and limit the amount of time spent on a failing I/O. Your
scrubs may be stopping because you have a disk exhibiting a poor failure
mode: some sort of internal error where the drive just keeps retrying,
which wedges the pool. WD is not the brand I go to for enterprise
failure modes.
These go in /etc/system and take effect on reboot:

* don't spend more than 8 seconds on any single I/O
set sd:sd_io_time=8
* spend at least 5 seconds per interval resilvering, with no delay
set zfs:zfs_resilver_min_time_ms = 5000
set zfs:zfs_resilver_delay = 0
* enqueue at most 2 I/Os per top-level vdev
set zfs:zfs_top_maxinflight = 2
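On a live illumos system, the ZFS variables can usually also be poked
into the running kernel with mdb -kw, saving a reboot. This is a sketch,
not something I have run on your box: the /W write syntax and the 0t
(decimal) prefix are standard mdb, but verify the variable names exist
in your kernel first (e.g. "echo zfs_resilver_delay/D | mdb -k"):

```shell
# Sketch: apply the resilver tunables to the running kernel (illumos).
# Writes kernel state; use with care.
apply_resilver_tunables() {
    echo 'zfs_resilver_min_time_ms/W 0t5000' | mdb -kw
    echo 'zfs_resilver_delay/W 0t0' | mdb -kw
    echo 'zfs_top_maxinflight/W 0t2' | mdb -kw
}
```

sd_io_time is read when devices attach, so that one still wants
/etc/system and a reboot.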
Apply these settings and try to resilver again. If this doesn't work, dd
the drives to new ones. Using dd will likely identify which drive is
wedging ZFS, as it will either not complete or it will error out.
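That dd pass might look like this. The device paths are hypothetical;
the important part is conv=noerror,sync, which keeps the copy going past
read errors (zero-padding unreadable blocks) instead of aborting:

```shell
# Clone a suspect disk block-for-block, surviving read errors.
clone_disk() {
    src=$1
    dst=$2
    # noerror: keep going after a read error; sync: pad the failed
    # block with zeros so src and dst stay offset-aligned.
    dd if="$src" of="$dst" bs=1048576 conv=noerror,sync
}
# Example (destructive to dst!):
# clone_disk /dev/rdsk/c2t0d0s0 /dev/rdsk/c2t1d0s0
```

A drive that wedges will hang or error partway through the copy; a
healthy one finishes.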
>> I have no idea what happened to your system for you to loose three disks
>> simultaneously.
>
> This was covered in a thread ages ago; the tech took days to find the
> problem, which was a CMOS battery that was on Death's door.
>
I am not sure who the tech is, but at least two people on this list told
you to check the CMOS battery. I think Bob and I both recommended
changing the battery. Others might have as well.
>> I just dont see you recovering from this scenario where you have
>> two bad drives trying to resilver from each other.
>
> They aren't trying to resilver from each other. The dead disk is gone.
> The good disk is trying to resilver from the ether. Or some such.
> (Itself?) I added a third drive to the mirror in a vain attempt to get
> past the error saying there weren't enough remaining mirrors when I
> tried to zpool detach the now non-existent drive. Again, what is IT
> trying to resilver from? The same Twilight Zone the first disk is
> trying to resilver from?
I reviewed your output again. You have two disks in a mirror. Each disk
is resilvering. This means the two disks are resilvering from each
other. There are no other options, as there are no other vdevs in the pool.
> It seems to think that the one disk is fine, but the data isn't. ZFS
> is then locking the pool's I/O, not letting me clear up the damaged
> files (nor the pool). It's like there's a trapped loop between two
> parts of the ZFS code, but I refuse to believe Cantrill (and the many
> programmers since) didn't see this kind of problem.
>
ZFS suspects both disks could be dirty; that's why it is resilvering
both drives. This puts the drives under heavy load. That load is
probably surfacing an internal error on one or more drives, but because
WD has crappy failure modes the drive is not reporting the error to the
OS. Internally, the drive keeps retrying with errors and the OS keeps
waiting for the drive to return from the write flush. This is likely
what is wedging the pool. The problem is likely on the drive -- but I
can't say that with certainty. Certainty is a big word.
There is another option, which has the potential to make your data
diverge across the two disks if you don't mount them read-only.
Basically, reboot the system with just one of the disks installed and
import the pool read-only. There will be nothing for the system to
resilver. You should be able to copy your data off to another disk (or
delete the corrupted files to restore send/recv functionality) if the
disk you chose at random is working properly. If it is not,
wash-rinse-repeat with the other disk. Hopefully that one will work. If
neither works, try changing cables or controllers, although there is a
chance that both WD drives have failed. I have had bad luck with WD.
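The one-disk read-only path might be sketched like this. The pool name
"tank" and the default /tank mountpoint are example assumptions; the key
piece is importing with readonly=on so no resilver starts and nothing
can diverge the two disks:

```shell
# Import a pool read-only with only one mirror side attached, copy the
# data off, and export cleanly before trying the other disk.
recover_from_one_disk() {
    pool=$1
    dest=$2
    # readonly=on: no writes, so no resilver and no divergence.
    zpool import -o readonly=on "$pool"
    # Copy everything off while nothing can be written back.
    rsync -a "/$pool/" "$dest/"
    zpool export "$pool"
}
# Usage: recover_from_one_disk tank /backup/salvage
```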
If you are in the SF Bay Area and want to bring it by my office, I am
happy to take a stab at it, after you back up your original disks (for
liability reasons). I can provide a clean, working "textbook" system if
needed; bring your system and drives and we can likely salvage it one
way or another.
j.