[OpenIndiana-discuss] Broken zpool
jason matthews
jason at broken.net
Wed Oct 28 20:47:02 UTC 2015
Let me apologize in advance for inter-mixing comments.
On 10/27/15 7:44 PM, Rainer Heilke wrote:
>> I am not trying to be a dick (it happens naturally), but if you cant
>> afford to backup terabytes of data, then you cant afford to have
>> terabytes of data.
>
> That is a meaningless statement, that reflects nothing in real-world
> terms.
The true cost of a byte of data that you care about is the money you pay
for the initial storage, and then the money you pay to back it up. For
work, my front-line databases have 64TB of mirrored net storage. The
costs don't stop there. There is another 200TB of net storage dedicated
to holding enough log data to rebuild the last 18 months from scratch. I
also have two sets of slaves that snapshot themselves frequently. One
set is a single disk, the other is raidz. These are not just backups:
one set runs batch jobs, one runs the front-end portal, and the masters
are in charge of data ingestion.
The slaves are useful backups for zpool corruption on the front end but
not necessarily for human error. For human error, say where someone
destroys a table that replicates across all the slaves and somehow isn't
noticed until all the snapshots are deleted, we still have the logs. I
have different kinds of backups taken at different intervals to handle
different kinds of failures. Some are live, some are snapshots, and
some are source data. You need to determine your level of risk
tolerance. That might mean using zfs send/recv to two different zpools
with the same or different protection levels.
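The send/recv approach can be sketched as a small shell function. This
is an illustrative sketch, not a tested backup script; the dataset names
("tank/home", "backup/home") are hypothetical examples:

```shell
# Sketch of a snapshot-then-replicate backup using zfs send/recv.
# Dataset names are examples; substitute your own pools.
backup_dataset() {
    src=$1    # e.g. tank/home
    dst=$2    # e.g. backup/home, on the second zpool
    snap="$src@backup-$(date +%Y%m%d-%H%M%S)"
    zfs snapshot "$snap"
    # A full send the first time; once a common snapshot exists on both
    # sides, an incremental send (zfs send -i) is far cheaper.
    zfs send "$snap" | zfs recv -Fu "$dst"
}
# Usage: backup_dataset tank/home backup/home
```

Run it from cron at whatever interval matches your risk tolerance.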
If you don't back up, you set yourself up for unrecoverable problems. In
four years of running high-transaction, high-throughput databases on ZFS
I have had to rebuild pools from time to time for different reasons,
though never for corruption. I have had other problems, like unbalanced
write load across vdevs and metaslab fragmentation. My point is, don't
underestimate the cost of maintaining a byte of data. You might need
the backup one day, even with the protections that ZFS provides.
That said, instead of running mirrors, run loose disks and back up to the
second pool at a frequency you are comfortable with. You need to
prioritize your resources against your risk tolerance. It is tempting to
do mirrors because it is sexy, but that might not be the best strategy.
>> This is just good stewardship of data you want to keep.
>
> That's an arrogant statement, presuming that if a person doesn't have
> gobs of money, they shouldn't bother with computers at all.
I didn't write anything like that. What I am saying is that you need to
get more creative about how you protect your data. Yes, money makes it
easier, but you have options.
>
>> People who buy giant ass disks and then complain about how long it takes
>> to resilver a giant ass disk are out of their minds.
>
> I am not complaining about the time it takes; I know full well how
> long it can take. I am complaining that the "resilvering" stops dead.
> (More on this below.)
>
This is trickier. I don't recall you saying it stops dead. I thought it
was just "slow."
When the scrub is stopped dead, what does "iostat -nMxC 1" look like?
Are any drives showing 100% busy, or high wait or asvc_t times?
Do you have any controller errors? Does "iostat -En" report any errors?
Have you tried mounting the pool read-only, stopping the scrub, and then
copying data off?
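A compact way to gather those numbers in one pass (illumos iostat flags;
this is just the checks above bundled into one function):

```shell
# Collect the device-level evidence discussed above.
disk_health_check() {
    # Five one-second samples: watch for devices pinned near 100 %b, or
    # with wait/asvc_t far above their siblings.
    iostat -nMxC 1 5
    # Per-device soft/hard/transport error counters and model info.
    iostat -En
}
```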
Here are some Hail Mary settings that probably won't help. I offer them
(in no particular order) to try to improve scrub performance, minimize
the number of enqueued I/Os in case that is exacerbating the problem
somehow, and limit the amount of time spent on a failing I/O. Your
scrubs may be stopping because you have a disk exhibiting a poor failure
mode: some sort of internal error where the drive just keeps retrying,
which wedges the pool. WD is not the brand I go to for enterprise
failure modes.
These go in /etc/system and take effect on reboot:

* don't spend more than 8 seconds on any single I/O
set sd:sd_io_time=8
* spend at least 5 seconds per interval resilvering, with no delay
set zfs:zfs_resilver_min_time_ms = 5000
set zfs:zfs_resilver_delay = 0
* enqueue at most 2 I/Os per top-level vdev
set zfs:zfs_top_maxinflight = 2
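On a live illumos system, the ZFS variables can usually also be poked
into the running kernel with mdb -kw, saving a reboot. This is a sketch,
not something I have run on your box: the /W write syntax and the 0t
(decimal) prefix are standard mdb, but verify the variable names exist
in your kernel first (e.g. "echo zfs_resilver_delay/D | mdb -k"):

```shell
# Sketch: apply the resilver tunables to the running kernel (illumos).
# Writes kernel state; use with care.
apply_resilver_tunables() {
    echo 'zfs_resilver_min_time_ms/W 0t5000' | mdb -kw
    echo 'zfs_resilver_delay/W 0t0' | mdb -kw
    echo 'zfs_top_maxinflight/W 0t2' | mdb -kw
}
```

sd_io_time is read when devices attach, so that one still wants
/etc/system and a reboot.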
Apply these settings and try to resilver again. If this doesn't work, dd
the drives to new ones. Using dd will likely identify which drive is
wedging ZFS, as it will either not complete or it will error out.
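That dd pass might look like this. The device paths are hypothetical;
the important part is conv=noerror,sync, which keeps the copy going past
read errors (zero-padding unreadable blocks) instead of aborting:

```shell
# Clone a suspect disk block-for-block, surviving read errors.
clone_disk() {
    src=$1
    dst=$2
    # noerror: keep going after a read error; sync: pad the failed
    # block with zeros so src and dst stay offset-aligned.
    dd if="$src" of="$dst" bs=1048576 conv=noerror,sync
}
# Example (destructive to dst!):
# clone_disk /dev/rdsk/c2t0d0s0 /dev/rdsk/c2t1d0s0
```

A drive that wedges will hang or error partway through the copy; a
healthy one finishes.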
>> I have no idea what happened to your system for you to loose three disks
>> simultaneously.
>
> This was covered in a thread ages ago; the tech took days to find the
> problem, which was a CMOS battery that was on Death's door.
>
I am not sure who the tech is, but at least two people on this list told
you to check the CMOS battery. I think Bob and I both recommended
changing the battery. Others might have as well.
>> I just dont see you recovering from this scenario where you have
>> two bad drives trying to resilver from each other.
>
> They aren't trying to resilver from each other. The dead disk is gone.
> The good disk is trying to resilver from the ether. Or some such.
> (Itself?) I added a third drive to the mirror in a vain attempt to get
> past the error saying there weren't enough remaining mirrors when I
> tried to zpool detach the now non-existent drive. Again, what is IT
> trying to resilver from? The same Twilight Zone the first disk is
> trying to resilver from?
I reviewed your output again. You have two disks in a mirror. Each disk
is resilvering. This means the two disks are resilvering from each
other. There are no other options, as there are no other vdevs in the pool.
> It seems to think that the one disk is fine, but the data isn't. ZFS
> is then locking the pool's I/O, not letting me clear up the damaged
> files (nor the pool). It's like there's a trapped loop between two
> parts of the ZFS code, but I refuse to believe Cantrill (and the many
> programmers since) didn't see this kind of problem.
>
ZFS suspects both disks could be dirty; that's why it is resilvering
both drives. This puts the drives under heavy load. That load is
probably surfacing an internal error on one or more drives, but because
WD has crappy failure modes the drive is not reporting the error to the
OS. Internally, the drive keeps retrying with errors and the OS keeps
waiting for the drive to return from the write flush. This is likely
what is wedging the pool. The problem is likely on the drive -- but I
can't say that with certainty. Certainty is a big word.
There is another option, which has the potential to make your data
diverge across the two disks if you don't mount them read-only.
Basically, reboot the system with just one of the disks installed and
import the pool read-only. There will be nothing for the system to
resilver. You should be able to copy your data off to another disk (or
delete the corrupted files to restore send/recv functionality) if the
disk you chose at random is working properly. If it is not,
wash-rinse-repeat with the other disk. Hopefully that one will work. If
neither works, try changing cables or controllers, although there is a
chance that both WD drives have failed. I have had bad luck with WD.
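The one-disk read-only path might be sketched like this. The pool name
"tank" and the default /tank mountpoint are example assumptions; the key
piece is importing with readonly=on so no resilver starts and nothing
can diverge the two disks:

```shell
# Import a pool read-only with only one mirror side attached, copy the
# data off, and export cleanly before trying the other disk.
recover_from_one_disk() {
    pool=$1
    dest=$2
    # readonly=on: no writes, so no resilver and no divergence.
    zpool import -o readonly=on "$pool"
    # Copy everything off while nothing can be written back.
    rsync -a "/$pool/" "$dest/"
    zpool export "$pool"
}
# Usage: recover_from_one_disk tank /backup/salvage
```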
If you are in the SF Bay Area and want to bring it by my office, I am
happy to take a stab at it, after you back up your original disks (for
liability reasons). I can provide a clean, working "textbook" system if
needed; bring your system and drives and we can likely salvage it one
way or another.
j.