[OpenIndiana-discuss] [zfs] ZFS High-Availability and Sync Replication

Sašo Kiselkov skiselkov.ml at gmail.com
Mon Nov 19 10:35:01 UTC 2012


On 11/18/2012 08:32 PM, Richard Elling wrote:
> more below...
> 
> On Nov 18, 2012, at 3:13 AM, Sašo Kiselkov <skiselkov.ml at gmail.com> wrote:
> 
>> On 11/17/2012 03:03 AM, Richard Elling wrote:
>>> On Nov 15, 2012, at 5:39 AM, Sašo Kiselkov <skiselkov.ml at gmail.com> wrote:
>>>
>>>> I've been lately looking around the net for high-availability and sync
>>>> replication solutions for ZFS and came up pretty dry - seems like all
>>>> the jazz is going around on Linux with corosync/pacemaker and DRBD. I
>>>> found a couple of tools, such as AVS and OHAC, but these seem rather
>>>> unmaintained, so it got me wondering what others use for ZFS clustering,
>>>> HA and sync replication. Can somebody please point me in the right
>>>> direction?
>>>
>>> Architecturally, replicating in this way is a bad idea. Past efforts to do 
>>> block-level replication suffer from one or more of:
>>> 	1. coherency != consistency
>>> 	2. performance sux without nonvolatile write caches
>>> 	3. network bandwidth is not infinite
>>> 	4. the speed of light is too slow
>>> 	5. replicating big chunks of data exacerbates #3 and #4
>>>
>>> AVS and friends worked ok for the time they were originally developed,
>>> when disks were 9, 18, or 36 GB. For a JBOD full of 4TB disks, it just isn't
>>> feasible.
>>>
>>> In the market, where you do see successes for block-level replication, the
>>> systems are constrained to avoid #2 and #5 (eg TrueCopy or SRDF).
>>>
>>> For most practical applications today, the biggest hurdle is #1, by far. 
>>> Fortunately, there are many solutions there: NoSQL, distributed databases,
>>> (HA-)IMDBs, etc.
>>>
>>> Finally, many use cases for block-level replication are solved in the metro
>>> area by choosing the right hardware and using mirrors, thus solving #1 and
>>> KISS at the same time.
>>
>> I understand that replication at the storage level is the Wrong Way(tm)
>> to do it, but I need to cover this scenario for the rare cases where the
>> application layer can't/won't do it themselves. Most specifically, I
>> need to replicate VM backing storage for VMs that can't do software
>> RAID-1 themselves (which is of course the best way).
>>
>> In any case, I'm just looking at what's available in the market now.
>> Ultimately I might go for shared storage + two heads + corosync/pacemaker
>> (I got 'em to compile on Illumos).
> 
> 
> If you are just building a HA cluster pair with shared storage, then there is
> significant work already done with RSF-1, OHAC, VCS, etc. I've looked at
> corosync in detail and it is a bit more DIY than the others. The easy part is
> getting the thing to work when a node totally fails... the hard part is that
> nodes rarely totally fail...
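
To put rough numbers on points #3-#5: even leaving consistency aside, the
volumes alone make a naive block-level resync of a modern JBOD painful. A
quick back-of-envelope sketch (the shelf size, link speeds and 80% link
utilisation are purely illustrative assumptions, not figures from our setup):

    # Back-of-envelope: hours to fully resync a JBOD over a replication link.
    # All figures below are illustrative assumptions.
    def resync_hours(capacity_tb, link_gbit, efficiency=0.8):
        payload_bytes = capacity_tb * 1e12
        bytes_per_sec = link_gbit * 1e9 / 8 * efficiency
        return payload_bytes / bytes_per_sec / 3600

    jbod_tb = 24 * 4   # hypothetical 24-bay shelf of 4 TB disks, 96 TB raw
    for gbit in (1, 10, 40):
        print("%2d Gbit/s link: ~%.0f hours" % (gbit, resync_hours(jbod_tb, gbit)))

    # roughly 267 h at 1 Gbit/s, 27 h at 10 Gbit/s, 7 h at 40 Gbit/s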

As for the cluster suites: naturally, it takes a good deal of testing and
trial and error to get one right, which is why I'm not trying to roll my own
and am looking at what others have done. Of the solutions you mention, a bit
of research has left me with the following impressions:

RSF-1:
  Pros:
    *) seems OI/Illumos aware and friendly
    *) commercial support available
  Cons:
    *) closed-source
    *) no downloadable trial version
    *) no price on website, which complicates market research

OHAC:
  Pros:
    *) open-source & free
    *) Sun project, so probably well integrated with Solaris OSes
  Cons:
    *) dead, or at least the public part of it
    *) documentation links are dead or lead to Oracle walled gardens

VCS:
  Pros:
    *) commercial support available
  Cons:
    *) deeply proprietary (down to the L2 interconnect protocol)
    *) no price on website

Of these, OHAC seems like the best bet, because we can evaluate and deploy it
freely, and back in the day it had the added reassurance of being backed by
Sun (meaning we could have bought commercial support from a reputable vendor
if need be); sadly, that is no longer the case.
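
If we do end up going the DIY corosync/pacemaker route, the glue to write is
essentially a resource agent that moves the pool between the two heads: export
(or fence) it on the failed node, force-import it on the survivor, and report
status for monitoring. A minimal sketch of those actions (the pool name is
hypothetical, the exit codes follow the usual OCF convention, and a real agent
would of course need proper fencing/STONITH before any forced import):

    #!/usr/bin/env python
    # Sketch of the start/stop/monitor actions a Pacemaker/OCF resource
    # agent would drive for a ZFS pool on shared storage. The pool name is
    # hypothetical; the other head must be fenced before 'zpool import -f',
    # or a dual import can corrupt the pool.
    import subprocess, sys

    POOL = "tank"                                  # hypothetical shared pool
    OCF_SUCCESS, OCF_ERR_GENERIC, OCF_NOT_RUNNING = 0, 1, 7

    def imported():
        # 'zpool list <pool>' exits non-zero if the pool is not imported here
        return subprocess.call(["zpool", "list", POOL]) == 0

    def start():
        if imported():
            return OCF_SUCCESS
        # -f: the pool was last imported on the other (now failed) head
        rc = subprocess.call(["zpool", "import", "-f", POOL])
        return OCF_SUCCESS if rc == 0 else OCF_ERR_GENERIC

    def stop():
        if not imported():
            return OCF_SUCCESS
        rc = subprocess.call(["zpool", "export", POOL])
        return OCF_SUCCESS if rc == 0 else OCF_ERR_GENERIC

    def monitor():
        return OCF_SUCCESS if imported() else OCF_NOT_RUNNING

    if __name__ == "__main__":
        action = sys.argv[1] if len(sys.argv) > 1 else "monitor"
        sys.exit({"start": start, "stop": stop, "monitor": monitor}[action]())

The part Richard calls hard (partial failures, split brain) lives outside a
script like this, in quorum and fencing, which is exactly where the packaged
products earn their keep.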

Cheers,
--
Saso


