[OpenIndiana-discuss] Split-root installations

Jim Klimov jimklimov at cos.ru
Fri Nov 29 13:53:26 UTC 2013


Hello again,

Your arguments are quite interesting and valid, and it took me quite a
while to think of proper answers to them. Sorry in advance for a long post...

On 2013-11-28 01:22, James Carlson wrote:
> Life's way too short to be optimizing a $50 component.

$50 times the number of deployments... and, until recently (before these
network/physical services came up), I believed that my part of the work
was mostly done, and its results (the scripted workaround and docs) are
now available to the community for free anyhow.

I may also be biased in pursuing this, because most of the job is done
and can be used. Discarding it would be sad, and at least in my own
deployments I have no reason not to continue setting things up like this.

But, thanks to this discussion and the extra digging it prompted, I see
some use-cases where split setups can break (for networked or non-ZFS
boots). Mostly these would break regardless of my fixes, however.
And, somewhere in the post below, I discuss approaches to a more-or-less
generic fix for the situation. I don't know how to test those fixes,
and I don't know if the configurations involved are used by anyone
nowadays with modern illumos or Solaris-based systems. Details below...


> For bulk data, I can get storage that's pennies per GB, and that does
> the job.
>
> As for L2ARC, I wouldn't put that on the same device as the system
> boot.  I'd rather run without an L2ARC if that's the choice.  Is that
> what you're doing?

Also a valid point, for big systems that can have many disks and whole
servers in different roles. How long has it been since you used or built
a consumerish system, with a limited number of disk bays (say, 4 to 6)
and numerous economic constraints? Where the buyer is worried about
paying a premium (ZFS systems are quite over-spec'ed and expensive
compared to the truly consumerish garbage-NAS sold today, and it is often
difficult to preach the benefits they would get for the extra money)?

Would you argue that illumos is only for big machines and datacenters
(yes, these applications of the technology pay for most of the
development now), or is it also "officially" applicable to small
setups - SOHO NAS, laptops, VM labs, etc.?

Again, this is not a business user who thinks in terms of ROI or support
or stuff like that. Real people need a workhorse which would work as
long as it can, be as cheap as it reasonably can, and every penny
invested should work too :)

As a recent example, I can suggest the HP N36L/N40L/N54L lineup, where
the 5.25" bay can be converted to a 4-to-6 x 2.5"-disk cradle, but only
two slots are likely to get populated, at least initially, with whopping
$100-a-piece 100GB SSDs (or maybe even more whopping $300-a-piece
for a DC3700 with a much higher reliability rating), for all the
mirrored rpool/zil/l2arc usage of this box. With an overall budget
around $2000, which some consider too big already, the $100 or $300
component is a considerable part of the price.

On the other hand, the workload profile is also different from that of
the bigger systems. There is relatively little contention between the
rpool usage and L2ARC or even ZIL usage on the same SSD, simply because
a SOHO user usually can't generate enough parallel streams of I/O; and
users with large home datacenters are often technically literate and
can build their own NAS better tailored to their job, and even if it
is more expensive - they have sold themselves on what they are
over-paying for, and are content to build something more correct.

Correct, cheap, performant - pick two? Heh, any triangle can be made
like this :)

> I guess we've got different optimization points at work.  Even at
> (say) $5/GB, I'm not interested in optimization if it means higher
> complexity that will undoubtedly result in a higher rate of failures,
> and higher cost of other goods -- all the extra testing and
> engineering required to support it won't be "free."

That is a valid point... Even if we bluntly assume that my work
to support these setups has already been completed successfully, keeping
it that way would indeed be an effort. I am not sure how much of it
would be human effort, however - much of the testing is done by automated
software systems, and spawning a new tested release installed on a
monolithic root or a split root would not be a huge difference. At least,
turning one into the other is simply a scripted sequence detailed in
the Wiki.
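
For illustration only, here is a hypothetical outline of such a
conversion (the authoritative step-by-step sequence is the one in the
Wiki; the pool and BE names are just examples):

# Outline only - see the Wiki for the real procedure and its caveats.
BE=rpool/ROOT/oi_151a8
zfs create -o mountpoint=legacy -o compression=gzip-9 $BE/usr
mount -F zfs $BE/usr /mnt
cd /usr && find . -depth -print | cpio -pdm /mnt
cd / && umount /mnt
# ...then reference $BE/usr for /usr in /etc/vfstab, rebuild the boot
# archive (bootadm update-archive) and reboot into the new layout.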

I might speculate further that it would be sufficient to test only
the split-root setup, because if things do work with it, the monolithic
root should (cough, almost?) certainly work, as a degenerate subset of
this local-storage split root :)

I won't claim that split roots cause no extra failures, because there
are some, mostly due to later disregard for the possibility of such
setups; untangling them with workarounds and/or proposed proper
solutions is what I discussed in this thread and on the Wiki page.
Luckily, so far I know of just a few problematic points, all of which
can be solved (and most have been already).

>> Maybe I chimed in early enough, that the system did not decay very
>> far from the state where it supported separate /usr filesystems,
>> and there are only a few exceptional cases, and there is still
>> support for it in the "trunk" codebase.
>
> I was referring to the work we were doing at Sun when I was there,
> from 2000 to 2009.

Well, the bulldozer of progress did not tear everything down,
and the ruins which remained were sufficient for my work in this
area for all those years :)

It may have been a lucky coincidence that I stayed within some
particular constraints - only local filesystems (children of the
bootfs, referenced by mountpoint or as legacy via /etc/vfstab) and
network/physical:default with legacy file-based setup instead of
NWAM or ipadm - but things just worked on a wide range of sol10
and later releases with ZFS root support.
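
For example, such a legacy /usr reference in /etc/vfstab could look like
this (the dataset name is illustrative):

#device to mount         device to fsck  mount point  FS type  fsck pass  mount at boot  mount options
rpool/ROOT/oi_151a8/usr  -               /usr         zfs      -          yes            -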

> You're right that if /usr is local, then there's not much to do before
> it mounts.  Rather than moving random stuff into root or rewriting a
> broken service into ksh93-isms, I think it makes a lot more sense to
> look at why the service is starting too early and putting an end to
> that.

As far as I have unearthed so far, OpenIndiana currently (oi_151a8)
delivers the following services and dependencies (excerpt):

1) The OS initialization passes through a "single-user" milestone
which makes sure all the basic filesystems have been mounted and
networking has started:

# svcs -d svc:/milestone/single-user:default
STATE          STIME    FMRI
online          0:30:27 svc:/network/loopback:default
online          0:30:31 svc:/system/identity:node
online          0:30:32 svc:/system/metainit:default
online          0:30:38 svc:/system/cryptosvc:default
online          0:30:49 svc:/system/keymap:default
online          0:30:49 svc:/system/filesystem/minimal:default
online          0:30:52 svc:/system/sysevent:default
online          0:30:54 svc:/milestone/devices:default
online          0:30:59 svc:/system/manifest-import:default
online         22:47:52 svc:/milestone/network:default

2) In particular, the networking milestone depends on either of the
two implementations of "physical" (legacy or NWAM) and on ipfilter
(I snipped several disabled ipsec services for brevity).

# svcs -d svc:/milestone/network:default
STATE          STIME    FMRI
disabled       Nov_27   svc:/network/ipfilter:default
disabled       Nov_27   svc:/network/physical:default
online         Nov_27   svc:/network/physical:nwam
online         Nov_27   svc:/network/loopback:default

The ipsec/* and ipfilter services depend on filesystem/usr or on
filesystem/minimal. However, the physical services do not depend
on filesystem services at all:

# svcs -d svc:/network/physical:default
STATE          STIME    FMRI
disabled       Nov_27   svc:/network/install:default
online         Nov_27   svc:/network/datalink-management:default
online         Nov_27   svc:/network/ip-interface-management:default
online         Nov_27   svc:/network/loopback:default
# svcs -d svc:/network/physical:nwam
STATE          STIME    FMRI
online         Nov_27   svc:/network/datalink-management:default
online         Nov_27   svc:/network/netcfg:default
online         Nov_27   svc:/network/ip-interface-management:default
online         Nov_27   svc:/network/loopback:default

# svcs -d svc:/network/loopback:default
STATE          STIME    FMRI
online         Nov_27   svc:/network/ip-interface-management:default
# svcs -d svc:/network/ip-interface-management:default
STATE          STIME    FMRI
# svcs -d svc:/network/datalink-management:default
STATE          STIME    FMRI

# svcs -d svc:/network/netcfg:default
STATE          STIME    FMRI
# svcs -d svc:/network/install:default
STATE          STIME    FMRI
online         Nov_27   svc:/network/ip-interface-management:default

Also among the "problematic" ones is the IP tunneling service, which is
an end-point in the graph, with similar dependencies:

# svcs -d iptun
STATE          STIME    FMRI
disabled       Nov_27   svc:/network/physical:default
disabled       Nov_27   svc:/network/ipsec/policy:default
online         Nov_27   svc:/network/ip-interface-management:default
online         Nov_27   svc:/network/physical:nwam
# svcs -D iptun
STATE          STIME    FMRI

Thus, the networking services rely on the rootfs provided by the
kernel being sufficient for network init. Indeed, this works when
/usr is part of the rootfs, and works if ksh93 is moved into the rootfs
and builtin commands are used (as per my patches for physical:default
and iptun). This does not work well for the NWAM script, because it
uses a lot of stuff that only becomes available with filesystem/minimal
(the presence of /var is not guaranteed before that, for example).
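
As a rough illustration of that approach (not the actual patched method
scripts), a fragment could enable the ksh93 libcmd builtins so that no
utilities are forked from /usr; the interpreter path and the exact set
of available builtins are assumptions here:

#!/sbin/ksh93
# Sketch only: assumes ksh93 was copied into the rootfs (e.g. /sbin).
# Enable libcmd builtins so these names need no /usr binaries; the
# exact list depends on how ksh93 was built.
builtin basename cat dirname head 2>/dev/null
# With the builtins in place, early parsing needs only the rootfs:
[[ -s /etc/nodename ]] && nodename=$(cat /etc/nodename)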

3) Conversely, the filesystem services do depend on one of the
networking subsystems completing. This makes sense for systems that
mount parts of their operating environment from the network; and in
this case, to be pedantic, I don't understand why iptun was excluded
from the dependency graph - what if the system is booted over a VPN? ;)

So, here is a pretty straightforward dependency chain:

# svcs -d svc:/system/filesystem/minimal:default
STATE          STIME    FMRI
online         Nov_27   svc:/system/filesystem/usr:default
online         Nov_27   svc:/system/device/local:default
# svcs -d svc:/system/filesystem/usr:default
STATE          STIME    FMRI
online         Nov_27   svc:/system/scheduler:default
online         Nov_27   svc:/system/boot-archive:default
# svcs -d svc:/system/boot-archive:default
STATE          STIME    FMRI
online         Nov_27   svc:/system/filesystem/root:default
# svcs -d svc:/system/scheduler:default
STATE          STIME    FMRI
online         Nov_27   svc:/system/filesystem/root:default
# svcs -d svc:/system/filesystem/root:default
STATE          STIME    FMRI
disabled       Nov_27   svc:/system/device/mpxio-upgrade:default
online         Nov_27   svc:/system/metainit:default
# svcs -d svc:/system/device/mpxio-upgrade:default
STATE          STIME    FMRI
online         Nov_27   svc:/system/metainit:default
# svcs -d svc:/system/metainit:default
STATE          STIME    FMRI
online         Nov_27   svc:/system/identity:node
# svcs -d svc:/system/identity:node
STATE          STIME    FMRI
disabled       Nov_27   svc:/network/physical:default
online         Nov_27   svc:/network/loopback:default
online         Nov_27   svc:/network/physical:nwam

I might even explain its rationale to myself based on the root
of the tree: the /lib/svc/method/identity-node script determines
the hostname from DHCP, RARP or /etc/nodename, and sets it for
the running operating environment. The first two methods might
indeed depend on an already-working LAN configuration, so here the
SMF dependency on network/physical is not completely invalid.

Then /lib/svc/method/svc-metainit starts up; it is needed for the
(nowadays rare) case that SVM metadevices are used. Again, the
hostname can play a role in I/O fencing and quorum matters, so the
dependency is valid - as long as the service is needed at all...

And finally the fs-root service tests whether "/usr" was requested
in /etc/vfstab as cachefs, zfs or something else, and mounts it
(always as an overlay mount per my fixes, and sometimes in the
default distro). Then, for a ZFS-rooted system, it also tries to
find a non-legacy "usr" child dataset of the current bootfs
and mount it. Note that despite the service name "root", it
assumes that the root filesystem is already mounted and only
mounts the "usr". Later on, the fs-usr service checks whether the
root and /usr are mounted from something other than local ZFS and,
if so, remounts them "rw" for subsequent use.


So here is one weak point in the chain: metadevices may depend
on the hostname (identity), and that may depend on networking.
But the metainit service is needed only if metadevices are configured
at all, and the hostname is only useful if the storage LUNs are
shared. Possibly, checks for all these conditions could be remade
into SMF services and/or dependency resources (for local file tests)
to automate the proper startup dependency chain for a particular
system, based on the config files and netstrategy it actually has
during this boot.
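
For instance, SMF already supports "path"-type dependencies on local
files, so a check like "are there SVM configs at all" could hypothetically
be expressed without a new service (the config file path and property
group name below are illustrative, and the downstream dependents would
have to be adjusted so a permanently-offline metainit does not block them):

# Hypothetical and untested:
svccfg -s svc:/system/metainit:default <<'EOF'
addpg svm_config dependency
setprop svm_config/grouping = astring: require_all
setprop svm_config/restart_on = astring: none
setprop svm_config/type = astring: path
setprop svm_config/entities = fmri: file://localhost/etc/lvm/mddb.cf
EOF
svcadm refresh svc:/system/metainit:default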


Does anyone here use SVM? At all (well, yeah, metadevices might still
make sense for swap or dump independent of the ZFS kernel module,
or for use of legacy storage devices, so I won't dismiss the
technology as completely useless)? But for roots in particular?
I am not sure whether non-ZFS roots are really supported at all
(networked or local), judging from all the marketing hype of the past
decade. Given what I see with the local ZFS split root, it is likely
that other "possible" techniques are even less possible today ;)

At least, I did not find any particular instructions for the OS
to mount the root filesystem from metadevices, at least not before
the fs-usr logic (which does consider them, and in particular fails
if the root or usr metadevices are marked read-only). That script
does explicitly mount the non-ZFS root and /usr resources as they
are referenced in /etc/vfstab, but it does not really check that
the /usr/* contents are valid (it may run /usr/sbin/bootadm in some
cases, which would fail and be fatal, but this condition is not
always checked; so a root filesystem with no /usr contents and no
explicitly mounted device would fail later, in a less apparent
manner).

I think that further discussion of this would require a diagram of
some sort, to make the dependencies of the services involved in
OS/Net configuration more visible. But the short takeaways would be
that for a local-storage ZFS-rooted system (including a ZFS-based
/usr as part of the rootfs or as a separate dataset), the
network/physical and identity:node services and the optional
metadevices can seemingly start well after filesystem/minimal
(still to be tested). If the mounted root filesystem is ZFS, many
of the other variants for sub-components of the hierarchy are not
even tried by the methods. The nodename is also irrelevant, at least
for the root pool, because it has already been imported and mounted,
and any possible enforcement (zpool import -f) is already behind us,
one way or another.

Other scenarios, if they are supported at all, such as boots from
iSCSI (iscsi/initiator depends on network, but is not a dependency
for fs-root), or NFS (binaries available, service not required),
or local or shared metadevices, may depend on networking and/or
on intermediate points in these dependencies. For example, a
local or remote boot from metadevices, using a /usr metadevice
from shared DAS/NAS storage, might indeed need its nodename,
which would be provided by DHCP/RARP rather than /etc/nodename
in the booted image. Esoteric? Plausible? Does anyone do that?

And whether those setups work or not (or if they should work) is
independent of the work I am doing for the case of local ZFS :)
Although revising the network-vs.-filesystem SMF dependencies
can indeed influence those setups somehow. TODO: Diagram! :)

As a separate decision, it may be argued that NWAM is meant for
desktops or preconfigured server farms and is not(?) intended for
such extraordinary cases as networked boots, and so should not become
a show-stopper for them?

> If I recall correctly, a fair bit of the mess was there to support PXE
> boot and the absurdly complex Solaris upgrade process.  But it's been
> a long time since those days.

I still don't know how relevant a fully-fledged networked boot is
today. It might be important for pre-built environments like the
installer image, but that carries its own miniroot with all the
files needed for autonomous work.

>>>> My worry here is that you're adding complexity to an already
>>>> complex system; instead, we need significant simplification.
>>>
>>> Another big +1 on that.
>>
>> I am sorry to hear this... I guess these would be two major nails
>> in the coffin for an attempt to RTI into the illumos-gate and
>> distros?
>
> I don't read the discussion that way.  I'm just confused about why the
> changes are a good idea and why some other analysis of the problem
> isn't needed.  I presume that someone filing that RTI would have good
> answers, but I'm not that person.
>
> In particular, I don't think it's at all wise to have services that
> start or restart other services as part of the normal boot sequence.
> That path leads to havoc.  There are special points during the install
> process when trickery like that is needed, but they're by nature
> "special."  If the svcadm insertions you've proposed are really
> required to make it work, then I think some higher-level rethink is
> needed.

Well, a large part of my research into the system "as it is" has
been laid out above.

The network/physical services as they are implemented today do not
fail in the SMF sense, despite complaining about lots of missing
programs under /usr, and they allow the system to start up - but they
lack actual networking set-up. Restarting these services after
/usr is provided does allow them to work. Starting NWAM only after
all of the /var/* namespace is guaranteed (fs-minimal) would likely
be even more reliable, in case NWAM uses an LDAP or NIS setup.
As practice shows, bluntly assuming the whole rootfs hierarchy to be
available at bootup, as NWAM does today, is error-prone and breaks
even when using (or slightly abusing) the configuration possibilities
that the system provides today.
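
A minimal sketch of that restart approach (an illustration, not the
exact published workaround), assuming it runs from a point where /usr
is already known to be mounted:

# Re-kick the physical network services whose methods had earlier
# complained about missing programs under /usr:
for svc in svc:/network/physical:default svc:/network/physical:nwam \
    svc:/network/iptun:default ; do
        state=`/usr/bin/svcs -H -o state $svc 2>/dev/null`
        [ "$state" = "online" ] && /usr/sbin/svcadm restart $svc
done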

Possibly, the "generic" solution can be presented by a series of
SMF services, with dependencies on one of them being online, and
the logic of the boot routine temp-enabling of temp-disabling a
set of these services based on the current boot's conditions -
i.e. the ZFS root, presence of certain programs in /usr/{s}bin
and other paths that are actually needed by a subsequent method,
the static or dynamic (DHCP/RARP) network setup, presence of
metadevice configs in the first place, presence of static mounts
in /etc/vfstab, etc.

It is possible that this dynamic mesh of prepared dependencies,
temporary activation of which calculated at boot from a particular
rootfs with its file-based (or DHCP/RARP-influenced) settings, can
be the clean generic solution.

Alternately, again, it is possible to fix the legacy network and
iptun startup so that they do not depend on /usr but use "ksh93-isms".
Maybe the scripts should then explicitly depend on that particular
shell interpreter. NWAM might likewise be split into a service which
runs before the filesystems, depends only on the root filesystem and
ksh builtins, and would probably be able to initialize some basic
networking from static configs or DHCP; for other tasks (if the
required resources are currently absent) it would temp-enable a new
service which runs the same method script and depends on
filesystem/minimal, to reconfigure the network after all the expected
filesystems are guaranteed to be available.
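
A minimal sketch of that temp-enable idea (the ":late-config" instance
name is purely hypothetical, and svcadm itself would have to be reachable
from the rootfs at that point - which is one of the open questions):

# Hypothetical early-boot fragment of such a split method script:
if [ ! -x /usr/sbin/ipadm ] ; then
        # The full toolset is not available yet: configure what the rootfs
        # allows now, and schedule a second pass, enabled for this boot
        # only (-t), via a companion instance depending on filesystem/minimal.
        svcadm enable -t svc:/network/physical:late-config
fi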


>> Alternately, much (not all) of this hassle could become irrelevant
>> if gzip-9 support just came for rootfs support :)
>
> Well, gee, that sounds like a more practical target to me!  The nice
> thing about doing that is that the work would have much narrower
> impact than a rototilling of the system services and /-vs-/usr
> contents, and might even be applicable to other projects.

I partially agree that this may be a better solution. It would
fulfill most of the remaining reasons I have to split / and /usr.

Except that I am not enough of a coder to complete that quest, so
unlike my presented solution, which works (with just one exception
found so far), the GRUB support for reading gzip'ed bootfs datasets
would have to be implemented by someone else.

However, there are still pieces of the rootfs in /var that may be
split off (and are, in my practice), for example to share some data
files between BEs - like /var/adm, /var/log, /var/cores, /var/crash,
/var/mail, /var/spool/{client,}mqueue. Keeping these (and maybe others)
as separate datasets gives more administrative advantages than just
compression: the same files and mail queues can be used in different
BEs, which may be important for people who switch back and forth;
and different dataset policies (ZFS attributes) can be applied to the
datasets, such as quotas to keep /var/cores from overfilling the
storage, or disabling zfs-auto-snap for these datasets...
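
A hypothetical example of creating such shared datasets (the names
follow the rpool/SHARED convention suggested further below; adjust
to taste):

zfs create -p -o mountpoint=/var/cores rpool/SHARED/var/cores
zfs set quota=10G rpool/SHARED/var/cores                    # keep cores bounded
zfs set com.sun:auto-snapshot=false rpool/SHARED/var/cores  # skip auto-snapshots
zfs create -p -o mountpoint=/var/mail rpool/SHARED/var/mail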

This kind of split also allows one to keep such potentially large and
write-active datasets out of the root pool and store them on other
storage - which can be important on pools with read-mostly media
(possibly even read-only, untested) and on pools with constrained
storage size.

Also, all of these come up late in boot (filesystem/minimal for /var,
/var/adm and /tmp, and filesystem/local for the others), and it is
still important for the services which rely on the presence of /var
and /var/* to start up after these filesystem services. Doing
otherwise is a design bug on their part.

Other notes on the split-off pieces of /var:

Possibly, /var/pkg or its components could be separated in a similar
manner - but I am not sure enough about the under-the-hood workings
of pkg(5) to be certain about the benefits of the split. One thing
I do know is that the package cache in obsoleted BEs takes up a
considerable amount of disk space, which is apparently wasted as its
contents are replaced during a package upgrade. I wonder whether this
cache can contain packages from different releases, so that one shared
/var/pkg could serve all revisions of the OS (at least, of "releases"
with differing package version numbers). That is something to
experiment with; I don't know the answer and have no practice here.

Separating /var/tmp did lead to problems in my earlier tests. Maybe
this is because it is currently mounted (if it is a separate FS) as
part of filesystem/local (after the single-user milestone), and some
services like ipfilter do try to use the directory "tmp" provided by
the $bootfs/var dataset (then either that usage fails due to protection
by immutability, or "mount -a" fails due to a non-empty directory).
Adding it explicitly to filesystem/minimal (among the "/var /var/adm
/tmp" points specially processed today) would likely solve the problem;
the ipfilter service in particular does depend on filesystem/minimal.
While filesystem/minimal does mount all child datasets of the current
bootfs, I believe I did not test splitting off /var/tmp in this
manner - if I need it separated, I'd need it shared and not wasting
space in the upgraded BE snapshots.

If the split-root setup is ultimately RTI'able, it might make sense
to standardize the shared filesystem naming (e.g. $rpool/SHARED)
and explicitly mount the active dataset tree from under this point
as part of filesystem/minimal (after /var and the rest), so that
in terms of SMF dependencies this would add nothing new to the
expectations of rootfs-hierarchy availability.
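
A minimal sketch of what that explicit mounting could look like inside
the filesystem/minimal method (dataset naming as above, purely
illustrative):

# Mount the active shared tree early, so later services can rely on
# the /var/* hierarchy being complete; errors from datasets that are
# already mounted or have canmount=off are ignored here.
zfs list -H -o name -r rpool/SHARED 2>/dev/null | while read ds ; do
        zfs mount "$ds" 2>/dev/null
done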

I have not yet published such changes to the filesystem SMF methods,
because they would introduce some new behaviours; so far my published
fixes limit themselves to having these scripts do the same job more
reliably.

-----------

Overall, the concise points for me so far:

1) GZIP(-9) support for the bootfs in GRUB and the kernel (beadm, etc.)
would solve part of the problem area, but not all of it.

2) There are services delivered today with improper dependencies.
This is a design bug on their part - e.g. the NWAM initialization code
in practice depends on a proper minimal filesystem, while the FS
services are defined to SMF-depend on networking (NWAM if need be).
These dependencies may be applicable to the most generic case, which
would include networked boot, but for a local-storage boot at least
the NWAM service's declared dependencies are severely wrong.

3) There are not so many components of the OS that must initialize
before the single-user milestone (as its components) and/or before the
filesystem/minimal service, so adding support for the split-root
configuration should not require a heavy toll on testing of "all
software that is integrated". If a component's scope is low-level
system setup, then yes, it needs testing. If it is application software
which enjoys a successfully prepared operating environment sandbox -
there is probably nothing new for it.

4) There may be some components of the filesystems that would better
be explicitly mounted in filesystem/minimal (maybe including /var/tmp
and /var/pkg{/*}, maybe the whole rpool/SHARED/* active dataset tree)
in order to allow optional "shared" storage of these files, and/or
storage in child datasets of the rootfs for quota enforcement etc.,
without sacrificing the bootability of the systems.

5) Support for split-root setups brings more versatility and options
in the deployment and maintenance of systems, and would especially
benefit smaller systems where tradeoffs must be made and the
brute-force approach of "let's throw more bucks at it" does not pass
easily, or is not possible due to hardware constraints. Yes, a
non-trivial setup is more complex by definition. And this allows it
to be more flexible, which may be a good thing :)

I am not forcing everyone to use these configurations; I just ask
for the fixes that support them to be included in the main distribution,
so that they don't have to be bolted onto every distro or even every
final setup. Though if they do get to the point of being polished
and recognized as a reliable setup - I can't see why not to adopt
them... but to each their own :)

Thanks, and I hope I did not leave too many incomplete thoughts ;)
//Jim Klimov



