[OpenIndiana-discuss] Split-root installations

Wed Nov 27 11:16:05 UTC 2013

First of all, thank you for sharing your advice.
Coming from a distro maintainer, it is very valuable :)

On 2013-11-27 11:03, Peter Tribble wrote:
> On Wed, Nov 27, 2013 at 12:51 AM, Jim Klimov <jimklimov at cos.ru> wrote:
>> So far the best shot, both compact and effective, was fixing the
>> filesystem/root service like in the patch below. Are there any
>> fundamental objections to this sort of fix beside possible lack
>> of style? Getting the hands wet in the guts of the system anyhow,
>> can't keep them clean... ;)
>
> General comment - attempting to hack workarounds to broken
> systems at the wrong place in the startup sequence seems like
> a really bad idea.

Yes, maybe this is a workaround rather than a "proper solution",
which would mean re-architecting the system startup, with unknown
repercussions. I might not be the sort of specialist (or don't
have enough spare time) needed to do such an overhaul, but I can
patch in a workaround which lets systems work. Even if it is ugly.

Understanding the drawbacks and tradeoffs, which is what you
help me do, is of course important. But one of the matters
to consider is whether the things which might break after
applying my workarounds actually worked (or could work)
before them.

> How can adding /usr change the validity of the network
> configuration? There shouldn't be any new drivers (or
> network booting wouldn't work at all), so there should be
> nothing to do here.

Again, this is a theoretical construct which depends on things
we can't really manage - delivery of drivers by their vendors.
For good or bad, the /usr/kernel/drv is a valid location...

Say, your small root delivers the illumos drivers like e1000g,
and this is enough to mount the rootfs (if networked, or this
doesn't matter at all for a local rootfs). Then your /usr comes
up with, say, a vendor-provided package of VendorX NIC drivers.
In fact, in case of a networked boot, this may be a different
/usr filesystem image from what you had in the miniroot archive
that the bootloader fetched from the network to load the kernel
and initialize the system far enough to get networked filesystems.
And in a split-root local filesystem you had no /usr at all.

Your network/physical service, be it legacy-default or nwam,
has already started, because it is earlier in SMF dependencies.
Let's assume for now that the bugs under discussion have been
fixed, and some NICs have actually been configured by this service.
This service is expected to have configured all NICs on the system.
Obviously, it could not plumb those NICs that it did not have
drivers for at that time...

So adding /usr can change the validity of the network config.

> If networking is broken at this stage (and it shouldn't be) then
> that's a bug elsewhere - fixing fs-root is the wrong place.

At the very least, and as I wrote before, I found this problem
*because* the "physical" startup methods fired before the separate
/usr was mounted. Their SMF dependencies actually require this
ordering, which may be reasonable for network boots but is
unreasonable for fully local-device boots.

> restart = disable and enable. If your /usr is nfs mounted,
> what happens when you disable networking?

I see a good point here, thanks :)

So this trick with restarts should take place only in the case
that I am testing, with a locally mounted /usr filesystem (though
how can we determine this reliably with e.g. iSCSI pools?). Or
perhaps the script could check that there was no /usr/bin when it
started and that there is one after it has done its work; this
is actually the situation I am trying to counter here.
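
To make the second idea concrete, here is a rough sketch (not the
patch discussed in this thread). It records whether /usr/bin exists
before the mount logic runs and restarts networking only if it
appeared afterwards. USR_PROBE, NET_FMRI and the function names are
made up for illustration, and the instance to restart (default vs.
nwam) is an assumption that depends on the system.

```shell
# Sketch only: restart networking only if /usr/bin appeared while
# this script ran. All names below are illustrative assumptions.
USR_PROBE=${USR_PROBE:-/usr/bin}
NET_FMRI=${NET_FMRI:-svc:/network/physical:default}
SVCADM=${SVCADM:-/usr/sbin/svcadm}

# Print "present" if the given directory exists, else "absent".
usr_state() {
    if [ -d "$1" ]; then echo present; else echo absent; fi
}

# $1 = state before mounting, $2 = state after mounting.
# Restart the network service only on an absent -> present change.
restart_net_if_usr_appeared() {
    if [ "$1" = absent ] && [ "$2" = present ]; then
        $SVCADM restart "$NET_FMRI"
        return $?
    fi
    return 0
}

# Intended use inside the method script:
#   before=$(usr_state "$USR_PROBE")
#   ... existing code that mounts /usr ...
#   after=$(usr_state "$USR_PROBE")
#   restart_net_if_usr_appeared "$before" "$after"
```

With this check, a /usr that was already available when the script
started (e.g. from the miniroot of a network boot) never triggers
the restart, so an NFS-mounted /usr is not pulled out from under
the running system.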

On the other hand, I might guess (but shouldn't rely upon this)
that an already-started program (svcadm) from a /usr mounted
over NFS would be loaded and cached in memory and wouldn't
depend on networking to complete. After all, this is not much
different from a network interruption for any other cause, and
an NFS-mounted /usr would be mounted via /etc/vfstab (earlier
in the fs-root script) without reliance on SMF services
like nfs/client or autofs, so a restart of the network shouldn't
cause trouble with race conditions either.
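
Since an NFS-mounted /usr would be listed in /etc/vfstab, one
partial way to tell the cases apart is to look up the FS type
recorded there. The sketch below uses only shell builtins, so it
does not need anything from /usr itself. The function names and
the VFSTAB variable are illustrative assumptions. It does not
answer the iSCSI question: a /usr on a ZFS dataset is normally not
in vfstab at all, so "not listed as nfs" is not proof of a purely
local device.

```shell
# Sketch only: look up the FS type that vfstab records for a mount
# point. VFSTAB and the function names are illustrative.
VFSTAB=${VFSTAB:-/etc/vfstab}

# Print the FS type (4th field) for mount point $1 (3rd field).
# Returns 1 if the mount point is not listed.
vfstab_fstype() {
    while read special fsckdev mountp fstype fsckpass automnt mntopts
    do
        case "$special" in
        '#'*|'') continue ;;    # skip comments and blank lines
        esac
        if [ "$mountp" = "$1" ]; then
            echo "$fstype"
            return 0
        fi
    done < "$VFSTAB"
    return 1
}

# Returns 0 only if $1 is listed in vfstab with FS type "nfs".
is_nfs_mount() {
    [ "$(vfstab_fstype "$1")" = nfs ]
}
```

A caller could then skip the network restart whenever
is_nfs_mount /usr succeeds.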

> Again, if something is in maintenance then you need to fix the
> underlying cause. (And you don't check to see whether the
> clear was successful, nor do you wait for it to come online
> to avoid possible race conditions.)

Well, in fact during my initial testing when the problem was
detected, with legacy static files, legacy DHCP or NWAM DHCP,
the network/physical service never went into maintenance.
There were lots of errors about binaries not being found in the
volatile logfile, and the network interfaces were not up and
not configured at all, but indeed there were no SMF failures.
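
For completeness, the check-and-wait step asked for in the quoted
comment could be sketched roughly as follows. It relies on
"svcs -H -o state" to read the instance state and on the exit
status of "svcadm clear". The function names and the 60-second
default timeout are arbitrary choices for illustration, not
something taken from the patch.

```shell
# Sketch only: clear a service if it is in maintenance, check that
# the clear was accepted, then wait for the instance to come online.
SVCS=${SVCS:-/usr/bin/svcs}
SVCADM=${SVCADM:-/usr/sbin/svcadm}

# Print the current SMF state of instance $1 (e.g. "online").
svc_state() {
    $SVCS -H -o state "$1" 2>/dev/null
}

# Poll once per second until $1 is online; give up after $2 tries
# (default 60). Returns 0 if online, 1 on timeout.
wait_online() {
    tries=${2:-60}
    while [ "$tries" -gt 0 ]; do
        [ "$(svc_state "$1")" = online ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Clear only when in maintenance, fail if svcadm fails, then wait.
clear_and_wait() {
    if [ "$(svc_state "$1")" = maintenance ]; then
        $SVCADM clear "$1" || return 1
    fi
    wait_online "$1" "$2"
}
```

A bounded wait matters here because the method script itself runs
early in boot; blocking forever on a service that never comes
online would hang the rest of the startup.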

After all, this piece was inspired by NWAM, which does lots
of SMF manipulation, both in service attributes and in active
management of instances.

Again, thanks for the constructive discussion.
Your comments did push me to either seek and find, or at least
consider and consciously discard, other solutions, more than once :)
//Jim