[oi-dev] Are there any active healthchecks for SMF in general?

Wed Apr 18 14:33:40 UTC 2012

tl;dr

Seems like an upstream question (illumos-gate and/or illumos-userland).

Even in terms of the upstream, we don't really do RFEs unless there's
enough interest in the issue that someone's likely to write some code.

On Wed, Apr 18, 2012 at 12:37 PM, Jim Klimov <jimklimov at cos.ru> wrote:
> Hello all,
>
>  I wonder if there are any RFEs or ongoing work regarding
> proactive health-checks for SMF services (test routine to be
> defined by the service author or packager and/or by local
> system admin)?
>
>  I think that just like there are "start", "stop", "refresh"
> methods and so on, there could also be a "healthcheck" method
> with its associated timeouts, as well as frequency of tests,
> tolerable amount of test failures in a row and/or within a
> given time range, etc. There could also be a policy to choose
> what to do if the healthcheck fails (too many times): offline
> the service, set it to maintenance, restart it, or something else?
>
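
The failure policy described above (a tolerated number of failures in a
row and/or within a time range, then a configured action) could be
sketched roughly as follows. This is only an illustration and not an SMF
API: the class name HealthPolicy, its parameters and the action strings
are all hypothetical.

```python
from collections import deque


class HealthPolicy:
    """Hypothetical failure policy for a periodic healthcheck (not an SMF API)."""

    def __init__(self, max_consecutive=3, max_in_window=5,
                 window_seconds=300.0, action="restart"):
        self.max_consecutive = max_consecutive  # tolerated failures in a row
        self.max_in_window = max_in_window      # tolerated failures per window
        self.window_seconds = window_seconds    # length of the sliding window
        self.action = action                    # e.g. "offline", "maintenance", "restart"
        self.consecutive = 0
        self.failures = deque()                 # timestamps of recent failures

    def record(self, ok, now):
        """Record one test result taken at time `now` (in seconds).

        Returns the configured action when a threshold is reached,
        otherwise None.
        """
        # Forget failures that have fallen out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        if ok:
            self.consecutive = 0
            return None
        self.consecutive += 1
        self.failures.append(now)
        if (self.consecutive >= self.max_consecutive
                or len(self.failures) >= self.max_in_window):
            return self.action
        return None
```

A caller would run the test routine on its own schedule, pass each result
to record(), and act only when something other than None comes back.
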
>  In fact, if the "healthcheck" method is validly defined, it
> should be fired after running the "start" method and only after
> a successful test the SMF service state should transfer from
> "offline*" to "online". Some service methods exit as soon as
> the target daemon has started, even though the service becomes
> useful after a few minutes.
>
>  I've had to script "crutches" like that for many different
> projects, usually involving a test routine fired from crontab
> or crafting a specialized startup script which includes the
> needed checks on prerequisite services as well as on the actual
> results of startup.
>
>  As an example, think Apache Tomcat with its default start
> scripts - they exit after spawning JVM, but the user-required
> webapps can take minutes to initialize and start up. Currently
> SMF would "online" the service as soon as the script exited,
> and proceed to starting up the dependent services. However,
> the service is really "online" for us only when the
> servlet container has *logged* that its startup routine is
> complete. If other SMF services depend on this Tomcat (say,
> it is running an OpenDJ LDAP server), it is "online" only when
> it responds correctly to LDAP queries, and not before.
>
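
One way to picture the "online only after a successful test" idea is a
start wrapper that launches the daemon and then polls a probe until it
passes or a timeout expires. The sketch below is an assumption-laden
illustration, not how SMF works today: wait_until_ready and log_contains
are hypothetical helper names, and the log path and marker string shown
in the note after it are placeholders to adapt.

```python
import time


def wait_until_ready(probe, timeout=300.0, interval=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `probe()` until it returns True or `timeout` seconds elapse.

    Returns True if the probe succeeded, False if the deadline passed.
    `clock` and `sleep` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def log_contains(path, marker):
    """Build a probe that succeeds once `marker` appears in the file at `path`."""
    def probe():
        try:
            with open(path, "r", errors="replace") as f:
                return any(marker in line for line in f)
        except OSError:
            # Log not created yet (or unreadable): treat as "not ready".
            return False
    return probe
```

For the Tomcat case, a wrapper might call something like
wait_until_ready(log_contains("/path/to/catalina.out", "Server startup in"))
after spawning the JVM and exit non-zero if it returns False. An LDAP or
HTTP probe would plug into the same loop in place of the log check.
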
>  In case of webserver SMF-services the tests usually request
> a healthcheck page or some other page and compare it with the
> expected "healthy" template. For DBMS or LDAP services that
> would be an SQL or ldapsearch query. In case of crontabs there
> are tricks (e.g. lockfiles) to keep the test script from
> running in numerous parallel invocations if the tested service
> takes too long to respond.
>
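
The lockfile trick mentioned above can be sketched as an exclusive-create
guard around the test run. This is a minimal illustration under stated
assumptions: single_instance is a hypothetical helper, and the sketch
does not handle stale lockfiles left behind by a crashed run.

```python
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Yield True if this process obtained the lockfile, False otherwise.

    The lockfile is created with O_CREAT | O_EXCL, which fails if the file
    already exists, so only one invocation at a time gets True.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        # A previous invocation is still running (or left a stale lock).
        yield False
        return
    try:
        # Record our PID so an admin can see who holds the lock.
        with os.fdopen(fd, "w") as f:
            f.write(f"{os.getpid()}\n")
        yield True
    finally:
        os.unlink(lock_path)
```

A cron-driven test script would wrap its work in
"with single_instance(path) as got:" and simply return when got is False,
so a slow service does not cause test runs to pile up.
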
>  Recently (in my vboxsvc[1] project for controlling the
> VirtualBox VMs as SMF service instances), I've taken a different
> approach and made a background loop initiated and executed by
> the service method script; part of that loop's job is to check
> whether the VM is not only running as a process on the Solaris
> host, but also provides the service it was booted for (if the
> test method was validly defined and configured and enabled).
> Originally the loop got there because the service is transient
> (due to VirtualBox internals) and SMF does not monitor the
> service's child processes, but we needed to monitor anyway
> whether the VMs are running or not, and stop the VM processes
> gracefully when the SMF service is stopped. Then things got
> expanded a bit... ;)
>
>  I wonder if it would be useful to generalize the solution
> and/or recode it in some more efficient manner to be available
> for all SMF services as an optional part of the framework?
> Theoretically it is there, somewhat - SMF already checks that
> child processes exist for "contract/wait" type of services,
> and that none died on "bad signals" such as those causing a core dump.
>
>  What do you think? Would that logic be useful as generic part
> of SMF? Can it be left as (includable?) shell scripts and/or
> rewritten into perl for efficiency? Would anyone undertake to
> revise and rewrite the logic into C? ;)
>
> [1] http://sourceforge.net/projects/vboxsvc - my VBoxSvc project
>    which controls VMs as SMF service instances, with optional
>    healthchecks.
>
> [2]
> http://vboxsvc.svn.sourceforge.net/viewvc/vboxsvc/lib/svc/method/vbox.sh?content-type=text%2Fplain
>    The main script (keywords: KICKER vmsvccheck monitoring hook).
>
> Thanks for any ideas,
> //Jim Klimov