[oi-dev] Are there any active healthchecks for SMF in general?

Wed Apr 18 11:37:21 UTC 2012

Hello all,

   I wonder if there are any RFEs or ongoing work regarding
proactive health-checks for SMF services (with the test routine
defined by the service author or packager and/or by the local
system admin)?

   I think that just like there are "start", "stop", "refresh"
methods and so on, there could also be a "healthcheck" method
with its associated timeouts, as well as frequency of tests,
tolerable amount of test failures in a row and/or within a
given time range, etc. There could also be a policy to choose
what to do if the healthcheck fails (too many times): offline
the service, set it to maintenance, restart it, or something else?
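
   To make the idea a bit more concrete, here is a rough shell
sketch of the bookkeeping such a framework would need. Everything
in it is hypothetical: SMF has no "healthcheck" method, and the
names MAX_FAILS, ON_FAIL, record_check and policy_command are
made up for illustration. Only the svcadm subcommands are real,
and "svcadm disable -t" is just the closest approximation of
"offline the service". Timeouts and test frequency are left out.

```shell
# Sketch only: these settings are NOT existing SMF properties.
MAX_FAILS=${MAX_FAILS:-3}     # tolerable number of failures in a row
ON_FAIL=${ON_FAIL:-restart}   # policy: restart | maintenance | offline

FAILS=0                       # consecutive failures seen so far

# record_check COMMAND [ARGS...]
# Runs the test command once and updates the failure counter.
# Returns 0 while the service is still considered healthy,
# non-zero once MAX_FAILS consecutive failures have been reached.
record_check() {
    if "$@" >/dev/null 2>&1; then
        FAILS=0
        return 0
    fi
    FAILS=$((FAILS + 1))
    [ "$FAILS" -lt "$MAX_FAILS" ]
}

# policy_command POLICY FMRI
# Prints the svcadm call that would implement the chosen policy.
policy_command() {
    case "$1" in
        restart)     echo "svcadm restart $2" ;;
        maintenance) echo "svcadm mark maintenance $2" ;;
        offline)     echo "svcadm disable -t $2" ;;
        *)           echo "unknown policy: $1" >&2; return 1 ;;
    esac
}
```

   A periodic caller would run record_check with the test routine
and, once it returns non-zero, execute whatever policy_command
prints for $ON_FAIL and the service FMRI.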

   In fact, if a "healthcheck" method is validly defined, it
should be fired after the "start" method has run, and the SMF
service state should change from "offline*" to "online" only
after a successful test. Some service methods exit as soon as
the target daemon has started, even though the service only
becomes useful a few minutes later.

   I've had to script "crutches" like that for many different
projects, usually involving a test routine fired from crontab
or a specialized startup script which includes the needed checks
on prerequisite services as well as on the actual results of
the startup.

   As an example, think of Apache Tomcat with its default start
scripts - they exit after spawning the JVM, but the user-required
webapps can take minutes to initialize and start up. Currently
SMF would "online" the service as soon as the script exited,
and proceed to starting up the dependent services. However,
for us the service is actually "online" only when the servlet
container has *logged* that its startup routine is complete.
And if other SMF services depend on what this Tomcat runs (say,
an OpenDJ LDAP server), it is "online" only when it responds
correctly to LDAP queries, and not before.
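
   Lacking such a method, the usual workaround is to make the
"start" method itself block until the service is really up. A
minimal sketch of that, assuming Tomcat writes its usual "Server
startup in ... ms" line to catalina.out and that the log is fresh
for this run (a stale line from an earlier run would match too).
The helper name wait_for_log and the paths in the usage comment
are my assumptions, and the method's timeout_seconds in the
manifest has to be larger than the wait.

```shell
# wait_for_log FILE PATTERN TIMEOUT_SECONDS
# Polls FILE once per second until a line matching PATTERN appears.
# Returns 0 when the pattern is found, 1 if the timeout expires.
wait_for_log() {
    file=$1
    pattern=$2
    left=$3
    while [ "$left" -gt 0 ]; do
        # Output goes to /dev/null because older greps lack -q
        if grep "$pattern" "$file" >/dev/null 2>&1; then
            return 0
        fi
        sleep 1
        left=$((left - 1))
    done
    return 1
}

# Possible use in a start method (paths are assumptions):
#   . /lib/svc/share/smf_include.sh
#   "$CATALINA_HOME/bin/startup.sh" || exit $SMF_EXIT_ERR_FATAL
#   wait_for_log "$CATALINA_HOME/logs/catalina.out" \
#       "Server startup in" 300 || exit $SMF_EXIT_ERR_FATAL
```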

   In case of webserver SMF-services the tests usually request
a healthcheck page or some other page and compare it with the
expected "healthy" template. For DBMS or LDAP services that
would be an SQL or ldapsearch query. In case of crontabs there
are tricks (e.g. lockfiles) to keep the test script from
running as numerous parallel invocations if the tested service
takes too long to respond.
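
   One such lock trick, sketched under the assumption that a
directory created with mkdir serves as the lock (mkdir either
creates it or fails, so two cron runs cannot both win). The name
run_locked and the exit code 75 are arbitrary choices of mine,
and a lock left behind by a killed script is not cleaned up here.

```shell
# run_locked LOCKDIR COMMAND [ARGS...]
# Runs COMMAND only if LOCKDIR can be created, i.e. no other
# instance currently holds the lock. Returns COMMAND's exit
# status, or 75 if the previous run has not finished yet.
run_locked() {
    lock=$1
    shift
    if ! mkdir "$lock" 2>/dev/null; then
        echo "previous check still running, skipping" >&2
        return 75
    fi
    rc=0
    "$@" || rc=$?
    rmdir "$lock"
    return $rc
}
```

   From crontab this would wrap the actual test, for example
"run_locked /tmp/mysvc-check.lock /path/to/check-script", where
both paths are placeholders.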

   Recently (in my vboxsvc[1] project for controlling the
VirtualBox VMs as SMF service instances), I've taken a different
approach and made a background loop initiated and executed by
the service method script; part of that loop's job is to check
whether the VM is not only running as a process on the Solaris
host, but also provides the service it was booted for (if the
test method was validly defined and configured and enabled).
Originally the loop got there because the service is transient
(due to VirtualBox internals) and SMF does not monitor the
service's child processes, but we needed to monitor anyway
whether the VMs are running or not, and stop the VM processes
gracefully when the SMF service is stopped. Then things got
expanded a bit... ;)
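
   The shape of such a loop, reduced to a sketch (this is not the
actual vbox.sh code, and the function name and return codes are
made up here): keep going while the process exists, and on each
pass also run the optional service-level test.

```shell
# monitor PID INTERVAL CHECK_COMMAND [ARGS...]
# Loops while process PID exists, running CHECK_COMMAND every
# INTERVAL seconds.
# Returns 1 when the process is gone, 2 when the process is still
# there but the service-level check failed.
monitor() {
    pid=$1
    interval=$2
    shift 2
    # kill -0 sends no signal, it only tests that the PID exists
    while kill -0 "$pid" 2>/dev/null; do
        "$@" || return 2
        sleep "$interval"
    done
    return 1
}
```

   A method script would start this in the background after
launching the VM and react to the two return codes according to
whatever policy is configured for the instance.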

   I wonder if it would be useful to generalize the solution
and/or recode it in some more efficient manner to be available
for all SMF services as an optional part of the framework?
In principle some of it is already there: SMF checks that child
processes still exist for "contract" and "wait" type services,
and that none of them died from a "bad signal" or dumped core.

   What do you think? Would that logic be useful as a generic
part of SMF? Can it be left as (includable?) shell scripts and/or
rewritten in Perl for efficiency? Would anyone undertake to
revise and rewrite the logic in C? ;)

[1] http://sourceforge.net/projects/vboxsvc - my VBoxSvc project
     which controls VMs as SMF service instances, with optional
     healthchecks.

[2] 
http://vboxsvc.svn.sourceforge.net/viewvc/vboxsvc/lib/svc/method/vbox.sh?content-type=text%2Fplain
     The main script (keywords: KICKER vmsvccheck monitoring hook).

Thanks for any ideas,
//Jim Klimov