[oi-dev] [developer] BMC driver on Illumos

Garrett D'Amore garrett.damore at dey-sys.com
Thu Mar 28 19:12:22 UTC 2013


On Mar 28, 2013, at 12:05 PM, Jim Klimov <jimklimov at cos.ru> wrote:

> So, for the case of dedicated-hardware watchdogs, this is the part of
> your post which I can't find as relevant: "The usual thing is to hook
> this up to a system timer, which will catch hard hangs."

What I mean is that most systems don't expose an API to userland; they just have something that runs off the system timer and tickles the hardware watchdog register.  This guards against a hard hang of the entire system/scheduler, but it does nothing to ensure that upper-layer services are still being handled.

Now I've not looked at Linux and how it uses watchdogs… but I have experience with a few different embedded systems, and the above handling is almost precisely what I've seen done.  NetBSD was nice because it instead offered a watchdog facility that extended into userland, allowing the service check to be done by a userland daemon, which is far more interesting than just knowing that the clock interrupt handler is still working properly. :-)
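
To make the difference concrete, here is a minimal sketch in Python.  It is a pure software model, not NetBSD's or illumos's actual interface: the Deadman class, the run() helper and the tick-based clock are all hypothetical names invented for illustration.  It compares a timer-driven tickle (which only proves the clock is still running) with a daemon-style tickle that first checks the service.

```python
class Deadman:
    """Software model of a watchdog ("deadman") timer. Hypothetical, for illustration."""

    def __init__(self, timeout):
        self.timeout = timeout   # ticks allowed between tickles
        self.deadline = None     # None means the deadman is disarmed

    def arm(self, now):
        self.deadline = now + self.timeout

    def tickle(self, now):
        # Push the deadline out; only meaningful while armed.
        if self.deadline is not None:
            self.deadline = now + self.timeout

    def expired(self, now):
        return self.deadline is not None and now >= self.deadline


def run(deadman, healthy, ticks, check_service):
    """Advance simulated time; return the tick at which the deadman fires, or None.

    check_service=False models the timer-driven tickle: it happens on every
    tick as long as the clock runs, regardless of what the service is doing.
    check_service=True models a userland daemon that tickles only when the
    service it watches reports healthy.
    """
    deadman.arm(0)
    for now in range(1, ticks + 1):
        if deadman.expired(now):
            return now
        if not check_service or healthy(now):
            deadman.tickle(now)
    return None


# Assumed scenario: the service wedges at tick 5, but the clock keeps running.
service_ok = lambda t: t < 5

timer_result = run(Deadman(3), service_ok, 20, check_service=False)
daemon_result = run(Deadman(3), service_ok, 20, check_service=True)
print(timer_result, daemon_result)
```

In this model the timer-driven deadman never fires, because the clock is still alive, while the daemon-driven one fires once the timeout has passed since the last successful check.  A real implementation would replace tickle() with a write to the watchdog register or an ioctl on whatever device the framework exposes.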

	- Garrett

> 
> Sorry for the long ramble,
> //Jim
> 
> On 2013-03-28 18:21, Garrett D'Amore wrote:
>> 
>> On Mar 28, 2013, at 9:39 AM, Jim Klimov <jimklimov at cos.ru> wrote:
>> 
>>> On 2013-03-28 16:18, Sašo Kiselkov wrote:
>>>> I'm building a system that's relying as much as possible on stock parts,
>>>> so custom kernel modules and hacking is something I'd like to avoid. I'm
>>>> not going to be around forever to keep the system going, or to
>>>> continually work on ways of deploying an old hack on a new install.
>>> 
>>> I know *you* do have better contributions to make, but a watchdog driver
>>> is AFAIK about knowing what byte to write to what IO port to set, reset
>>> and query the timeout, and possibly configure what the watchdog does
>>> when the timer expires without updates. This info might be gleaned from
>>> Linux and BSD drivers for different watchdog chips.
>>> 
>>> I think it might be a useful project for a student to make.
>>> 
>>> Possibly too low-profile for a GSoC, but good to learn about driver
>>> development, porting code, etc. And quite useful for the community ;)
>>> As a result of such a project, we'd get one more kernel-hacker ;)
>> 
>> I've done such work for NetBSD systems.  These things are usually pretty trivial from a hardware standpoint.
>> 
>> The harder thing is when these things are exposed as "registers" that are on an otherwise bog-standard part.  In that case, you have to either modify an existing driver, or come up with some more tricky hack.  (It's easier when this function is exposed as a separate PCI function or something like that.  But that's very rarely the case with something like this.  Usually they are part of the low level system chipset -- they kind of need to be in order to do something like generate an NMI or cause a power reset.)
>> 
>> Then the other side of the problem is determining how you are going to trigger this.  The usual thing is to hook this up to a system timer, which will catch hard hangs.  But many "apparent" hangs are really not hangs in this sense -- there could be a high-priority process that is starving other processing for example, or a deadlock in the filesystem.  Those kinds of "hangs" won't be detected by such a deadman.
>> 
>> The ideal design would be a deadman accessible from user space, one that lets user processes configure it and then tickle it to keep it alive.  This would allow a critical user-space process to validate that *it* is still serving whatever it needs to.  This kind of task requires a little design work -- and probably should be hooked back into some common deadman framework.  NetBSD has such a framework if I recall correctly.  This project would be in scope for a GSoC effort, because I can see a few other options, like using the system timer as a deadman (it's already there btw!) if no other hardware watchdog is present.  The framework should abstract all of those and present a single syscall or ioctl interface to manage them.
> 
> 
> 




