[oi-dev] [developer] BMC driver on Illumos

Thu Mar 28 17:21:17 UTC 2013

On Mar 28, 2013, at 9:39 AM, Jim Klimov <jimklimov at cos.ru> wrote:

> On 2013-03-28 16:18, Sašo Kiselkov wrote:
>> I'm building a system that's relying as much as possible on stock parts,
>> so custom kernel modules and hacking is something I'd like to avoid. I'm
>> not going to be around forever to keep the system going, or to
>> continually work on ways of deploying an old hack on a new install.
> 
> I know *you* do have better contributions to make, but a watchdog driver
> is AFAIK about knowing what byte to write to what IO port to set, reset
> and query the timeout, and possibly configure what the watchdog does
> when the timer expires without updates. This info might be gleaned from
> Linux and BSD drivers for different watchdog chips.
> 
> I think it might be a useful project for a student to make.
> 
> Possibly too low-profile for a GSoC, but good to learn about driver
> development, porting code, etc. And quite useful for the community ;)
> As a result of such a project, we'd get one more kernel-hacker ;)

I've done such work for NetBSD systems.  These things are usually pretty trivial from a hardware standpoint.

The harder thing is when these things are exposed as "registers" on an otherwise bog-standard part.  In that case, you have to either modify an existing driver or come up with some trickier hack.  (It's easier when this function is exposed as a separate PCI function or something like that.  But that's very rarely the case with something like this.  Usually they are part of the low-level system chipset -- they kind of need to be in order to do something like generate an NMI or cause a power reset.)

Then the other side of the problem is determining how you are going to trigger this.  The usual thing is to hook this up to a system timer, which will catch hard hangs.  But many "apparent" hangs are really not hangs in this sense -- there could be a high-priority process that is starving other processes, for example, or a deadlock in the filesystem.  Those kinds of "hangs" won't be detected by such a deadman, because the timer keeps firing and keeps resetting it.
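To make the difference concrete, here is a toy model (not real driver code).  Everything in it is an assumption made up for illustration: the names sim_deadman_t, sim_tickle, sim_tick and sim_run, and the use of abstract "ticks" for time.  It only shows that a deadman tickled from the timer never notices a stalled application, while one tickled by the application itself does.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, time measured in abstract ticks. */
typedef struct sim_deadman {
    int timeout;    /* ticks allowed between tickles */
    int remaining;  /* ticks left before the deadman fires */
} sim_deadman_t;

/* Restart the countdown. */
static void sim_tickle(sim_deadman_t *d) { d->remaining = d->timeout; }

/* Advance one tick; returns true when the countdown reaches zero. */
static bool sim_tick(sim_deadman_t *d) { return --d->remaining <= 0; }

/*
 * Simulate `ticks` ticks.  The system timer runs the whole time; the
 * application stops making progress at tick `app_stalls_at` (think of a
 * deadlock or starvation).  If tickle_from_timer is true, the timer
 * tickles the deadman on every tick; otherwise only a healthy
 * application does.  Returns the tick at which the deadman fired, or -1
 * if it never fired.
 */
int sim_run(int ticks, int timeout, int app_stalls_at, bool tickle_from_timer)
{
    assert(ticks > 0 && timeout > 1);
    sim_deadman_t d = { timeout, timeout };

    for (int t = 0; t < ticks; t++) {
        bool app_alive = (t < app_stalls_at);

        if (tickle_from_timer || app_alive)
            sim_tickle(&d);
        if (sim_tick(&d))
            return t;
    }
    return -1;
}
```

With a timeout of 3 ticks and an application that stalls at tick 5, sim_run(20, 3, 5, true) never fires, while sim_run(20, 3, 5, false) fires two ticks after the stall.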

The ideal type of design would be a deadman accessible from user space, one that allows user processes to configure it and then tickle it to keep it alive.  This would allow you to have a critical user-space process validate that *it* is still serving whatever it needs to.  This kind of task requires a little design work -- and probably should be hooked back into some common deadman framework.  NetBSD has such a framework, if I recall correctly.  This project would be in scope for a GSoC effort, because I can see a few other options, like using the system timer as a deadman (it's already there, btw!) if no other hardware watchdog is present.  The framework should abstract all of those and present a single syscall or ioctl interface to manage it.
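As a rough sketch of what such a framework might look like, assuming a per-backend operations table and a single control entry point.  None of these names (deadman_ops_t, deadman_ctl, the DEADMAN_* commands, soft_deadman_t) come from illumos or NetBSD; they are made up here, and the software backend uses a plain counter where a real kernel would use the system timer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical backend interface: one table per watchdog chip or fallback. */
typedef struct deadman_ops {
    const char *name;
    int (*arm)(void *ctx, uint32_t timeout_s);  /* start the countdown */
    int (*tickle)(void *ctx);                   /* restart the countdown */
    int (*disarm)(void *ctx);                   /* stop the countdown */
} deadman_ops_t;

typedef struct deadman {
    const deadman_ops_t *ops;
    void *ctx;
    bool armed;
} deadman_t;

/* Software fallback: a counter stands in for the system timer. */
typedef struct soft_deadman {
    uint64_t now_s, deadline_s;
    uint32_t timeout_s;
    bool running;
} soft_deadman_t;

static int soft_arm(void *ctx, uint32_t timeout_s)
{
    soft_deadman_t *sd = ctx;
    sd->timeout_s = timeout_s;
    sd->deadline_s = sd->now_s + timeout_s;
    sd->running = true;
    return 0;
}

static int soft_tickle(void *ctx)
{
    soft_deadman_t *sd = ctx;
    sd->deadline_s = sd->now_s + sd->timeout_s;
    return 0;
}

static int soft_disarm(void *ctx)
{
    ((soft_deadman_t *)ctx)->running = false;
    return 0;
}

const deadman_ops_t soft_deadman_ops = { "soft", soft_arm, soft_tickle, soft_disarm };

/* Would be driven by the timer; returns true once the deadline has passed. */
bool soft_deadman_tick(soft_deadman_t *sd, uint64_t elapsed_s)
{
    sd->now_s += elapsed_s;
    return sd->running && sd->now_s >= sd->deadline_s;
}

/* Commands a single ioctl-style entry point might accept. */
enum deadman_cmd { DEADMAN_ARM, DEADMAN_TICKLE, DEADMAN_DISARM };

/* Returns 0 on success, -1 on a bad request or backend failure. */
int deadman_ctl(deadman_t *dm, enum deadman_cmd cmd, uint32_t arg)
{
    assert(dm != NULL && dm->ops != NULL);
    switch (cmd) {
    case DEADMAN_ARM:
        if (arg == 0 || dm->ops->arm(dm->ctx, arg) != 0)
            return -1;
        dm->armed = true;
        return 0;
    case DEADMAN_TICKLE:
        return dm->armed ? dm->ops->tickle(dm->ctx) : -1;
    case DEADMAN_DISARM:
        if (dm->armed && dm->ops->disarm(dm->ctx) != 0)
            return -1;
        dm->armed = false;
        return 0;
    }
    return -1;
}
```

A hardware backend would supply its own deadman_ops_t that pokes the chipset registers, and the user-space side would only ever see the three commands.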

	- Garrett
> 
> //Jim