[OpenIndiana-discuss] OI Crash

Sašo Kiselkov skiselkov.ml at gmail.com
Sat Jan 19 01:33:56 UTC 2013


On 01/19/2013 01:53 AM, DormitionSkete at hotmail.com wrote:
> From 1992 to 1998, I used to work at the Denver Museum of Natural History -- now the Denver Museum of Nature and Science.  We had two or three DEC VAXes and an AIX machine there.  It was their policy that once a week we had to power each of the servers all the way down to clear out any memory problems -- or whatever -- as preventive maintenance.
> 
> Since then, I've always had the habit of setting up a cron job to reboot my servers once a week.  It's not as good as a full power down, but it's better than nothing.  And in all these years, I've never had to deal with intermittent problems like this, except for a few brief times when I used Red Hat Linux ten plus years ago.  (I've tried most of Red Hat's versions since 6.2, and RHEL 6 is the first version I've found that runs decent enough on our hardware, and that I'm happy enough with, for us to use.)

Nice anecdote, but I find this kind of policy very strange. Sure,
regular maintenance downtime windows are important, but doing so to preempt
any problems in the OS just seems strange... not to mention that a
powercycle needlessly stresses the electromechanical components of the
server (HDD motors, fans, etc.)

Also, I don't know about VAX, but boot on a typical SPARC machine can
easily take upwards of 10 minutes (or more, depending on the level of
checks you enabled). Sun E10ks were famous for taking over half an hour
to boot (checking all of their complicated hardware took a lot of time).

> So, if you can do it, you might want to try setting up a cron job to reboot your server once a week -- or every night.  I reboot our LTSP thin client server every night just because it gets hit with running lots of desktop applications that I think give it a greater potential for these kinds of memory problems.

How about just killing these apps (e.g. forced logout of users) rather
than rebooting the whole machine? Do you suspect memory problems in the
base OS services?
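
As a rough illustration of the forced-logout idea, here is a minimal Python
sketch. It assumes `ps -e -o pid,user,comm` is available and that the script
runs with enough privileges to signal other users' processes. The helper
names `pids_for_users` and `terminate_sessions` are made up for this example,
and it defaults to a dry run so nothing is killed by accident.

```python
import os
import signal
import subprocess


def pids_for_users(ps_output, users):
    """Return the PIDs in `ps -e -o pid,user,comm` output owned by `users`."""
    wanted = set(users)
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)  # pid, user, command
        # The header row and any malformed lines have no numeric PID.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        if fields[1] in wanted:
            pids.append(int(fields[0]))
    return pids


def terminate_sessions(users, dry_run=True):
    """Send SIGTERM to every process owned by `users` (hypothetical helper).

    Note: some ps implementations truncate long user names, so this
    simple name match is an assumption that may not hold everywhere.
    """
    out = subprocess.run(
        ["ps", "-e", "-o", "pid,user,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    pids = pids_for_users(out, users)
    for pid in pids:
        if dry_run:
            print("would terminate", pid)
        else:
            try:
                os.kill(pid, signal.SIGTERM)
            except ProcessLookupError:
                pass  # process exited between ps and kill
    return pids
```

Calling `terminate_sessions(["alice", "bob"])` (placeholder user names) only
prints what it would do; pass `dry_run=False` to actually send the signals.
SIGTERM gives the applications a chance to exit cleanly, which is the point
of doing this instead of a reboot.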

> On the other hand, we have all of our websites hosted on one of our parishioner's servers -- and he doesn't reboot his machines periodically like I do -- and about every two months, I have to call him up and tell him something is wrong.

I suggest switching hosting providers, as your server admin apparently
has next to no idea of what he's doing. I've been running web servers
for years without any trouble. Only the most drastic changes should
warrant a reboot (e.g. kernel update).

> And he goes and powers down his system -- sometimes he even has to
> unplug it -- and then turns it back on, and everything works again.

What's up with this Windows 95-era powercycling voodoo? You are
obviously dealing with a serious issue and ignoring it.

> I know there are system admins that just love to brag about how great their up-times are on their machines -- but this might just save you a lot of time and grief.

Frequent rebooting and powercycling might have worked for you, but lots
of applications don't allow for that. Don't mistake an admin's pride in
a job well done for bragging.

> Of course, if you're running a real high-volume server, this might not be workable for you; but it only takes 2-5 minutes or so to reboot... Perhaps in the middle of the night you might be able to spare it being down that short time?

This is just plastering over the problem - I've seen plenty of
"solutions" of this kind where the restart frequency of a service slowly
had to increase until it was no longer workable. In general, I'd
recommend doing what you say only as the absolute last option.

> Just a friendly suggestion.
> Shared experience.
> 
> I know others may tell you that this is no longer necessary in these more modern times; but my experience has been otherwise.
> 
> I hope it helps.

When you do encounter these kinds of problems, try and capture a crash
dump, file an Illumos issue and provide as much info on the problem as
possible to help debug it (that's what I recommended to David, he has
yet to respond). Nothing will improve if users keep issues to
themselves. I was dealing with a serious (show-stopper) network load
problem in Illumos a while back, and after a little googling, mailing and
testing I managed to resolve it. Sticking one's head in the sand isn't a
good avenue of progress.

Anyway, just my two cents...

Cheers,
--
Saso


