[OpenIndiana-discuss] Server hangs weekly

Milan Jurik milan.jurik at xylab.cz
Fri Feb 24 13:45:03 UTC 2012


Hi,

On 23.02.2012 06:37, oimltalk at skidde.net wrote:
> Interesting. I have 8GB of memory, which I thought would be enough for my
> purposes (6 x 2TB drives, RAIDZ-2, no deduping or anything special like
> that). Most of the time the server is idle, but perhaps the regular
> snapshots are causing problems. I'll have to try to track memory usage and
> see what happens.
>

The ZFS ARC is very aggressive in its memory consumption. Together with the
other system caches it can consume most of the free memory, and it is slow
to give that memory back to the system. The system also seems to react very
badly to running out of memory (OOM).
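
To track memory usage over time, something like the following could be run
from cron. This is only a minimal sketch, assuming Python 3.7 or later is
installed and that the job runs as root (mdb -k needs it). The script name
and log path are hypothetical; the commands it calls are kstat -p for the
ARC size, the ::memstat dcmd in kernel mdb, and ps for the process list.

```python
#!/usr/bin/env python3
"""Print a timestamped memory/process snapshot (sketch for a cron job)."""
import subprocess
import time

# (label, argv, text fed to stdin). These commands are assumed to exist on
# an illumos/OpenIndiana host; on other systems they are reported as FAILED.
COMMANDS = [
    ("arc", ["kstat", "-p", "zfs:0:arcstats:size", "zfs:0:arcstats:c_max"], None),
    ("memstat", ["mdb", "-k"], "::memstat\n"),
    ("ps", ["ps", "-e", "-o", "pid,rss,vsz,args"], None),
]


def run(argv, stdin_text=None, timeout=60):
    """Run one command and return its stdout, or a note if it failed or hung."""
    try:
        result = subprocess.run(argv, input=stdin_text, capture_output=True,
                                text=True, timeout=timeout)
        return result.stdout
    except (OSError, subprocess.SubprocessError) as exc:
        return "FAILED: %s\n" % exc


def snapshot(runner=run, now=None):
    """Collect the output of every command under one timestamp header."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    sections = ["=== %s ===" % stamp]
    for label, argv, stdin_text in COMMANDS:
        sections.append("--- %s ---" % label)
        sections.append(runner(argv, stdin_text).rstrip())
    return "\n".join(sections) + "\n"


if __name__ == "__main__":
    # Output goes to stdout so that cron can append it to a log file.
    print(snapshot(), end="")
```

A crontab entry such as the one below (paths are placeholders) would append
a snapshot every five minutes. The minutes are listed explicitly because
older Solaris-style cron does not accept the */5 shorthand.

0,5,10,15,20,25,30,35,40,45,50,55 * * * * /root/memwatch.py >> /var/tmp/memwatch.log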

> As for NWAM, I believe I'm using it? I haven't turned it off, anyway.
>

So that could explain why you lose the network. It could be the NWAM daemon
itself, or some bad behavior in the network stack telling the NWAM daemon
that the network card lost link, after which it is unable to reconfigure the
card. Probably a separate bug.
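
To see when the address is lost relative to the memory pressure, the
ifconfig -a output could be checked the same way from cron. The sketch below
only parses text; the function name is made up for illustration, and the
sample is the failing output quoted later in this thread.

```python
import re

# Interface header line, e.g. "e1000g0: flags=1040843<UP,...,IPv4> mtu 1500 index 2"
HEADER = re.compile(r"^(\S+): flags=\w+<([^>]*)>")
# IPv4 address line; "inet6" lines do not match because of the space after "inet"
INET = re.compile(r"^\s+inet (\S+)")


def unconfigured_interfaces(ifconfig_output):
    """Return names of non-loopback interfaces whose IPv4 address is 0.0.0.0."""
    bad = []
    current = None
    flags = []
    for line in ifconfig_output.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            flags = header.group(2).split(",")
            continue
        inet = INET.match(line)
        if inet and current and "LOOPBACK" not in flags:
            if inet.group(1) == "0.0.0.0":
                bad.append(current)
    return bad


SAMPLE = """\
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
e1000g0: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 2
        inet 0.0.0.0 netmask ff000000
"""

print(unconfigured_interfaces(SAMPLE))
```

Logging the result next to the memory snapshots would show whether the
interface loses its address before or after memory runs short.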

Best regards,

Milan

> Thanks for the feedback.
>
> On Wed, Feb 22, 2012 at 5:27 PM, Daniel Kjar <dkjar at elmira.edu> wrote:
>
>> I have this problem on a system that I was using to back up 50 GB of
>> material each night.  It would transfer that across the network in zfs,
>> and that would kill it, but it would only happen after a week or so of
>> nightly updates of roughly the same size.  This machine has 32 GB of RAM,
>> and a cp process would hang and swallow it all, bringing the system to
>> its knees.  I just stopped that big transfer job and called it a night.
>> I am no longer backing up my files to 3 different buildings, but that is
>> better than crashing my sunray server every 5 days.
>>
>>
>> On 2/22/2012 1:48 AM, Milan Jurik wrote:
>>
>>> Hi,
>>>
>>> one of my systems was suffering from very similar symptoms. I had no
>>> chance to debug it much as it was at a remote site in a server house.
>>> But in my case it was lack of memory; the system was under significant
>>> memory pressure. I was unable to reproduce it on the small systems I
>>> have at home. I added some memory and set limits for zones.
>>>
>>> One small suggestion - could you write a small script that dumps memory
>>> info (from kernel mdb) and the list of processes to disk, and run it
>>> from crontab every few minutes? Maybe it will be unable to store data
>>> during the "hang", but at least you could see the trend.
>>>
>>> For lost IP address - are you using NWAM?
>>>
>>> Best regards,
>>>
>>> Milan
>>>
>>> On 22.02.2012 07:32, oimltalk at skidde.net wrote:
>>>
>>>> Hi there,
>>>>
>>>> I'm seeing roughly weekly hangs on a server running OpenIndiana 151a.
>>>> I'm using it primarily as a home fileserver with ZFS.
>>>>
>>>> The exact behavior seems to depend on when I notice it, but essentially
>>>> the server drops off the network and is only variably responsive when I
>>>> try to access the console directly. Sometimes when this happens the
>>>> system doesn't respond at all (e.g., not even to keyboard input). One
>>>> time I was able to interact with the console (after the server had
>>>> disappeared from the network) and tried to see what was going on. Tried
>>>> pinging google.com (unreachable, as expected). Next I tried
>>>> `ifconfig -a` and got this:
>>>>
>>>> lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
>>>>        inet 127.0.0.1 netmask ff000000
>>>> e1000g0: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 2
>>>>        inet 0.0.0.0 netmask ff000000
>>>>
>>>>
>>>> which explains the lack of connectivity. But after it printed that it
>>>> didn't return. The console still echoed my keyboard input (including
>>>> ^C, ^Z, etc.), and there was still output coming from other sources
>>>> (e.g., I have napp-it running regular snapshots, so I saw a notice that
>>>> it had used sudo to run that) but I couldn't get a prompt back. Next I
>>>> tried hitting the power button on the machine and got this:
>>>>
>>>> poweroff: initiated by user on /dev/console
>>>> in.ndpd[994]: phyint_reach_random: SIOCSLIFLNKINFO (interfac e1000g0): Interrupted system call
>>>> bootadm: /boot/solaris/bin/extract_boot_filelist is not owned by 101, skipping
>>>> syncing file systems... done
>>>> WARNING: Power off requested from power button or SC, powering down the system!
>>>>
>>>>
>>>> followed shortly by:
>>>>
>>>> WARNING: Failed to shut down the system!
>>>>
>>>>
>>>> Tried looking through the logs for anything interesting but didn't come
>>>> up with anything, though to be honest I'm not 100% sure where to look
>>>> or what to look for. When the machine drops off the network I can still
>>>> access it via IPMI (tried this using both the dedicated jack on the
>>>> motherboard and by sharing the Intel NIC--worked in both cases, but OI
>>>> was still unresponsive), so I doubt it's a bad NIC. Motherboard is a
>>>> Supermicro X9SCM-F.
>>>>
>>>> I know that at least sometimes the system will stop running even my ZFS
>>>> snapshots via napp-it, since I've come back to a frozen console that
>>>> showed the last snapshot being taken 12+ hours before (they're supposed
>>>> to be taken every 15 minutes). My guess is this is just because it
>>>> takes me longer to notice sometimes--seems like it's hitting a deadlock
>>>> somewhere that eventually grinds everything to a halt (like with the
>>>> ifconfig call above).
>>>>
>>>> Also, FWIW, here's what ifconfig -a gets me when it works correctly
>>>> (MAC address removed, although interestingly it wasn't even printed in
>>>> the output above):
>>>>
>>>> lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
>>>>        inet 127.0.0.1 netmask ff000000
>>>> e1000g0: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4> mtu 1500 index 2
>>>>        inet 192.168.10.10 netmask ffffff00
>>>>        ether [MAC address here]
>>>> lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
>>>>        inet6 ::1/128
>>>> e1000g0: flags=20002004841<UP,RUNNING,MULTICAST,DHCP,IPv6> mtu 1500 index 2
>>>>        inet6 fe80::225:90ff:fe50:2c2a/10
>>>>        ether [MAC address here]
>>>>
>>>>
>>>> Any ideas/suggestions on where to go from here? Thanks in advance.
>>>>
>>>>
>>>


