[OpenIndiana-discuss] NFS exported dataset crashes the system

Paul van der Zwan paulz at vanderzwan.org
Wed Apr 10 17:40:01 UTC 2013


On 10 Apr 2013, at 16:46 , Marcel Telka <marcel at telka.sk> wrote:

> On Wed, Apr 10, 2013 at 04:35:06PM +0200, Paul van der Zwan wrote:
>> 
>> On 9 Apr 2013, at 3:13 , Peter Wood <peterwood.sd at gmail.com> wrote:
>> 
>>> I've asked the ZFS discussion list for help on this but now I have more
>>> information and it looks like a bug in the drivers or something.
>>> 
>>> I have a number of Dell PE R710 and PE 2950 servers running OpenSolaris, OI
>>> 151a and OI 151a.7. All these systems are used as storage servers, clean OS
>>> install, no extra services running. The systems are NFS exporting a lot of
>>> ZFS datasets that are mounted on about ten CentOS-5.9 systems.
>>> 
>>> The above setup has been working for 2+ years with no problem.
>>> 
>>> Recently we bought two Supermicro systems:
>>> Supermicro X9DRH-iF
>>> Xeon E5-2620 @ 2.0 GHz 6-Core
>>> 128GB RAM
>>> LSI SAS9211-8i HBA
>>> 32x 3TB Hitachi HUS723030ALS640, SAS, 7.2K
>>> 
>>> I installed OI 151a.7 on them and started migrating data from the old Dell
>>> servers (zfs send/receive).
>>> 
>>> Things have been working great for about two months until I migrated one
>>> particular directory to one of the new Supermicro systems and after about
>>> two days the system crashed. No network connectivity, black console, no
>>> response to keyboard keys, no activity lights (no error lights either) on
>>> the chassis. The only way out is to hit the reset button. Nothing in the
>>> logs as far as I can tell. Log entries just stop when the system crashes.
>>> 
>>> In the following two months I did a lot of testing and a lot of trips to
>>> the colo in the middle of the night and the observation is that regardless
>>> of the OS everything works on the Dell servers. As soon as I move that
>>> directory to any of the Supermicro servers with OI 151a.7 it will crash
>>> them, anywhere from 2 hours to 5 days later.
>>> 
>>> The Supermicro servers can be idle, exporting nothing, or can be exporting
>>> 15+ other directories with high IOPS and working for months with no
>>> problems, but as soon as I have them export that directory they'll crash in
>>> 5 days at the most.
>>> 
>>> There is only one difference between that directory and all other exported
>>> directories. One of the client systems that mounts it and writes to it is
>>> an old Debian 5.0 system. No idea why that would crash a Supermicro system
>>> but not a Dell system.
>>> 
>>> We worked directly with LSI developers and upgraded the firmware to some
>>> unpublished, prerelease development version to no avail. We disabled all
>>> power saving features and CPU C states in the BIOS and nothing changed.
>>> 
>>> Any idea?
>> 
>> I had a similar kind of problem where a VirtualBox FreeBSD 9.1 VM could hang the server.
>> It had /usr/src and /usr/obj NFS mounted from the OI a7 box it was running on.
>> They are separate NFS-shared datasets in one of my 3 pools.
>> 
>> When I ran a make buildworld in that VM it consistently locked up the OI host: no console access,
>> no network access (not even ping).
>> As a test I switched to NFSv4 instead of NFSv3 and I have not seen a hang since.
>> So it looked like a heavy NFSv3 load was the issue.
> 
> Please try to get a crash dump file while the system is in the hung state.
> I'm interested in analyzing the crash dump file.
> 
> 

When it hung the system would not respond to anything at all.
The only way out I could find was a hard reset or power cycle.

I do have the following in /etc/system:
set snooping=1
set pcplusmp:apic_panic_on_nmi=1
But that did not make a difference.

BTW the hang was/is reproducible: every time I ran a make buildworld inside the VM it would hang.
I have tried a few make buildworlds now that I use NFSv4 and no hangs so far.
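
For anyone who wants to try the same workaround, here is a minimal sketch of
what the /etc/fstab entries in the FreeBSD guest might look like with NFSv4
forced. The host name "oi-host" and the export paths /tank/src and /tank/obj
are placeholders chosen for illustration, not the actual names from this
setup; only the "nfsv4" mount option is the point.

# /etc/fstab in the FreeBSD 9.1 guest (host name and export paths are hypothetical)
# Device             Mountpoint  FStype  Options    Dump  Pass
oi-host:/tank/src    /usr/src    nfs     rw,nfsv4   0     0
oi-host:/tank/obj    /usr/obj    nfs     rw,nfsv4   0     0

Without the nfsv4 option the FreeBSD client uses NFSv3 by default, which is
the configuration that produced the hangs.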

Regards, 

	Paul



