[OpenIndiana-discuss] NFS exported dataset crashes the system
Paul van der Zwan
paulz at vanderzwan.org
Thu Apr 11 16:12:11 UTC 2013
On 11 Apr 2013, at 0:29, Peter Wood <peterwood.sd at gmail.com> wrote:
> On Wed, Apr 10, 2013 at 7:35 AM, Paul van der Zwan <paulz at vanderzwan.org> wrote:
>
>>
>> On 9 Apr 2013, at 3:13, Peter Wood <peterwood.sd at gmail.com> wrote:
>>
>>> I've asked the ZFS discussion list for help on this, but now I have
>>> more information and it looks like a bug in the drivers or something.
>>>
>>> I have a number of Dell PE R710 and PE 2950 servers running
>>> OpenSolaris, OI 151a and OI 151a.7. All these systems are used as
>>> storage servers: clean OS install, no extra services running. The
>>> systems NFS-export a lot of ZFS datasets that are mounted on about
>>> ten CentOS 5.9 systems.
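A minimal sketch of this kind of setup, with hypothetical pool, dataset,
and host names (not from the thread): on OI a ZFS dataset is NFS-exported
through its sharenfs property, and a CentOS 5 client mounts it, NFSv3
being the default there.

    # on the OI storage server: publish the dataset over NFS
    zfs set sharenfs=on tank/export/data
    zfs get sharenfs tank/export/data   # verify the share is active

    # on a CentOS 5.9 client
    mount -t nfs oi-server:/tank/export/data /mnt/data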
>>>
>>> The above setup has been working for 2+ years with no problem.
>>>
>>> Recently we bought two Supermicro systems:
>>> Supermicro X9DRH-iF
>>> Xeon E5-2620 @ 2.0 GHz 6-Core
>>> 128GB RAM
>>> LSI SAS9211-8i HBA
>>> 32x 3TB Hitachi HUS723030ALS640, SAS, 7.2K
>>>
>>> I installed OI 151a.7 on them and started migrating data from the old
>>> Dell servers (zfs send/receive).
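A hedged sketch of what such a zfs send/receive migration typically looks
like, again with hypothetical dataset and host names: snapshot the dataset
on the old box and stream it over ssh to the new one.

    # on the old Dell server
    zfs snapshot tank/export/data@migrate
    zfs send tank/export/data@migrate | \
        ssh new-supermicro zfs receive -F tank/export/data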
>>>
>>> Things worked great for about two months, until I migrated one
>>> particular directory to one of the new Supermicro systems; after about
>>> two days the system crashed. No network connectivity, black console, no
>>> response to keyboard input, no activity lights (no error lights either)
>>> on the chassis. The only way out is to hit the reset button. Nothing in
>>> the logs as far as I can tell; log entries just stop when the system
>>> crashes.
>>>
>>> In the following two months I did a lot of testing and made a lot of
>>> trips to the colo in the middle of the night, and the observation is
>>> that, regardless of the OS, everything works on the Dell servers. As
>>> soon as I move that directory to any of the Supermicro servers with
>>> OI 151a.7, it will crash them within anywhere from 2 hours to 5 days.
>>>
>>> The Supermicro servers can be idle, exporting nothing, or can be
>>> exporting 15+ other directories with high IOPS and working for months
>>> with no problems, but as soon as I have them export that directory
>>> they'll crash within 5 days at the most.
>>>
>>> There is only one difference between that directory and all other
>>> exported directories: one of the client systems that mounts it and
>>> writes to it is an old Debian 5.0 system. No idea why that would crash
>>> a Supermicro system but not a Dell system.
>>>
>>> We worked directly with LSI developers and upgraded the firmware to an
>>> unpublished, prerelease development version, to no avail. We disabled
>>> all power-saving features and CPU C-states in the BIOS; nothing changed.
>>>
>>> Any idea?
>>
>> I had a similar kind of problem, where a VirtualBox FreeBSD 9.1 VM could
>> hang the server.
>> It had /usr/src and /usr/obj NFS-mounted from the OI 151a.7 box it was
>> running on.
>> They are separate NFS-shared datasets in one of my 3 pools.
>>
>> When I ran a make buildworld in that VM, it consistently locked up the
>> OI host: no console access, no network access (not even ping).
>> As a test I switched to NFSv4 instead of NFSv3, and I have not seen a
>> hang since.
>> So it looked like a heavy NFSv3 load was the issue.
>>
>> Paul
>>
>>
> Makes sense. I haven't tried that.
>
> If I'm correct, ZFS on OI supports NFSv2, 3, and 4.
>
> By switching to NFSv4, you mean that on your client machine (the FreeBSD
> VM) you set up the NFS client to use the NFSv4 protocol. Do I understand
> this correctly? Or did you do something on the OI server to accept only
> NFSv4 connections?
>
> Could you please give more information?
I haven't changed the server, only the mount options on the client.
Its /etc/fstab now has:
192.168.178.24:/data/ports /usr/ports nfs rw,nfsv4 - -
192.168.178.24:/data/src /usr/src nfs rw,nfsv4 - -
192.168.178.24:/data/obj /usr/obj nfs rw,nfsv4 - -
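For reference, the server-side alternative asked about above (not what was
done here, just a sketch): on OI the NFS server's version range is a
sharectl property, so the server itself could be limited to v4-only with:

    # on the OI server: refuse NFSv3 and older clients
    sharectl set -p server_versmin=4 nfs
    sharectl get nfs    # shows the current version limits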
A make buildworld does seem to take quite a bit longer than it did with
NFSv3, so it might just be a case of a lighter load. I have no hard data,
but it feels like it takes twice as long.
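One way to check both the negotiated version and the relative load,
sketched on the assumption that the client is FreeBSD (where nfsstat -m,
if supported by your release, reports the effective mount options) and the
server is OI:

    # on the FreeBSD client: confirm the mount options/version in effect
    nfsstat -m

    # on the OI server: per-version server call counts, to compare the
    # NFSv3 vs NFSv4 load
    nfsstat -s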
Paul