[OpenIndiana-discuss] NFS exported dataset crashes the system

Alberto Picón Couselo alpicon1 at gmail.com
Wed Apr 24 18:58:51 UTC 2013


I can confirm that we have disabled all power-saving features on the 
boxes. However, I can't be certain that CPU C-states are totally disabled.

Anyway, we have switched to NFSv4 to test system stability. The PHP 
process reads a folder with a huge number of hashed files and folders, 
creates a tarball, and deletes the copy afterwards. As you comment, we 
think it could be some kind of locking/high-I/O NFSv3-related issue...
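
In case it helps to compare, the change on the Debian clients was 
essentially just the mount type; a minimal sketch of the /etc/fstab 
change (server name and path are illustrative, and we assume the default 
NFSv4 pseudo-root on the OpenIndiana side, so the export path stays the 
same):

   # old NFSv3 mount
   #oi-server:/tank/export/data   /mnt/data   nfs    rw,hard   0   0
   # new NFSv4 mount
   oi-server:/tank/export/data    /mnt/data   nfs4   rw,hard   0   0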

If we create local users in /etc/passwd and /etc/group, can you please 
tell us how to refresh the NFSv4 server so it updates its user mapping 
table on OpenIndiana? How do you handle this? If we restart the NFS 
service on OpenIndiana (e.g. /etc/init.d/nfs restart), will NFSv4 clients 
reconnect, or will they end up in an unstable state?
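
For reference, what we had in mind on the OpenIndiana side is something 
along these lines (assuming the standard SMF service names; please 
correct us if the mapid service is not the right one to poke after 
editing the local files):

   # re-read local users/groups for NFSv4 id mapping
   svcadm restart svc:/network/nfs/mapid:default

   # SMF equivalent of a full NFS server restart
   svcadm restart svc:/network/nfs/server:default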

Thank you very much in advance,

On 24/04/2013 20:11, Peter Wood wrote:
> The first thing I'd do is go into the BIOS and disable CPU C-states and 
> all power-saving features. If that doesn't help, then try NFSv4.
>
> The reason I disable CPU C-states is previous experience with 
> OpenSolaris on Dell boxes about two years ago; it would crash the 
> system in a similar fashion. There are multiple reports on the Internet 
> about this, and that solution definitely worked for us. To be on the 
> safe side I do the same on the Supermicro boxes.
>
> We switched to NFSv4 about two days ago and so far no crash. I'll be 
> more confident that this is the fix for us after running for at least 
> 5 days with no crash.
>
> I wish I had the resources to do more tests. Unfortunately, all I can 
> tell right now is that the crashes are happening on Supermicro hardware 
> but not on Dell, and that the trigger is exporting one particular 
> directory via NFSv3. I don't think it is the high IOPS; more likely it 
> is related to the way the directory is used. We move files and 
> directories around and re-point symlinks while everything is being 
> accessed from the clients, and we do this every 15 minutes.
> Something like: mv nfsdir/targetdir nfsdir/targetdir.old; mv 
> nfsdir/targetdir.new nfsdir/targetdir.
>
> To me it looks more like a locking issue than a high-IOPS issue.
>
>
> On Tue, Apr 23, 2013 at 11:26 PM, Alberto Picón Couselo 
> <alpicon1 at gmail.com> wrote:
>
>     Hi.
>
>     We have almost the same hardware as yours and we had the same
>     issue. We exported a ZFS pool to three Debian 6.0 Xen VMs, mounted
>     using NFSv3. When one of these boxes launches a high-I/O PHP
>     process to create a backup, it creates, copies and deletes a large
>     number of files. The box crashed the same way as yours, during the
>     deletion process: no ping, no log, no response at all. We had to do
>     a cold restart, unplugging the system power cords...
>
>     We have switched to NFSv4 hoping to fix this issue. Could you
>     please comment on your results regarding this issue?
>
>     Any help would be greatly appreciated.
>
>     Best Regards,
>
>          I've asked the ZFS discussion list for help on this, but now I
>          have more information and it looks like a bug in the drivers
>          or something.
>
>          I have a number of Dell PE R710 and PE 2950 servers running
>          OpenSolaris, OI 151a and OI 151a.7. All these systems are used
>          as storage servers: clean OS install, no extra services
>          running. The systems NFS-export a lot of ZFS datasets that are
>          mounted on about ten CentOS 5.9 systems.
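>
>          (For reference, exporting a dataset on these boxes is
>          typically just the sharenfs property; an illustrative example
>          with made-up names and networks:
>
>            zfs set sharenfs='rw=@10.0.0.0/24' tank/exports/data
>
>          and the clients then mount it over NFSv3.)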
>
>          The above setup has been working for 2+ years with no problem.
>
>          Recently we bought two Supermicro systems:
>           Supermicro X9DRH-iF
>           Xeon E5-2620 @ 2.0 GHz 6-Core
>           128GB RAM
>           LSI SAS9211-8i HBA
>           32x 3TB Hitachi HUS723030ALS640, SAS, 7.2K
>
>          I installed OI 151a.7 on them and started migrating data from
>          the old Dell servers (zfs send/receive).
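>
>          For what it's worth, the migration was plain snapshot
>          replication, along these lines (pool/dataset names are
>          placeholders):
>
>            zfs snapshot -r tank/data@migrate
>            zfs send -R tank/data@migrate | ssh new-host zfs receive -F tank/data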
>
>          Things had been working great for about two months until I
>          migrated one particular directory to one of the new Supermicro
>          systems, and after about two days the system crashed. No
>          network connectivity, black console, no response to keyboard
>          input, no activity lights (no error lights either) on the
>          chassis. The only way out is to hit the reset button. Nothing
>          in the logs as far as I can tell; log entries just stop when
>          the system crashes.
>
>          In the following two months I did a lot of testing and a lot
>          of trips to the colo in the middle of the night, and the
>          observation is that, regardless of the OS, everything works on
>          the Dell servers. As soon as I move that directory to any of
>          the Supermicro servers with OI 151a.7, it will crash them
>          within anywhere from 2 hours to 5 days.
>
>          The Supermicro servers can be idle, exporting nothing, or can
>          be exporting 15+ other directories with high IOPS and work for
>          months with no problems, but as soon as I have them export
>          that directory they'll crash within 5 days at most.
>
>          There is only one difference between that directory and all
>          the other exported directories: one of the client systems that
>          mounts it and writes to it is an old Debian 5.0 system. No
>          idea why that would crash a Supermicro system but not a Dell
>          system.
>
>          We worked directly with LSI developers and upgraded the
>          firmware to an unpublished, prerelease development version, to
>          no avail. We disabled all power-saving features and CPU
>          C-states in the BIOS and nothing changed.
>
>          Any idea?
>
>     I had a similar kind of problem, where a VirtualBox FreeBSD 9.1 VM
>     could hang the server.
>     It had /usr/src and /usr/obj NFS-mounted from the OI a7 box it was
>     running on.
>     They are separate NFS-shared datasets in one of my 3 pools.
>
>     When I ran a make buildworld in that VM, it consistently locked up
>     the OI host: no console access, no network access (not even ping).
>     As a test I switched to NFSv4 instead of NFSv3, and I have not seen
>     a hang since.
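>
>     For completeness, the change on the FreeBSD guest was essentially
>     just the mount option; a sketch of the fstab entries (the OI host
>     name is a placeholder):
>
>       oi-host:/usr/src  /usr/src  nfs  rw,nfsv4  0  0
>       oi-host:/usr/obj  /usr/obj  nfs  rw,nfsv4  0  0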
>
>

