[OpenIndiana-discuss] [zfs] problem on my zpool

Wed Oct 23 19:00:52 UTC 2013

The first resilver reported 2 faulted drives. I shut down the server 
(directly via the button), and the second resilver completed with no errors:
zpool status
   pool: nas
  state: ONLINE
   scan: resilvered 1.45G in 5h34m with 0 errors on Wed Oct 23 20:18:47 2013
config:

         NAME                         STATE     READ WRITE CKSUM
         nas                          ONLINE       0     0     0
           raidz1-0                   ONLINE       0     0     0
             c8t50024E9004993E6Ed0p0  ONLINE       0     0     0
             c8t50024E92062E7524d0    ONLINE       0     0     0
             c8t50024E900495BE84d0p0  ONLINE       0     0     0
             c8t50014EE25A5EEC23d0p0  ONLINE       0     0     0
             c8t50024E9003F03980d0p0  ONLINE       0     0     0
             c8t50014EE2B0D3EFC8d0    ONLINE       0     0     0
             c8t50014EE6561DDB4Cd0p0  ONLINE       0     0     0
             c8t50024E9003F03A09d0p0  ONLINE       0     0     0
           raidz1-1                   ONLINE       0     0     0
             c50t8d0                  ONLINE       0     0     0
             c2d0                     ONLINE       0     0     0
             c1d0                     ONLINE       0     0     0
             c50t11d0                 ONLINE       0     0     0
             c50t10d0                 ONLINE       0     0     0

errors: No known data errors

Really, really weird.

The server is running a Supermicro motherboard, an Intel i3 CPU and ECC memory. 
I thought using this kind of hardware (mainly the ECC memory) would protect 
me from errors.
Could the problem come from the SAS/SATA controller? I have an IBM M1015 
(SAS, for the first vdev) and an LSI (a cheap one, SATA, for the second).

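For what it's worth, the 133h figure in the status output quoted below looks like plain arithmetic: the bytes left to scan divided by the current scan rate. Here is a minimal Python sketch of that calculation, assuming the size suffixes are binary (1024-based) and that the estimate is a simple remaining/rate extrapolation. The helper names are made up for illustration, and the real estimator in ZFS may well differ.

```python
import re

# Assumption: zpool prints sizes with 1024-based suffixes.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(text):
    """Convert a size such as '22.2T' or '48.6M' to a number of bytes."""
    match = re.fullmatch(r"([\d.]+)([KMGT])", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    value, unit = match.groups()
    return float(value) * UNITS[unit]


def naive_eta_hours(scanned, total, rate_per_second):
    """Extrapolate time remaining from the current scan rate alone."""
    remaining = to_bytes(total) - to_bytes(scanned)
    return remaining / to_bytes(rate_per_second) / 3600


# Figures taken from the "resilver in progress" output in this thread.
eta = naive_eta_hours("2.23G", "22.2T", "48.6M")
print(f"about {eta:.0f} hours to go")
```

With the figures from the thread this lands close to the 133h22m that zpool printed (the displayed sizes are rounded, so it will not match to the minute). The resilver actually finished in 5h34m, so the extrapolation was far off, as Richard notes.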
On 23/10/2013 18:16, Richard Elling wrote:
> On Oct 22, 2013, at 11:46 PM, Clement BRIZARD <clement at brizou.fr> wrote:
>
>> I cleared the "degraded" disk. We will see what happens in 131 hours.
> Yes, clearing is the proper procedure.
> The predicted time to complete is usually wildly inaccurate until you get near the end
> of resilvering or scrubbing. The estimated time remaining is based on bandwidth, but
> the workload is limited by IOPS and throttling. If you read a file, it will be checked and
> repaired, if necessary, so you can continue to use the pool as it scans the older data.
>
> As to the root cause, more likely a common, transient fault. Think along the lines of
> power supplies, cables, flaky motherboard, etc. The disks themselves are likely to be
> fine. The original fault might or might not recur.
>   -- richard
>
>>   pool: nas
>> state: ONLINE
>> status: One or more devices is currently being resilvered.  The pool will
>> 	continue to function, possibly in a degraded state.
>> action: Wait for the resilver to complete.
>>   scan: resilver in progress since Wed Oct 23 08:25:56 2013
>>     2.23G scanned out of 22.2T at 48.6M/s, 133h22m to go
>>     6.10M resilvered, 0.01% done
>> config:
>>
>> 	NAME                         STATE     READ WRITE CKSUM     CAP            Product
>> 	nas                          ONLINE       0     0     0
>> 	  raidz1-0                   ONLINE       0     0     0
>> 	    c8t50024E9004993E6Ed0p0  ONLINE       0     0     0     2 TB           SAMSUNG HD204UI
>> 	    c8t50024E92062E7524d0    ONLINE       0     0     0     2 TB           SAMSUNG HD204UI
>> 	    c8t50024E900495BE84d0p0  ONLINE       0     0     0     2 TB           SAMSUNG HD204UI
>> 	    c8t50014EE25A5EEC23d0p0  ONLINE       0     0     0     2 TB           WDC WD20EARS-00M
>> 	    c8t50024E9003F03980d0p0  ONLINE       0     0     0     2 TB           SAMSUNG HD204UI
>> 	    c8t50014EE2B0D3EFC8d0    ONLINE       0     0     0     2 TB           WDC WD20EARX-00P
>> 	    c8t50014EE6561DDB4Cd0p0  ONLINE       0     0     0     2 TB           WDC WD20EARS-00M
>> 	    c8t50024E9003F03A09d0p0  ONLINE       0     0     0     2 TB           SAMSUNG HD204UI
>> 	  raidz1-1                   ONLINE       0     0     0
>> 	    c50t8d0                  ONLINE       0     0     0  (resilvering)     2 TB           ST2000DL004 HD20
>> 	    c2d0                     ONLINE       0     0     0  (resilvering)     2 TB
>> 	    c1d0                     ONLINE       0     0     0  (resilvering)     2 TB
>> 	    c50t11d0                 ONLINE       0     0     0     2 TB           SAMSUNG HD204UI
>> 	    c50t10d0                 ONLINE       0     0     0  (resilvering)     2 TB           SAMSUNG HD204UI
>>
>>
>>
>>
>> On 23/10/2013 08:43, Clement BRIZARD wrote:
>>> I woke up this morning and saw your messages. Unfortunately I had to reboot; the server had completely frozen.
>>> Now I have this:
>>>
>>>   pool: nas
>>> state: DEGRADED
>>> status: One or more devices is currently being resilvered.  The pool will
>>>     continue to function, possibly in a degraded state.
>>> action: Wait for the resilver to complete.
>>>   scan: resilver in progress since Wed Oct 23 08:19:42 2013
>>>     5.81G scanned out of 22.2T at 49.2M/s, 131h43m to go
>>>     15.6M resilvered, 0.03% done
>>> config:
>>>
>>>     NAME                         STATE     READ WRITE CKSUM
>>>     nas                          DEGRADED     0     0     0
>>>       raidz1-0                   DEGRADED     0     0     0
>>>         c8t50024E9004993E6Ed0p0  ONLINE       0     0     0
>>>         c8t50024E92062E7524d0    ONLINE       0     0     0
>>>         c8t50024E900495BE84d0p0  ONLINE       0     0     0
>>>         c8t50014EE25A5EEC23d0p0  ONLINE       0     0     0
>>>         c8t50024E9003F03980d0p0  ONLINE       0     0     0
>>>         c8t50014EE2B0D3EFC8d0    ONLINE       0     0     0
>>>         c8t50014EE6561DDB4Cd0p0  DEGRADED     0     0     0  too many errors
>>>         c8t50024E9003F03A09d0p0  ONLINE       0     0     0
>>>       raidz1-1                   ONLINE       0     0     0
>>>         c50t8d0                  ONLINE       0     0     0 (resilvering)
>>>         c2d0                     ONLINE       0     0     0 (resilvering)
>>>         c1d0                     ONLINE       0     0     0 (resilvering)
>>>         c50t11d0                 ONLINE       0     0     0
>>>         c50t10d0                 ONLINE       0     0     0 (resilvering)
>>>
>>>
>>>
>>>
>>>
>>> On 23/10/2013 08:00, Jason Matthews wrote:
>>>> First, don't reboot. If you do, you might not be able to remount the pool. The data you see is from the disks that are still functioning. Listing the files and copying complete files are two different things. If you don't have a backup, you may need to copy whatever partial data you can from the broken pool.
>>>>
>>>> Now let's start by getting the disks back in good shape.
>>>>
>>>> Clear the degraded disk (zpool clear takes the pool name before the device):
>>>> zpool clear nas c8t50014EE6561DDB4Cd0p0
>>>>
>>>> Reseat the missing disks in the hope that they come back, then clear them.
>>>>
>>>> Check cfgadm -al and make sure they are connected and configured.
>>>>
>>>> When you reseat them, check the messages (or dmesg) to see whether the system notices the re-insertion. If it does see the disk, clear the disks in the pool in an effort to bring the pool back to an operational state.
>>>>
>>>> Sent from Jasons' hand held
>>>>
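The steps above depend on knowing which devices need a clear or a reseat. As a rough illustration, the config table printed by zpool status can be filtered for rows that are not ONLINE or that carry non-zero error counters. This is only a sketch: the function name `problem_rows` is made up, the sample is a trimmed excerpt of the output quoted in this thread, and the column layout is assumed to match that output.

```python
def problem_rows(status_text):
    """Return (name, state) for config rows that are not clean and ONLINE."""
    flagged = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not in_config:
            # The table starts right after the NAME/STATE header row.
            in_config = fields[:2] == ["NAME", "STATE"]
            continue
        if not fields:
            break  # a blank line ends the config table
        if len(fields) < 5:
            continue
        name, state = fields[0], fields[1]
        # Any counter that is not a plain "0" is treated as non-zero.
        has_errors = any(counter != "0" for counter in fields[2:5])
        if state != "ONLINE" or has_errors:
            flagged.append((name, state))
    return flagged


SAMPLE = """
    NAME                         STATE     READ WRITE CKSUM
    nas                          UNAVAIL     63     2     0 insufficient replicas
      raidz1-0                   DEGRADED     0     0     0
        c8t50024E9004993E6Ed0p0  ONLINE       0     0     0
        c8t50014EE6561DDB4Cd0p0  DEGRADED     0     0   211 too many errors
      raidz1-1                   UNAVAIL    131     9     0 insufficient replicas
        c50t8d0                  REMOVED      0     0     0 (repairing)
        c2d0                     ONLINE       0     0     0 (repairing)

errors: 10972861 data errors, use '-v' for a list
"""

for name, state in problem_rows(SAMPLE):
    print(f"{name}: {state}")
```

Pool and vdev rows are flagged along with the disks, since their state reflects the devices beneath them; the leaf devices (the DEGRADED and REMOVED disks) are the ones to clear or reseat.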
>>>> On Oct 22, 2013, at 5:04 PM, Clement BRIZARD <clement at brizou.fr> wrote:
>>>>
>>>>> Hello everybody,
>>>>> I have a problem with my pool. Lately I have had some slowdowns on the NFS share of my ZFS pool. A weekly scrub began and is still running, but it worries me. It currently returns this:
>>>>>
>>>>>   pool: nas
>>>>> state: UNAVAIL
>>>>> status: One or more devices are faulted in response to IO failures.
>>>>> action: Make sure the affected devices are connected, then run 'zpool clear'.
>>>>>    see: http://illumos.org/msg/ZFS-8000-HC
>>>>>   scan: scrub in progress since Sun Oct 20 19:29:23 2013
>>>>>     15.2T scanned out of 22.2T at 84.0M/s, 24h5m to go
>>>>>     1.29G repaired, 68.67% done
>>>>> config:
>>>>>
>>>>>     NAME                         STATE     READ WRITE CKSUM
>>>>>     nas                          UNAVAIL     63     2     0 insufficient replicas
>>>>>       raidz1-0                   DEGRADED     0     0     0
>>>>>         c8t50024E9004993E6Ed0p0  ONLINE       0     0     0
>>>>>         c8t50024E92062E7524d0    ONLINE       0     0     0
>>>>>         c8t50024E900495BE84d0p0  ONLINE       0     0     0
>>>>>         c8t50014EE25A5EEC23d0p0  ONLINE       0     0     0
>>>>>         c8t50024E9003F03980d0p0  ONLINE       0     0     1 (repairing)
>>>>>         c8t50014EE2B0D3EFC8d0    ONLINE       0     0     0
>>>>>         c8t50014EE6561DDB4Cd0p0  DEGRADED     0     0   211 too many errors  (repairing)
>>>>>         c8t50024E9003F03A09d0p0  ONLINE       0     0    18 (repairing)
>>>>>       raidz1-1                   UNAVAIL    131     9     0 insufficient replicas
>>>>>         c50t8d0                  REMOVED      0     0     0 (repairing)
>>>>>         c2d0                     ONLINE       0     0     0 (repairing)
>>>>>         c1d0                     ONLINE       0     0     0 (repairing)
>>>>>         c50t11d0                 ONLINE       0     0     0 (repairing)
>>>>>         c50t10d0                 REMOVED      0     0     0
>>>>>
>>>>> errors: 10972861 data errors, use '-v' for a list
>>>>>
>>>>>
>>>>> Really weird: I haven't disconnected any disk. For several hours, even though it said the pool was unavailable, I could still browse it via NFS. I can't anymore.
>>>>>
>>>>>
>>>>> What do you think I should do?
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> OpenIndiana-discuss mailing list
>>>>> OpenIndiana-discuss at openindiana.org
>>>>> http://openindiana.org/mailman/listinfo/openindiana-discuss
> --
>
> Richard.Elling at RichardElling.com
> +1-760-896-4422
>
>
>