[OpenIndiana-discuss] Errors without errors

Till Wegmueller toasterson at gmail.com
Thu Aug 5 13:46:08 UTC 2021


Hi Michelle

For user home files you will need a backup anyway. For system 
consistency you can use `pkg fix` to restore the system image to a known 
state in a new boot environment.
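The `pkg fix`-in-a-new-BE route above might look like the sketch below (assumes illumos beadm(1M) and pkg(1); the BE name "fixup" and mountpoint /mnt are illustrative). The commands are printed for review rather than executed, since they must run as root on the affected host:

```shell
plan='beadm create fixup       # clone the current boot environment
beadm mount fixup /mnt         # mount the clone for offline repair
pkg -R /mnt fix                # verify packages, restore damaged files
beadm unmount fixup
beadm activate fixup           # boot the repaired BE on next reboot'
printf '%s\n' "$plan"
```

Repairing a mounted clone rather than the live BE means a bad outcome is recoverable by simply booting the old environment again.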

Greetings
Till

On 05.08.21 05:14, Toomas Soome via openindiana-discuss wrote:
> 
> 
>> On 5. Aug 2021, at 11:11, Michelle <michelle at msknight.com> wrote:
>>
>> I removed the drive in order to take a backup before I start messing
>> around with things, which is why it isn't in the iostat. The backup
>> will probably take until early evening.
>>
>> This is what appeared in messages around that time. It almost looks
>> like whatever happened caused a reboot.
>>
> 
>  From those, I’d say, you need to replace that disk.
> 
> rgds,
> toomas
> 
>>
>> Aug  5 01:55:01 jaguar smbd[601]: [ID 617204 daemon.error] Can't get
>> SID for ID=0 type=1, status=-9977
>> Aug  5 01:58:00 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
>> ahci port 3 has task file error
>> Aug  5 01:58:00 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
>> ahci port 3 is trying to do error recovery
>> Aug  5 01:58:00 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
>> ahci port 3 task_file_status = 0x4041
>> Aug  5 01:58:00 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
>> error recovery for port 3 succeed
>> Aug  5 01:58:09 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
>> ahci port 3 has task file error
>> Aug  5 01:58:09 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
>> ahci port 3 is trying to do error recovery
>> Aug  5 01:58:09 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
>> ahci port 3 task_file_status = 0x4041
>> Aug  5 01:58:09 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
>> error recovery for port 3 succeed
>> Aug  5 02:00:15 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
>> ahci port 3 has task file error
>> Aug  5 02:00:15 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
>> ahci port 3 is trying to do error recovery
>> Aug  5 02:00:15 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
>> ahci port 3 task_file_status = 0x4041
>> Aug  5 02:00:16 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
>> error recovery for port 3 succeed
>> Aug  5 02:00:20 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
>> ahci port 3 has task file error
>> Aug  5 02:00:20 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
>> ahci port 3 is trying to do error recovery
>> Aug  5 02:00:20 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
>> ahci port 3 task_file_status = 0x4041
>> Aug  5 02:00:20 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
>> error recovery for port 3 succeed
>> Aug  5 02:00:24 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
>> ahci port 3 has task file error
>> Aug  5 02:00:24 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
>> ahci port 3 is trying to do error recovery
>> Aug  5 02:00:24 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
>> ahci port 3 task_file_status = 0x4041
>> Aug  5 02:00:24 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
>> error recovery for port 3 succeed
>> Aug  5 02:00:24 jaguar ahci: [ID 811322 kern.info] NOTICE: ahci0:
>> ahci_tran_reset_dport port 3 reset device
>> Aug  5 02:00:29 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
>> ahci port 3 has task file error
>> Aug  5 02:00:29 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
>> ahci port 3 is trying to do error recovery
>> Aug  5 02:00:29 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
>> ahci port 3 task_file_status = 0x4041
>> Aug  5 02:00:29 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
>> error recovery for port 3 succeed
>> Aug  5 02:00:34 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
>> ahci port 3 has task file error
>> Aug  5 02:00:34 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
>> ahci port 3 is trying to do error recovery
>> Aug  5 02:00:34 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
>> ahci port 3 task_file_status = 0x4041
>> Aug  5 02:00:34 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
>> error recovery for port 3 succeed
>> Aug  5 02:00:38 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
>> ahci port 3 has task file error
>> Aug  5 02:00:38 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
>> ahci port 3 is trying to do error recovery
>> Aug  5 02:00:38 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
>> ahci port 3 task_file_status = 0x4041
>> Aug  5 02:00:38 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
>> error recovery for port 3 succeed
>> Aug  5 02:00:53 jaguar fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
>> Aug  5 02:00:53 jaguar EVENT-TIME: Thu Aug  5 02:00:53 UTC 2021
>> Aug  5 02:00:53 jaguar PLATFORM: ProLiant-MicroServer, CSN: 5C7351P4L9,
>> HOSTNAME: jaguar
>> Aug  5 02:00:53 jaguar SOURCE: zfs-diagnosis, REV: 1.0
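The repeated `task_file_status = 0x4041` in the log above can be decoded by hand. Per the AHCI spec, PxTFD packs the ATA error register in bits 15:8 and the ATA status register in bits 7:0 (standard ATA bit meanings assumed here):

```shell
tfd=0x4041
err=$(( (tfd >> 8) & 0xff ))   # 0x40 -> UNC: uncorrectable media error
sts=$((  tfd       & 0xff ))   # 0x41 -> DRDY|ERR: drive ready, command failed
printf 'error=0x%02x status=0x%02x\n' "$err" "$sts"
```

If that reading is right, the drive itself is reporting uncorrectable media errors, which fits the "replace that disk" advice rather than a cabling or controller issue.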
>>
>>
>> On Thu, 2021-08-05 at 11:03 +0300, Toomas Soome via openindiana-discuss
>> wrote:
>>>> On 5. Aug 2021, at 10:52, Michelle <michelle at msknight.com> wrote:
>>>>
>>>> Thanks for this. So I'm possibly better off rolling back the OS
>>>> snapshot after my backup has finished?
>>>
>>> maybe, maybe not. first of all, I have no idea to what point the
>>> rollback would be.
>>>
>>> secondly, the system has seen some errors; the trouble is, the fault
>>> does not tell us whether those were checksum errors or something
>>> else, and it seems to me it is something else.
>>>
>>> and this is why: if you look at your zpool output, you see a report
>>> about c6t3d0, but the iostat -En below does not include c6t3d0. It
>>> seems to be missing.
>>>
>>> what do you get from 'iostat -En c6t3d0'?
>>>
>>> Also, it would be a good idea to check /var/adm/messages: are there
>>> any SATA or IO related messages around August 05, 02:00?
>>>
>>> FMA has definitely recorded an issue about the pool, so there must
>>> be something going on.
>>>
>>> rgds,
>>> toomas
>>>
>>>> I have removed the drive for the moment, and am running a backup.
>>>> Just
>>>> in case :-)
>>>>
>>>> mich at jaguar:~$ iostat -En
>>>> c5d1             Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
>>>> Model: INTEL SSDSA2M04 Revision:  Serial No: CVGB949301PC040
>>>> Size: 40.02GB <40019116032 bytes>
>>>> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
>>>> Illegal Request: 0
>>>> c6t1d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
>>>> Vendor: ATA      Product: WDC WD40EZRZ-00G Revision: 0A80 Serial No: WD-WCC7K5UK24LJ
>>>> Size: 4000.79GB <4000787030016 bytes>
>>>> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
>>>> Illegal Request: 0 Predictive Failure Analysis: 0
>>>> c6t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
>>>> Vendor: ATA      Product: WDC WD60EFRX-68L Revision: 0A82 Serial No: WD-WX21DA84EH0F
>>>> Size: 6001.18GB <6001175126016 bytes>
>>>> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
>>>> Illegal Request: 0 Predictive Failure Analysis: 0
>>>> c6t2d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
>>>> Vendor: ATA      Product: WDC WD60EFRX-68L Revision: 0A82 Serial No: WD-WX51DB880RJ4
>>>> Size: 6001.18GB <6001175126016 bytes>
>>>> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
>>>> Illegal Request: 0 Predictive Failure Analysis: 0
>>>>
>>>>
>>>> --------------- ------------------------------------  --------------  --------
>>>> TIME            EVENT-ID                              MSG-ID          SEVERITY
>>>> --------------- ------------------------------------  --------------  --------
>>>> Aug 05 02:00:53 c5934fd6-5f4b-409e-b0f8-8f44ea8f99c4  ZFS-8000-FD     Major
>>>>
>>>> Host        : jaguar
>>>> Platform    : ProLiant-MicroServer      Chassis_id  : 5C7351P4L9
>>>> Product_sn  :
>>>>
>>>> Fault class : fault.fs.zfs.vdev.io
>>>> Affects     : zfs://pool=jaguar/vdev=740c01ae0d3c3109
>>>>                  faulted and taken out of service
>>>> Problem in  : zfs://pool=jaguar/vdev=740c01ae0d3c3109
>>>>                  faulted and taken out of service
>>>>
>>>> Description : The number of I/O errors associated with a ZFS device
>>>>               exceeded acceptable levels.  Refer to
>>>>               http://illumos.org/msg/ZFS-8000-FD for more information.
>>>>
>>>> Response    : The device has been offlined and marked as faulted.  An
>>>>               attempt will be made to activate a hot spare if available.
>>>>
>>>> Impact      : Fault tolerance of the pool may be compromised.
>>>>
>>>> Action      : Run 'zpool status -x' and replace the bad device.
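The "replace the bad device" action above, assuming the new disk goes into the same slot (so it keeps the name c6t3d0), might look like this sketch; the commands are printed for review rather than executed, since they need the actual pool:

```shell
plan='zpool status -x            # confirm which pool/vdev is faulted
zpool replace jaguar c6t3d0      # resilver onto the new disk in the same slot
zpool status jaguar              # watch resilver progress'
printf '%s\n' "$plan"
```

With the single-device form, `zpool replace` treats the disk now at that location as the replacement; if the new disk shows up under a different name, it would be passed as a second argument instead.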
>>>>
>>>>
>>>>
>>>> On Thu, 2021-08-05 at 10:22 +0300, Toomas Soome via openindiana-
>>>> discuss
>>>> wrote:
>>>>>> On 5. Aug 2021, at 09:35, Michelle <michelle at msknight.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Folks,
>>>>>>
>>>>>> About a month ago I updated my Hipster...
>>>>>> SunOS jaguar 5.11 illumos-ca706442e6 i86pc i386 i86pc
>>>>>>
>>>>>> This morning it was absolutely crawling. Couldn't even connect via
>>>>>> SSH and had to bounce the box.
>>>>>>
>>>>>> It was reporting a drive as faulted, but didn't give any numbers...
>>>>>> everything was 0. I'm now not sure what happened and whether the
>>>>>> drive is good, or whether I should roll back the OS.
>>>>>>
>>>>>> (and the drive, a WD Red 6TB (not shingled), went out of warranty
>>>>>> a week ago. How about that, eh?)
>>>>>>
>>>>>> Grateful for any opinions please.
>>>>>>
>>>>>> Thu  5 Aug 04:00:01 UTC 2021
>>>>>> NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH    ALTROOT
>>>>>> lion  5.45T  5.28T   176G        -         -     4%    96%  1.00x  DEGRADED  -
>>>>>> pool: jaguar
>>>>>> state: DEGRADED
>>>>>> status: One or more devices are faulted in response to persistent errors.
>>>>>> 	Sufficient replicas exist for the pool to continue functioning in a
>>>>>> 	degraded state.
>>>>>> action: Replace the faulted device, or use 'zpool clear' to mark the
>>>>>> 	device repaired.
>>>>>> scan: scrub in progress since Thu Aug  5 00:00:00 2021
>>>>>> 	6.00T scanned at 428M/s, 5.02T issued at 358M/s, 7.90T total
>>>>>> 	1M repaired, 63.59% done, 0 days 02:20:17 to go
>>>>>> config:
>>>>>> 	NAME        STATE     READ WRITE CKSUM
>>>>>> 	jaguar      DEGRADED     0     0     0
>>>>>> 	  raidz1-0  DEGRADED     0     0     0
>>>>>> 	    c6t0d0  ONLINE       0     0     0
>>>>>> 	    c6t2d0  ONLINE       0     0     0
>>>>>> 	    c6t3d0  FAULTED      0     0     0  too many errors  (repairing)
>>>>>>
>>>>>
>>>>> Can you post output from:
>>>>> iostat -En
>>>>> fmadm faulty
>>>>>
>>>>> in any case, there is definitely a bug in error reporting: counters
>>>>> are zero while "too many errors" is reported.
>>>>>
>>>>> rgds,
>>>>> toomas
>>>>> _______________________________________________
>>>>> openindiana-discuss mailing list
>>>>> openindiana-discuss at openindiana.org
>>>>> https://openindiana.org/mailman/listinfo/openindiana-discuss
>>>>
>>>
>>
>>
> 
> 
> 


