[OpenIndiana-discuss] Errors without errors

Toomas Soome tsoome at me.com
Thu Aug 5 08:14:52 UTC 2021



> On 5. Aug 2021, at 11:11, Michelle <michelle at msknight.com> wrote:
> 
> I removed the drive in order to take a backup before I start messing
> around with things, which is why it isn't in the iostat. The backup
> will probably take until early evening.
> 
> This is what happened from messages around that time. Almost looks like
> whatever happened, it rebooted.
> 

From those, I’d say you need to replace that disk.
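The repeated task_file_status = 0x4041 lines in the log below can be decoded by hand. Here is a minimal sketch, assuming the standard AHCI PxTFD layout (bits 7:0 are the ATA status register, bits 15:8 the ATA error register); the UNC reading of error bit 0x40 is the usual ATA meaning:

```shell
# Sketch: split an AHCI PxTFD value such as 0x4041 into the ATA
# status (bits 7:0) and error (bits 15:8) registers.
tfd=0x4041
status=$(( tfd & 0xff ))        # 0x41 = DRDY|ERR
err=$(( (tfd >> 8) & 0xff ))    # 0x40 = UNC (uncorrectable media error)
printf 'status=0x%02x error=0x%02x\n' "$status" "$err"
# prints: status=0x41 error=0x40
```

Status 0x41 (DRDY|ERR) with error 0x40 (UNC) would point at unreadable sectors on the device itself rather than a cabling or controller problem, which fits replacing the disk.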

rgds,
toomas
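To gauge how often the disk is faulting, the messages file can be summarized per port. A rough sketch, assuming the log line format shown in this thread (on the live system you would point it at /var/adm/messages; the sample file here is synthetic):

```shell
# Sketch: count "task file error" events per ahci port in a messages file.
count_tfe() {
  awk '/has task file error/ {
         for (i = 1; i <= NF; i++) if ($i == "port") cnt[$(i+1)]++
       }
       END { for (p in cnt) print "port " p ": " cnt[p] }' "$1"
}

# Synthetic sample standing in for /var/adm/messages:
sample=$(mktemp)
cat > "$sample" <<'EOF'
Aug  5 01:58:00 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
Aug  5 02:00:15 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
EOF
count_tfe "$sample"    # prints: port 3: 2
rm -f "$sample"
```

A steadily climbing count on a single port, as in the excerpt below, is another sign the problem follows one device.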

> 
> Aug  5 01:55:01 jaguar smbd[601]: [ID 617204 daemon.error] Can't get
> SID for ID=0 type=1, status=-9977
> Aug  5 01:58:00 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
> ahci port 3 has task file error
> Aug  5 01:58:00 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
> ahci port 3 is trying to do error recovery
> Aug  5 01:58:00 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
> ahci port 3 task_file_status = 0x4041
> Aug  5 01:58:00 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
> error recovery for port 3 succeed
> Aug  5 01:58:09 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
> ahci port 3 has task file error
> Aug  5 01:58:09 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
> ahci port 3 is trying to do error recovery
> Aug  5 01:58:09 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
> ahci port 3 task_file_status = 0x4041
> Aug  5 01:58:09 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
> error recovery for port 3 succeed
> Aug  5 02:00:15 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
> ahci port 3 has task file error
> Aug  5 02:00:15 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
> ahci port 3 is trying to do error recovery
> Aug  5 02:00:15 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
> ahci port 3 task_file_status = 0x4041
> Aug  5 02:00:16 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
> error recovery for port 3 succeed
> Aug  5 02:00:20 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
> ahci port 3 has task file error
> Aug  5 02:00:20 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
> ahci port 3 is trying to do error recovery
> Aug  5 02:00:20 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
> ahci port 3 task_file_status = 0x4041
> Aug  5 02:00:20 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
> error recovery for port 3 succeed
> Aug  5 02:00:24 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
> ahci port 3 has task file error
> Aug  5 02:00:24 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
> ahci port 3 is trying to do error recovery
> Aug  5 02:00:24 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
> ahci port 3 task_file_status = 0x4041
> Aug  5 02:00:24 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
> error recovery for port 3 succeed
> Aug  5 02:00:24 jaguar ahci: [ID 811322 kern.info] NOTICE: ahci0:
> ahci_tran_reset_dport port 3 reset device
> Aug  5 02:00:29 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
> ahci port 3 has task file error
> Aug  5 02:00:29 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
> ahci port 3 is trying to do error recovery
> Aug  5 02:00:29 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
> ahci port 3 task_file_status = 0x4041
> Aug  5 02:00:29 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
> error recovery for port 3 succeed
> Aug  5 02:00:34 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
> ahci port 3 has task file error
> Aug  5 02:00:34 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
> ahci port 3 is trying to do error recovery
> Aug  5 02:00:34 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
> ahci port 3 task_file_status = 0x4041
> Aug  5 02:00:34 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
> error recovery for port 3 succeed
> Aug  5 02:00:38 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
> ahci port 3 has task file error
> Aug  5 02:00:38 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
> ahci port 3 is trying to do error recovery
> Aug  5 02:00:38 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
> ahci port 3 task_file_status = 0x4041
> Aug  5 02:00:38 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
> error recovery for port 3 succeed
> Aug  5 02:00:53 jaguar fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-
> 8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
> Aug  5 02:00:53 jaguar EVENT-TIME: Thu Aug  5 02:00:53 UTC 2021
> Aug  5 02:00:53 jaguar PLATFORM: ProLiant-MicroServer, CSN: 5C7351P4L9,
> HOSTNAME: jaguar
> Aug  5 02:00:53 jaguar SOURCE: zfs-diagnosis, REV: 1.0
> 
> 
> On Thu, 2021-08-05 at 11:03 +0300, Toomas Soome via openindiana-discuss 
> wrote:
>>> On 5. Aug 2021, at 10:52, Michelle <michelle at msknight.com> wrote:
>>> 
>>> Thanks for this. So I'm possibly better off rolling back the OS
>>> snapshot after my backup has finished?
>> 
>> Maybe, maybe not. First of all, I have no idea what point the
>> rollback would take you back to.
>> 
>> Secondly, the system has seen some errors. The trouble is, at this
>> point the fault does not tell us whether those were checksum errors
>> or something else, and it seems to me it is something else.
>> 
>> And this is why: if you look at your zpool output, you see a report
>> about c6t3d0, but the iostat -En below does not include c6t3d0. It
>> seems to be missing.
>> 
>> What do you get from 'iostat -En c6t3d0'?
>> 
>> Also, it would be a good idea to check /var/adm/messages: are there
>> any SATA or IO related messages around August 5, 02:00?
>> 
>> FMA has definitely recorded an issue with the pool, so there must be
>> something going on.
>> 
>> rgds,
>> toomas
>> 
>>> I have removed the drive for the moment, and am running a backup.
>>> Just in case :-)
>>> 
>>> mich at jaguar:~$ iostat -En
>>> c5d1             Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
>>> Model: INTEL SSDSA2M04 Revision:  Serial No: CVGB949301PC040 
>>> Size: 40.02GB <40019116032 bytes>
>>> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
>>> Illegal Request: 0 
>>> c6t1d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
>>> Vendor: ATA      Product: WDC WD40EZRZ-00G Revision: 0A80 Serial No: WD-WCC7K5UK24LJ 
>>> Size: 4000.79GB <4000787030016 bytes>
>>> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
>>> Illegal Request: 0 Predictive Failure Analysis: 0 
>>> c6t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
>>> Vendor: ATA      Product: WDC WD60EFRX-68L Revision: 0A82 Serial No: WD-WX21DA84EH0F 
>>> Size: 6001.18GB <6001175126016 bytes>
>>> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
>>> Illegal Request: 0 Predictive Failure Analysis: 0 
>>> c6t2d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
>>> Vendor: ATA      Product: WDC WD60EFRX-68L Revision: 0A82 Serial No: WD-WX51DB880RJ4 
>>> Size: 6001.18GB <6001175126016 bytes>
>>> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
>>> Illegal Request: 0 Predictive Failure Analysis: 0 
>>> 
>>> 
>>> --------------- ------------------------------------  -------------
>>> - --
>>> -------
>>> TIME            EVENT-ID                              MSG-
>>> ID         SEVERITY
>>> --------------- ------------------------------------  -------------
>>> - --
>>> -------
>>> Aug 05 02:00:53 c5934fd6-5f4b-409e-b0f8-8f44ea8f99c4  ZFS-8000-
>>> FD    Major     
>>> 
>>> Host        : jaguar
>>> Platform    : ProLiant-MicroServer      Chassis_id  : 5C7351P4L9
>>> Product_sn  : 
>>> 
>>> Fault class : fault.fs.zfs.vdev.io
>>> Affects     : zfs://pool=jaguar/vdev=740c01ae0d3c3109
>>>                 faulted and taken out of service
>>> Problem in  : zfs://pool=jaguar/vdev=740c01ae0d3c3109
>>>                 faulted and taken out of service
>>> 
>>> Description : The number of I/O errors associated with a ZFS device
>>>               exceeded acceptable levels.  Refer to
>>>               http://illumos.org/msg/ZFS-8000-FD for more information.
>>> 
>>> Response    : The device has been offlined and marked as faulted.  An
>>>               attempt will be made to activate a hot spare if available.
>>> 
>>> Impact      : Fault tolerance of the pool may be compromised.
>>> 
>>> Action      : Run 'zpool status -x' and replace the bad device.
>>> 
>>> 
>>> 
>>> On Thu, 2021-08-05 at 10:22 +0300, Toomas Soome via openindiana-
>>> discuss 
>>> wrote:
>>>>> On 5. Aug 2021, at 09:35, Michelle <michelle at msknight.com>
>>>>> wrote:
>>>>> 
>>>>> Hi Folks,
>>>>> 
>>>>> About a month ago I updated my Hipster...
>>>>> SunOS jaguar 5.11 illumos-ca706442e6 i86pc i386 i86pc
>>>>> 
>>>>> This morning it was absolutely crawling. Couldn't even connect
>>>>> via
>>>>> SSH
>>>>> and had to bounce the box.
>>>>> 
>>>>> It was reporting a drive as faulted, but didn't give any
>>>>> numbers...
>>>>> everything was 0. I'm now not sure what happened and whether
>>>>> the
>>>>> drive
>>>>> is good, or whether I should roll back the OS.
>>>>> 
>>>>> (and the drive, a WD Red 6TB (not shingled), went out of warranty
>>>>> a week ago. How about that, eh?)
>>>>> 
>>>>> Grateful for any opinions please.
>>>>> 
>>>>> Thu  5 Aug 04:00:01 UTC 2021
>>>>> NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH    ALTROOT
>>>>> lion  5.45T  5.28T   176G        -         -     4%    96%  1.00x  DEGRADED  -
>>>>> pool: jaguar
>>>>> state: DEGRADED
>>>>> status: One or more devices are faulted in response to persistent errors.
>>>>> 	Sufficient replicas exist for the pool to continue functioning in a
>>>>> 	degraded state.
>>>>> action: Replace the faulted device, or use 'zpool clear' to mark the device
>>>>> 	repaired.
>>>>> scan: scrub in progress since Thu Aug  5 00:00:00 2021
>>>>> 	6.00T scanned at 428M/s, 5.02T issued at 358M/s, 7.90T total
>>>>> 	1M repaired, 63.59% done, 0 days 02:20:17 to go
>>>>> config:
>>>>> 	NAME        STATE     READ WRITE CKSUM
>>>>> 	jaguar      DEGRADED     0     0     0
>>>>> 	  raidz1-0  DEGRADED     0     0     0
>>>>> 	    c6t0d0  ONLINE       0     0     0
>>>>> 	    c6t2d0  ONLINE       0     0     0
>>>>> 	    c6t3d0  FAULTED      0     0     0  too many errors  (repairing)
>>>>> 
>>>> 
>>>> Can you post output from: 
>>>> iostat -En
>>>> fmadm faulty
>>>> 
>>>> In any case, there definitely is a bug in error reporting: the
>>>> counters are zero while “too many errors” is reported.
>>>> 
>>>> rgds,
>>>> toomas
>>>> _______________________________________________
>>>> openindiana-discuss mailing list
>>>> openindiana-discuss at openindiana.org
>>>> https://openindiana.org/mailman/listinfo/openindiana-discuss
>>> 
>> 
> 
> 



