[OpenIndiana-discuss] Errors without errors
Toomas Soome
tsoome at me.com
Thu Aug 5 08:14:52 UTC 2021
> On 5. Aug 2021, at 11:11, Michelle <michelle at msknight.com> wrote:
>
> I removed the drive in order to do a backup before I start messing around
> with things, which is why it isn't in the iostat. The backup will take
> probably until early evening.
>
> This is what happened from messages around that time. Almost looks like
> whatever happened, it rebooted.
>
From those, I’d say, you need to replace that disk.
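For what it is worth, the task_file_status value can be read bit by bit. Below is a minimal sketch, assuming the value is the AHCI port task file data register (ATA status in bits 7:0, ATA error in bits 15:8) and using the standard ATA bit names; the helper names are made up for illustration. If that assumption holds, 0x4041 would mean the drive itself reported an uncorrectable data error (UNC), which fits a failing disk.

```python
# Assumed layout (AHCI task file data): bits 7:0 = ATA status, bits 15:8 = ATA error.
# Only the commonly documented bits are named; anything else is shown as "bitN".
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 3: "DRQ", 0: "ERR"}
ERROR_BITS = {7: "ICRC", 6: "UNC", 4: "IDNF", 2: "ABRT"}


def _names(value, table):
    """Return the names of the bits set in an 8-bit value, highest bit first."""
    return [table.get(bit, f"bit{bit}") for bit in range(7, -1, -1) if value & (1 << bit)]


def decode_task_file(tfd):
    """Split a task_file_status value into status and error flag names (hypothetical helper)."""
    status = tfd & 0xFF
    error = (tfd >> 8) & 0xFF
    return {"status": _names(status, STATUS_BITS), "error": _names(error, ERROR_BITS)}


print(decode_task_file(0x4041))
```

With the value from the log, the status byte 0x41 decodes to DRDY and ERR, and the error byte 0x40 decodes to UNC.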
rgds,
toomas
>
> Aug 5 01:55:01 jaguar smbd[601]: [ID 617204 daemon.error] Can't get SID for ID=0 type=1, status=-9977
> Aug 5 01:58:00 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> Aug 5 01:58:00 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> Aug 5 01:58:00 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> Aug 5 01:58:00 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> Aug 5 01:58:09 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> Aug 5 01:58:09 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> Aug 5 01:58:09 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> Aug 5 01:58:09 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> Aug 5 02:00:15 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> Aug 5 02:00:15 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> Aug 5 02:00:15 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> Aug 5 02:00:16 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> Aug 5 02:00:20 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> Aug 5 02:00:20 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> Aug 5 02:00:20 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> Aug 5 02:00:20 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> Aug 5 02:00:24 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> Aug 5 02:00:24 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> Aug 5 02:00:24 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> Aug 5 02:00:24 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> Aug 5 02:00:24 jaguar ahci: [ID 811322 kern.info] NOTICE: ahci0: ahci_tran_reset_dport port 3 reset device
> Aug 5 02:00:29 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> Aug 5 02:00:29 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> Aug 5 02:00:29 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> Aug 5 02:00:29 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> Aug 5 02:00:34 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> Aug 5 02:00:34 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> Aug 5 02:00:34 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> Aug 5 02:00:34 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> Aug 5 02:00:38 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> Aug 5 02:00:38 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> Aug 5 02:00:38 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> Aug 5 02:00:38 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> Aug 5 02:00:53 jaguar fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
> Aug 5 02:00:53 jaguar EVENT-TIME: Thu Aug 5 02:00:53 UTC 2021
> Aug 5 02:00:53 jaguar PLATFORM: ProLiant-MicroServer, CSN: 5C7351P4L9, HOSTNAME: jaguar
> Aug 5 02:00:53 jaguar SOURCE: zfs-diagnosis, REV: 1.0
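To get a quick count of how often each port complained, something like the following could be run over the messages file. It is only a sketch: the function name is made up for illustration, and it assumes each syslog entry sits on a single line, as it does in /var/adm/messages itself rather than in a wrapped mail quote.

```python
import re
from collections import Counter

# Matches the "has task file error" warning and captures controller and port number.
TASK_FILE_ERROR_RE = re.compile(r"WARNING: (ahci\d+): ahci port (\d+) has task file error")


def count_task_file_errors(lines):
    """Count task file errors per (controller, port) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = TASK_FILE_ERROR_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Two entries copied from the log above, plus one line that should not match.
SAMPLE = [
    "Aug 5 01:58:00 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error",
    "Aug 5 01:58:00 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed",
    "Aug 5 01:58:09 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error",
]

for (controller, port), n in sorted(count_task_file_errors(SAMPLE).items()):
    print(f"{controller} port {port}: {n} task file error(s)")
```

For the real file, passing `open("/var/adm/messages")` in place of `SAMPLE` should work, since the function only needs an iterable of lines.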
>
>
> On Thu, 2021-08-05 at 11:03 +0300, Toomas Soome via openindiana-discuss
> wrote:
>>> On 5. Aug 2021, at 10:52, Michelle <michelle at msknight.com> wrote:
>>>
>>> Thanks for this. So I'm possibly better off rolling back the OS
>>> snapshot after my backup has finished?
>>
>> maybe, maybe not. first of all, I have no idea to what point the
>> rollback would be.
>>
>> secondly, the system has seen some errors; the problem at this point
>> is that it does not tell us whether those were checksum errors or
>> something else, and it seems to me it is something else.
>>
>> and this is why: if you look at your zpool output, you see a report
>> about c6t3d0, but the iostat -En output below does not include c6t3d0.
>> It seems to be missing.
>>
>> what do you get from: 'iostat -En c6t3d0'?
>>
>> Also, it would be a good idea to check /var/adm/messages: are there any
>> SATA or I/O related messages around August 5, 02:00?
>>
>> FMA definitely has recorded an issue with the pool, so there must be
>> something going on.
>>
>> rgds,
>> toomas
>>
>>> I have removed the drive for the moment, and am running a backup. Just
>>> in case :-)
>>>
>>> mich at jaguar:~$ iostat -En
>>> c5d1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
>>> Model: INTEL SSDSA2M04 Revision: Serial No: CVGB949301PC040
>>> Size: 40.02GB <40019116032 bytes>
>>> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
>>> Illegal Request: 0
>>> c6t1d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
>>> Vendor: ATA Product: WDC WD40EZRZ-00G Revision: 0A80 Serial No: WD-WCC7K5UK24LJ
>>> Size: 4000.79GB <4000787030016 bytes>
>>> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
>>> Illegal Request: 0 Predictive Failure Analysis: 0
>>> c6t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
>>> Vendor: ATA Product: WDC WD60EFRX-68L Revision: 0A82 Serial No: WD-WX21DA84EH0F
>>> Size: 6001.18GB <6001175126016 bytes>
>>> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
>>> Illegal Request: 0 Predictive Failure Analysis: 0
>>> c6t2d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
>>> Vendor: ATA Product: WDC WD60EFRX-68L Revision: 0A82 Serial No: WD-WX51DB880RJ4
>>> Size: 6001.18GB <6001175126016 bytes>
>>> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
>>> Illegal Request: 0 Predictive Failure Analysis: 0
>>>
>>>
>>> --------------- ------------------------------------ -------------- ---------
>>> TIME            EVENT-ID                             MSG-ID         SEVERITY
>>> --------------- ------------------------------------ -------------- ---------
>>> Aug 05 02:00:53 c5934fd6-5f4b-409e-b0f8-8f44ea8f99c4 ZFS-8000-FD    Major
>>>
>>> Host : jaguar
>>> Platform : ProLiant-MicroServer Chassis_id : 5C7351P4L9
>>> Product_sn :
>>>
>>> Fault class : fault.fs.zfs.vdev.io
>>> Affects : zfs://pool=jaguar/vdev=740c01ae0d3c3109
>>> faulted and taken out of service
>>> Problem in : zfs://pool=jaguar/vdev=740c01ae0d3c3109
>>> faulted and taken out of service
>>>
>>> Description : The number of I/O errors associated with a ZFS device exceeded
>>>               acceptable levels. Refer to
>>>               http://illumos.org/msg/ZFS-8000-FD for more information.
>>>
>>> Response : The device has been offlined and marked as faulted. An attempt
>>>            will be made to activate a hot spare if available.
>>>
>>> Impact : Fault tolerance of the pool may be compromised.
>>>
>>> Action : Run 'zpool status -x' and replace the bad device.
>>>
>>>
>>>
>>> On Thu, 2021-08-05 at 10:22 +0300, Toomas Soome via openindiana-discuss
>>> wrote:
>>>>> On 5. Aug 2021, at 09:35, Michelle <michelle at msknight.com> wrote:
>>>>>
>>>>> Hi Folks,
>>>>>
>>>>> About a month ago I updated my Hipster...
>>>>> SunOS jaguar 5.11 illumos-ca706442e6 i86pc i386 i86pc
>>>>>
>>>>> This morning it was absolutely crawling. Couldn't even connect via SSH
>>>>> and had to bounce the box.
>>>>>
>>>>> It was reporting a drive as faulted, but didn't give any numbers...
>>>>> everything was 0. I'm now not sure what happened and whether the drive
>>>>> is good, or whether I should roll back the OS.
>>>>>
>>>>> (and the drive, a WD Red 6TB (not shingled), went out of warranty a
>>>>> week ago. How about that, eh?)
>>>>>
>>>>> Grateful for any opinions please.
>>>>>
>>>>> Thu 5 Aug 04:00:01 UTC 2021
>>>>> NAME SIZE  ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH   ALTROOT
>>>>> lion 5.45T 5.28T 176G -       -        4%   96% 1.00x DEGRADED -
>>>>> pool: jaguar
>>>>> state: DEGRADED
>>>>> status: One or more devices are faulted in response to persistent errors.
>>>>>         Sufficient replicas exist for the pool to continue functioning in a
>>>>>         degraded state.
>>>>> action: Replace the faulted device, or use 'zpool clear' to mark the device
>>>>>         repaired.
>>>>> scan: scrub in progress since Thu Aug 5 00:00:00 2021
>>>>>         6.00T scanned at 428M/s, 5.02T issued at 358M/s, 7.90T total
>>>>>         1M repaired, 63.59% done, 0 days 02:20:17 to go
>>>>> config:
>>>>> NAME          STATE     READ WRITE CKSUM
>>>>> jaguar        DEGRADED     0     0     0
>>>>>   raidz1-0    DEGRADED     0     0     0
>>>>>     c6t0d0    ONLINE       0     0     0
>>>>>     c6t2d0    ONLINE       0     0     0
>>>>>     c6t3d0    FAULTED      0     0     0  too many errors (repairing)
>>>>>
>>>>
>>>> Can you post output from:
>>>> iostat -En
>>>> fmadm faulty
>>>>
>>>> in any case, there definitely is a bug in error reporting: the counters
>>>> are zero while “too many errors” is reported.
>>>>
>>>> rgds,
>>>> toomas
>>>> _______________________________________________
>>>> openindiana-discuss mailing list
>>>> openindiana-discuss at openindiana.org
>>>> https://openindiana.org/mailman/listinfo/openindiana-discuss