[OpenIndiana-discuss] Errors without errors

Thu Aug 5 08:11:53 UTC 2021

I removed the drive in order to do a backup before I start messing around
with things, which is why it isn't in the iostat output. The backup will
probably take until early evening.

This is what /var/adm/messages shows around that time. It almost looks
like, whatever happened, it rebooted.

Aug  5 01:55:01 jaguar smbd[601]: [ID 617204 daemon.error] Can't get
SID for ID=0 type=1, status=-9977
Aug  5 01:58:00 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
ahci port 3 has task file error
Aug  5 01:58:00 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
ahci port 3 is trying to do error recovery
Aug  5 01:58:00 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
ahci port 3 task_file_status = 0x4041
Aug  5 01:58:00 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
error recovery for port 3 succeed
Aug  5 01:58:09 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
ahci port 3 has task file error
Aug  5 01:58:09 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
ahci port 3 is trying to do error recovery
Aug  5 01:58:09 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
ahci port 3 task_file_status = 0x4041
Aug  5 01:58:09 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
error recovery for port 3 succeed
Aug  5 02:00:15 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
ahci port 3 has task file error
Aug  5 02:00:15 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
ahci port 3 is trying to do error recovery
Aug  5 02:00:15 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
ahci port 3 task_file_status = 0x4041
Aug  5 02:00:16 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
error recovery for port 3 succeed
Aug  5 02:00:20 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
ahci port 3 has task file error
Aug  5 02:00:20 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
ahci port 3 is trying to do error recovery
Aug  5 02:00:20 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
ahci port 3 task_file_status = 0x4041
Aug  5 02:00:20 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
error recovery for port 3 succeed
Aug  5 02:00:24 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
ahci port 3 has task file error
Aug  5 02:00:24 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
ahci port 3 is trying to do error recovery
Aug  5 02:00:24 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
ahci port 3 task_file_status = 0x4041
Aug  5 02:00:24 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
error recovery for port 3 succeed
Aug  5 02:00:24 jaguar ahci: [ID 811322 kern.info] NOTICE: ahci0:
ahci_tran_reset_dport port 3 reset device
Aug  5 02:00:29 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
ahci port 3 has task file error
Aug  5 02:00:29 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
ahci port 3 is trying to do error recovery
Aug  5 02:00:29 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
ahci port 3 task_file_status = 0x4041
Aug  5 02:00:29 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
error recovery for port 3 succeed
Aug  5 02:00:34 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
ahci port 3 has task file error
Aug  5 02:00:34 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
ahci port 3 is trying to do error recovery
Aug  5 02:00:34 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
ahci port 3 task_file_status = 0x4041
Aug  5 02:00:34 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
error recovery for port 3 succeed
Aug  5 02:00:38 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0:
ahci port 3 has task file error
Aug  5 02:00:38 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0:
ahci port 3 is trying to do error recovery
Aug  5 02:00:38 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0:
ahci port 3 task_file_status = 0x4041
Aug  5 02:00:38 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0:
error recovery for port 3 succeed
Aug  5 02:00:53 jaguar fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD,
TYPE: Fault, VER: 1, SEVERITY: Major
Aug  5 02:00:53 jaguar EVENT-TIME: Thu Aug  5 02:00:53 UTC 2021
Aug  5 02:00:53 jaguar PLATFORM: ProLiant-MicroServer, CSN: 5C7351P4L9,
HOSTNAME: jaguar
Aug  5 02:00:53 jaguar SOURCE: zfs-diagnosis, REV: 1.0
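
For what it's worth, here is a rough way to read the repeated
task_file_status = 0x4041 value. It is only a sketch, and it assumes the
driver is logging the AHCI port task file data register (status in the low
byte, error in the next byte) with the usual ATA bit names. The helper name
decode_task_file is made up for illustration and is not taken from the
driver source.

```python
# Sketch: split an AHCI task-file value into ATA status and error flags.
# Assumption: the logged value is the port's task file data register,
# with the ATA status byte in bits 7:0 and the ATA error byte in bits 15:8.

# Commonly used ATA status register bits (not an exhaustive list).
ATA_STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x08: "DRQ",   # data request
    0x01: "ERR",   # error, see the error byte
}

# Commonly used ATA error register bits (not an exhaustive list).
ATA_ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error
    0x40: "UNC",   # uncorrectable data error
    0x10: "IDNF",  # sector ID not found
    0x04: "ABRT",  # command aborted
}


def decode_task_file(value):
    """Return (status_flags, error_flags) for a 16-bit task-file value."""
    status = value & 0xFF
    error = (value >> 8) & 0xFF
    status_flags = [name for bit, name in ATA_STATUS_BITS.items() if status & bit]
    error_flags = [name for bit, name in ATA_ERROR_BITS.items() if error & bit]
    return status_flags, error_flags


if __name__ == "__main__":
    status_flags, error_flags = decode_task_file(0x4041)
    print("status:", status_flags)
    print("error:", error_flags)
```

Under those assumptions, 0x4041 reads as status DRDY + ERR with error UNC
(uncorrectable data). If that reading is right, it would suggest the drive
on port 3 could not read some sectors, rather than a cabling or OS problem,
but I would not treat it as conclusive.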

On Thu, 2021-08-05 at 11:03 +0300, Toomas Soome via openindiana-discuss 
wrote:
> > On 5. Aug 2021, at 10:52, Michelle <michelle at msknight.com> wrote:
> > 
> > Thanks for this. So I'm possibly better off rolling back the OS
> > snapshot after my backup has finished?
> 
> Maybe, maybe not. First of all, I have no idea to what point the
> rollback would be.
> 
> Secondly, the system has seen some errors; the problem is that, at this
> time, it does not tell us whether those were checksum errors or
> something else, and it seems to me that it is something else.
> 
> And this is why: if you look at your zpool output, you see a report
> about c6t3d0, but the iostat -En output below does not include c6t3d0.
> It seems to be missing.
> 
> What do you get from 'iostat -En c6t3d0'?
> 
> Also, it would be a good idea to check /var/adm/messages: are there any
> SATA or I/O related messages around August 5, 02:00?
> 
> FMA definitely has recorded an issue with the pool, so there must be
> something going on.
> 
> rgds,
> toomas
> 
> > I have removed the drive for the moment, and am running a backup.
> > Just in case :-)
> > 
> > mich at jaguar:~$ iostat -En
> > c5d1             Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
> > Model: INTEL SSDSA2M04 Revision:  Serial No: CVGB949301PC040
> > Size: 40.02GB <40019116032 bytes>
> > Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> > Illegal Request: 0
> > c6t1d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
> > Vendor: ATA      Product: WDC WD40EZRZ-00G Revision: 0A80 Serial No: WD-WCC7K5UK24LJ
> > Size: 4000.79GB <4000787030016 bytes>
> > Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> > Illegal Request: 0 Predictive Failure Analysis: 0
> > c6t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
> > Vendor: ATA      Product: WDC WD60EFRX-68L Revision: 0A82 Serial No: WD-WX21DA84EH0F
> > Size: 6001.18GB <6001175126016 bytes>
> > Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> > Illegal Request: 0 Predictive Failure Analysis: 0
> > c6t2d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
> > Vendor: ATA      Product: WDC WD60EFRX-68L Revision: 0A82 Serial No: WD-WX51DB880RJ4
> > Size: 6001.18GB <6001175126016 bytes>
> > Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> > Illegal Request: 0 Predictive Failure Analysis: 0
> > 
> > 
> > --------------- ------------------------------------  -------------- ---------
> > TIME            EVENT-ID                              MSG-ID         SEVERITY
> > --------------- ------------------------------------  -------------- ---------
> > Aug 05 02:00:53 c5934fd6-5f4b-409e-b0f8-8f44ea8f99c4  ZFS-8000-FD    Major
> > 
> > Host        : jaguar
> > Platform    : ProLiant-MicroServer      Chassis_id  : 5C7351P4L9
> > Product_sn  : 
> > 
> > Fault class : fault.fs.zfs.vdev.io
> > Affects     : zfs://pool=jaguar/vdev=740c01ae0d3c3109
> >                  faulted and taken out of service
> > Problem in  : zfs://pool=jaguar/vdev=740c01ae0d3c3109
> >                  faulted and taken out of service
> > 
> > Description : The number of I/O errors associated with a ZFS device exceeded
> >               acceptable levels.  Refer to
> >               http://illumos.org/msg/ZFS-8000-FD for more information.
> > 
> > Response    : The device has been offlined and marked as faulted.  An attempt
> >               will be made to activate a hot spare if available.
> > 
> > Impact      : Fault tolerance of the pool may be compromised.
> > 
> > Action      : Run 'zpool status -x' and replace the bad device.
> > 
> > 
> > 
> > On Thu, 2021-08-05 at 10:22 +0300, Toomas Soome via openindiana-discuss
> > wrote:
> > > > On 5. Aug 2021, at 09:35, Michelle <michelle at msknight.com> wrote:
> > > > 
> > > > Hi Folks,
> > > > 
> > > > About a month ago I updated my Hipster...
> > > > SunOS jaguar 5.11 illumos-ca706442e6 i86pc i386 i86pc
> > > > 
> > > > This morning it was absolutely crawling. I couldn't even connect via
> > > > SSH and had to bounce the box.
> > > > 
> > > > It was reporting a drive as faulted, but didn't give any numbers...
> > > > everything was 0. I'm now not sure what happened and whether the
> > > > drive is good, or whether I should roll back the OS.
> > > > 
> > > > (And the drive, a WD Red 6TB (not shingled), went out of warranty a
> > > > week ago. How about that, eh?)
> > > > 
> > > > Grateful for any opinions please.
> > > > 
> > > > Thu  5 Aug 04:00:01 UTC 2021
> > > > NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
> > > > lion  5.45T  5.28T   176G        -         -     4%    96%  1.00x  DEGRADED  -
> > > > pool: jaguar
> > > > state: DEGRADED
> > > > status: One or more devices are faulted in response to persistent errors.
> > > > 	Sufficient replicas exist for the pool to continue functioning in a
> > > > 	degraded state.
> > > > action: Replace the faulted device, or use 'zpool clear' to mark the device
> > > > 	repaired.
> > > > scan: scrub in progress since Thu Aug  5 00:00:00 2021
> > > > 	6.00T scanned at 428M/s, 5.02T issued at 358M/s, 7.90T total
> > > > 	1M repaired, 63.59% done, 0 days 02:20:17 to go
> > > > config:
> > > > 	NAME        STATE     READ WRITE CKSUM
> > > > 	jaguar      DEGRADED     0     0     0
> > > > 	  raidz1-0  DEGRADED     0     0     0
> > > > 	    c6t0d0  ONLINE       0     0     0
> > > > 	    c6t2d0  ONLINE       0     0     0
> > > > 	    c6t3d0  FAULTED      0     0     0  too many errors  (repairing)
> > > > 
> > > 
> > > Can you post the output from:
> > > iostat -En
> > > fmadm faulty
> > > 
> > > In any case, there definitely is a bug in the error reporting: the
> > > counters are zero while "too many errors" is reported.
> > > 
> > > rgds,
> > > toomas
> > > _______________________________________________
> > > openindiana-discuss mailing list
> > > openindiana-discuss at openindiana.org
> > > https://openindiana.org/mailman/listinfo/openindiana-discuss