[OpenIndiana-discuss] Errors without errors

Michelle michelle at msknight.com
Fri Aug 6 05:45:54 UTC 2021


OK ... to update...

Apparently I did the update on 30th August. Previous installation was
some time in May.

I used beadm to revert to the previous boot environment, then rebooted
and destroyed the August environment.
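
For the record, the sequence was roughly this (the BE names are
placeholders; 'beadm list' shows the real ones):

    beadm list
    pfexec beadm activate <previous-BE>
    pfexec reboot
    pfexec beadm destroy <august-BE>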

The system then faulted rpool, saying it needed a 'zpool clear', and
refused to boot, just hanging at the "Hipster" load screen.

I used option 3 for the loader prompt and couldn't find anything in
there that would help, but for the giggles I told it to boot anyway...
and it did.
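
(In case anyone hits the same thing: from the loader's OK prompt a
plain "boot" continues startup, and once the system is up, the clear
it was asking for would be something like:

    pfexec zpool clear rpool
    zpool status -x rpool
)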

So... exactly what happened there, I don't know. 

It resilvered the dataset, which didn't take long at all... minutes...
so I've got a scrub running and, so far, no errors in /var/adm/messages.

Fingers crossed. Scheduled to take 16 hours.

I'm keeping an eye on the scrub status and the messages file and we'll
see what happens.
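
(Concretely, that's just something like:

    zpool status -v            # scrub progress and per-device error counters
    tail -f /var/adm/messages  # watch for new ahci/sata warnings
)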

Michelle.


On Thu, 2021-08-05 at 10:46 -0300, Till Wegmueller wrote:
> Hi Michelle
> 
> For user home files you will need a backup anyway. For system
> consistency you can use `pkg fix` to restore the system image to a
> known state in a new boot environment.
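> 
> A minimal sketch (the BE name is only an example, and whether your
> pkg supports the --be-name option depends on its version):
> 
>   pfexec pkg fix                           # verify and repair all installed packages
>   pfexec pkg fix --be-name fixed-20210806  # if supported, repair into a new BE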
> 
> Greetings
> Till
> 
> On 05.08.21 05:14, Toomas Soome via openindiana-discuss wrote:
> > 
> > > On 5. Aug 2021, at 11:11, Michelle <michelle at msknight.com> wrote:
> > > 
> > > I removed the drive in order to take a backup before I start
> > > messing around with things, which is why it isn't in the iostat.
> > > The backup will probably take until early evening.
> > > 
> > > This is what happened in messages around that time. It almost
> > > looks like, whatever happened, it rebooted.
> > > 
> > 
> > From those, I'd say you need to replace that disk.
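> > 
> > After swapping the disk, resilvering would be something like (pool
> > and device names taken from the zpool status further down):
> > 
> >   pfexec zpool replace jaguar c6t3d0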
> > 
> > rgds,
> > toomas
> > 
> > > Aug  5 01:55:01 jaguar smbd[601]: [ID 617204 daemon.error] Can't get SID for ID=0 type=1, status=-9977
> > > Aug  5 01:58:00 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> > > Aug  5 01:58:00 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> > > Aug  5 01:58:00 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> > > Aug  5 01:58:00 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> > > Aug  5 01:58:09 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> > > Aug  5 01:58:09 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> > > Aug  5 01:58:09 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> > > Aug  5 01:58:09 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> > > Aug  5 02:00:15 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> > > Aug  5 02:00:15 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> > > Aug  5 02:00:15 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> > > Aug  5 02:00:16 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> > > Aug  5 02:00:20 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> > > Aug  5 02:00:20 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> > > Aug  5 02:00:20 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> > > Aug  5 02:00:20 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> > > Aug  5 02:00:24 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> > > Aug  5 02:00:24 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> > > Aug  5 02:00:24 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> > > Aug  5 02:00:24 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> > > Aug  5 02:00:24 jaguar ahci: [ID 811322 kern.info] NOTICE: ahci0: ahci_tran_reset_dport port 3 reset device
> > > Aug  5 02:00:29 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> > > Aug  5 02:00:29 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> > > Aug  5 02:00:29 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> > > Aug  5 02:00:29 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> > > Aug  5 02:00:34 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> > > Aug  5 02:00:34 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> > > Aug  5 02:00:34 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> > > Aug  5 02:00:34 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> > > Aug  5 02:00:38 jaguar ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
> > > Aug  5 02:00:38 jaguar ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
> > > Aug  5 02:00:38 jaguar ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
> > > Aug  5 02:00:38 jaguar ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
> > > Aug  5 02:00:53 jaguar fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
> > > Aug  5 02:00:53 jaguar EVENT-TIME: Thu Aug  5 02:00:53 UTC 2021
> > > Aug  5 02:00:53 jaguar PLATFORM: ProLiant-MicroServer, CSN: 5C7351P4L9, HOSTNAME: jaguar
> > > Aug  5 02:00:53 jaguar SOURCE: zfs-diagnosis, REV: 1.0
> > > 
> > > 
> > > On Thu, 2021-08-05 at 11:03 +0300, Toomas Soome via openindiana-discuss wrote:
> > > > > On 5. Aug 2021, at 10:52, Michelle <michelle at msknight.com> wrote:
> > > > > 
> > > > > Thanks for this. So I'm possibly better off rolling back the OS
> > > > > snapshot after my backup has finished?
> > > > 
> > > > maybe, maybe not. first of all, I have no idea what point the
> > > > rollback would go back to.
> > > > 
> > > > secondly, the system has seen some errors; the trouble is that the
> > > > fault does not tell us whether those were checksum errors or
> > > > something else, and it seems to me it is something else.
> > > > 
> > > > and this is why: if you look at your zpool output, you see a
> > > > report about c6t3d0, but the iostat -En below does not include
> > > > c6t3d0. It seems to be missing.
> > > > 
> > > > what do you get from 'iostat -En c6t3d0'?
> > > > 
> > > > Also, it would be a good idea to check /var/adm/messages: are
> > > > there any SATA or IO related messages around August 05, 02:00?
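> > > > 
> > > > For example, something along these lines (the pattern is just a
> > > > sketch):
> > > > 
> > > >   grep 'Aug  5 0[12]:' /var/adm/messages | egrep -i 'ahci|sata|error'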
> > > > 
> > > > FMA has definitely recorded an issue with the pool, so there must
> > > > be something going on.
> > > > 
> > > > rgds,
> > > > toomas
> > > > 
> > > > > I have removed the drive for the moment, and am running a
> > > > > backup. Just in case :-)
> > > > > 
> > > > > mich at jaguar:~$ iostat -En
> > > > > c5d1             Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
> > > > > Model: INTEL SSDSA2M04 Revision:  Serial No: CVGB949301PC040
> > > > > Size: 40.02GB <40019116032 bytes>
> > > > > Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> > > > > Illegal Request: 0
> > > > > c6t1d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
> > > > > Vendor: ATA      Product: WDC WD40EZRZ-00G Revision: 0A80 Serial No: WD-WCC7K5UK24LJ
> > > > > Size: 4000.79GB <4000787030016 bytes>
> > > > > Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> > > > > Illegal Request: 0 Predictive Failure Analysis: 0
> > > > > c6t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
> > > > > Vendor: ATA      Product: WDC WD60EFRX-68L Revision: 0A82 Serial No: WD-WX21DA84EH0F
> > > > > Size: 6001.18GB <6001175126016 bytes>
> > > > > Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> > > > > Illegal Request: 0 Predictive Failure Analysis: 0
> > > > > c6t2d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
> > > > > Vendor: ATA      Product: WDC WD60EFRX-68L Revision: 0A82 Serial No: WD-WX51DB880RJ4
> > > > > Size: 6001.18GB <6001175126016 bytes>
> > > > > Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> > > > > Illegal Request: 0 Predictive Failure Analysis: 0
> > > > > 
> > > > > 
> > > > > --------------- ------------------------------------  -------------- ---------
> > > > > TIME            EVENT-ID                              MSG-ID         SEVERITY
> > > > > --------------- ------------------------------------  -------------- ---------
> > > > > Aug 05 02:00:53 c5934fd6-5f4b-409e-b0f8-8f44ea8f99c4  ZFS-8000-FD    Major
> > > > > 
> > > > > Host        : jaguar
> > > > > Platform    : ProLiant-MicroServer      Chassis_id  : 5C7351P4L9
> > > > > Product_sn  :
> > > > > 
> > > > > Fault class : fault.fs.zfs.vdev.io
> > > > > Affects     : zfs://pool=jaguar/vdev=740c01ae0d3c3109
> > > > >                   faulted and taken out of service
> > > > > Problem in  : zfs://pool=jaguar/vdev=740c01ae0d3c3109
> > > > >                   faulted and taken out of service
> > > > > 
> > > > > Description : The number of I/O errors associated with a ZFS device exceeded
> > > > >               acceptable levels.  Refer to
> > > > >               http://illumos.org/msg/ZFS-8000-FD for more information.
> > > > > 
> > > > > Response    : The device has been offlined and marked as faulted.  An attempt
> > > > >               will be made to activate a hot spare if available.
> > > > > 
> > > > > Impact      : Fault tolerance of the pool may be compromised.
> > > > > 
> > > > > Action      : Run 'zpool status -x' and replace the bad device.
> > > > > 
> > > > > 
> > > > > On Thu, 2021-08-05 at 10:22 +0300, Toomas Soome via openindiana-discuss wrote:
> > > > > > > On 5. Aug 2021, at 09:35, Michelle <michelle at msknight.com> wrote:
> > > > > > > 
> > > > > > > Hi Folks,
> > > > > > > 
> > > > > > > About a month ago I updated my Hipster...
> > > > > > > SunOS jaguar 5.11 illumos-ca706442e6 i86pc i386 i86pc
> > > > > > > 
> > > > > > > This morning it was absolutely crawling. Couldn't even
> > > > > > > connect via SSH and I had to bounce the box.
> > > > > > > 
> > > > > > > It was reporting a drive as faulted, but didn't give any
> > > > > > > numbers... everything was 0. I'm now not sure what happened,
> > > > > > > and whether the drive is good or whether I should roll back
> > > > > > > the OS.
> > > > > > > 
> > > > > > > (And the drive, a WD Red 6TB (not shingled), went out of
> > > > > > > warranty a week ago. How about that, eh?)
> > > > > > > 
> > > > > > > Grateful for any opinions please.
> > > > > > > 
> > > > > > > Thu  5 Aug 04:00:01 UTC 2021
> > > > > > > NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH    ALTROOT
> > > > > > > lion  5.45T  5.28T   176G        -         -     4%    96%  1.00x  DEGRADED  -
> > > > > > > 
> > > > > > >   pool: jaguar
> > > > > > >  state: DEGRADED
> > > > > > > status: One or more devices are faulted in response to persistent errors.
> > > > > > >         Sufficient replicas exist for the pool to continue functioning in a
> > > > > > >         degraded state.
> > > > > > > action: Replace the faulted device, or use 'zpool clear' to mark the device
> > > > > > >         repaired.
> > > > > > >   scan: scrub in progress since Thu Aug  5 00:00:00 2021
> > > > > > >         6.00T scanned at 428M/s, 5.02T issued at 358M/s, 7.90T total
> > > > > > >         1M repaired, 63.59% done, 0 days 02:20:17 to go
> > > > > > > config:
> > > > > > >         NAME        STATE     READ WRITE CKSUM
> > > > > > >         jaguar      DEGRADED     0     0     0
> > > > > > >           raidz1-0  DEGRADED     0     0     0
> > > > > > >             c6t0d0  ONLINE       0     0     0
> > > > > > >             c6t2d0  ONLINE       0     0     0
> > > > > > >             c6t3d0  FAULTED      0     0     0  too many errors  (repairing)
> > > > > > > 
> > > > > > 
> > > > > > Can you post output from:
> > > > > > iostat -En
> > > > > > fmadm faulty
> > > > > > 
> > > > > > in any case, there definitely is a bug in the error reporting -
> > > > > > the counters are zero while “too many errors” is reported.
> > > > > > 
> > > > > > rgds,
> > > > > > toomas
> 
> _______________________________________________
> openindiana-discuss mailing list
> openindiana-discuss at openindiana.org
> https://openindiana.org/mailman/listinfo/openindiana-discuss