[OpenIndiana-discuss] ZFS read speed (iSCSI)

Heinrich van Riel heinrich.vanriel at gmail.com
Mon Jun 10 18:46:06 UTC 2013


Just want to provide an update here.

Installed Solaris 11.1 and reconfigured everything. Went back to the Emulex
card since it is dual port, for connecting to both switches. Same problem;
well, the link does not fail, but it is writing at 20k/s.
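
(For reference, the throughput figures I quote are from watching the pool
on the storage side, along these lines, with "tank" standing in for my
pool name:

    zpool iostat tank 1

plus VMware's disk performance monitoring on the initiator side.)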


I am really not sure what to do anymore, other than to accept that FC target
is no longer an option, but I will post in the Oracle Solaris forum. Either
this has been an issue for some time, or it is this hardware combination, or
perhaps I am doing something seriously wrong.





On Sat, Jun 8, 2013 at 6:57 PM, Heinrich van Riel <
heinrich.vanriel at gmail.com> wrote:

> I took a look at every server that I knew I could power down or that is
> slated for removal in the future, and I found a QLogic adapter not in use.
>
> HBA Port WWN: 2100001b3280b
>         Port Mode: Target
>         Port ID: 12000
>         OS Device Name: Not Applicable
>         Manufacturer: QLogic Corp.
>         Model: QLE2460
>         Firmware Version: 5.2.1
>         FCode/BIOS Version: N/A
>         Serial Number: not available
>         Driver Name: COMSTAR QLT
>         Driver Version: 20100505-1.05
>         Type: F-port
>         State: online
>         Supported Speeds: 1Gb 2Gb 4Gb
>         Current Speed: 4Gb
>         Node WWN: 2000001b3280b
>
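> For completeness, this is roughly how the port ended up in target mode
> (a sketch from memory; 'pciex1077,2432' is what I believe the
> ISP2432-based QLE2460 identifies as, so verify with prtconf on your box):
>
> # a sketch: unbind the initiator driver, bind the COMSTAR target driver
> update_drv -d -i 'pciex1077,2432' qlc
> update_drv -a -i 'pciex1077,2432' qlt
> # confirm the target port shows up and is online
> stmfadm list-target -v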
>
> The link does not go down, but it is useless: right from the start it is
> as slow as the Emulex was after I made the xfer change.
> So it is not a driver issue.
>
> capacity operations bandwidth
> alloc free read write read write
> ----- ----- ----- ----- ----- -----
> 681G 53.8T 5 12 29.9K 51.3K
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 88 0 221K
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 163 0 812K
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 198 0 1.13M
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 88 0 221K
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 187 0 1.02M
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
> 681G 53.8T 0 0 0 0
>
> This is a clean install of a7 with nothing done other than NIC config in
> LACP. I did not attempt a reinstall of a5 yet and probably won't either.
> I don't know what to do anymore. I was going to try OmniOS, but there is
> no way of knowing if it would work.
>
>
> I will see if I can get approved for a Solaris license for one year; if
> not, I am switching back to Windows Storage Spaces. I can't back up the
> current lab on the EMC array to this node in any event, since there is no
> IP connectivity and FC is a dream.
>
> Guess I am the only one trying to use it as an FC target, so these
> problems go unnoticed.
>
>
>
> On Sat, Jun 8, 2013 at 4:55 PM, Heinrich van Riel <
> heinrich.vanriel at gmail.com> wrote:
>
>> Changing max-xfer-size causes the link to stay up, and no problems are
>> reported by stmf.
>>
>> #       Memory_model       max-xfer-size
>> #     ----------------------------------------
>> #       Small              131072 - 339968
>> #       Medium             339969 - 688128
>> #       Large              688129 - 1388544
>> #
>> # Range:  Min:131072   Max:1388544   Default:339968
>> #
>> max-xfer-size=339968;
>>
>> As soon as I changed it to 339969 there was no more link loss, but I am
>> not so lucky that it solves my problem: after a few minutes it grinds to
>> a crawl, so much so that in VMware it takes well over a minute just to
>> browse a folder. We are talking a few k/s.
>>
>> Setting it to the max causes the link to go down again, and stmf reports
>> the following again:
>> FROM STMF:0062568: abort_task_offline called for LPORT: lport abort timed
>> out
>>
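>> For anyone trying to reproduce: the change itself is just an edit to the
>> driver conf followed by a reload, something like this (a sketch; the conf
>> path may differ per install, and a reboot is the safe way to be sure
>> emlxs picks it up):
>>
>> vi /kernel/drv/emlxs.conf     # set max-xfer-size=339969;
>> update_drv emlxs             # reread the driver conf, or just reboot
>>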
>> I also played around with the buffer settings.
>>
>> Any ideas?
>> Thanks,
>>
>>
>>
>>  On Fri, Jun 7, 2013 at 8:38 PM, Heinrich van Riel <
>> heinrich.vanriel at gmail.com> wrote:
>>
>>> New card, different PCI-E slot (removed the other one), different FC
>>> switch (same model with same code), older HBA firmware (2.72a2) = same
>>> result.
>>>
>>> On the setting changes: when it boots it complains that this option
>>> does not exist: szfs_txg_synctime.
>>> The changes still allowed for a constant write, but at a max of 100MB/s,
>>> so not much better than iSCSI over 1GbE. I guess I would need to increase
>>> write_limit_override. If I disable the settings again it shows 240MB/s
>>> with bursts up to 300; both stats are from VMware's disk perf monitoring
>>> while cloning the same VM.
>>>
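>>> If I do try raising write_limit_override, the runtime change should look
>>> something like this (a sketch; 0x40000000 = 1GB is a guess, and since
>>> the variable is 64-bit, mdb's Z write format is the one to use):
>>>
>>> echo zfs_write_limit_override/Z0x40000000 | mdb -kw  # 1GB, guessed value
>>>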
>>> All iSCSI LUNs remain active with no impact.
>>> So I will conclude, I guess, that this seems to be the problem that was
>>> there in 2009, from roughly build 100 to 128. When I search the error
>>> messages, all the posts date back to 2009.
>>>
>>> I will try one more thing: a reinstall with 151a5, since a server that
>>> was removed from the env was running it with no issues, though with an
>>> older Emulex HBA, an LP10000 PCI-X.
>>> Looking at the notable changes in the release notes past a5, I do not see
>>> anything that changed that I would think would cause this behavior. Would
>>> this just be a waste of time?
>>>
>>>
>>>
>>> On Fri, Jun 7, 2013 at 6:36 PM, Heinrich van Riel <
>>> heinrich.vanriel at gmail.com> wrote:
>>>
>>>> In the debug info I see thousands of the following events:
>>>>
>>>> FROM STMF:0149225: abort_task_offline called for LPORT: lport abort
>>>> timed out
>>>> FROM STMF:0149225: abort_task_offline called for LPORT: lport abort
>>>> timed out
>>>> FROM STMF:0149225: abort_task_offline called for LPORT: lport abort
>>>> timed out
>>>> FROM STMF:0149226: abort_task_offline called for LPORT: lport abort
>>>> timed out
>>>> FROM STMF:0149226: abort_task_offline called for LPORT: lport abort
>>>> timed out
>>>> FROM STMF:0149226: abort_task_offline called for LPORT: lport abort
>>>> timed out
>>>> FROM STMF:0149227: abort_task_offline called for LPORT: lport abort
>>>> timed out
>>>> FROM STMF:0149227: abort_task_offline called for LPORT: lport abort
>>>> timed out
>>>> FROM STMF:0149227: abort_task_offline called for LPORT: lport abort
>>>> timed out
>>>> emlxs1:0149228: port state change from 11 to 11
>>>> FROM STMF:0149228: abort_task_offline called for LPORT: lport abort
>>>> timed out
>>>> FROM STMF:0149228: abort_task_offline called for LPORT: lport abort
>>>> timed out
>>>> FROM STMF:0149228: abort_task_offline called for LPORT: lport abort
>>>> timed out
>>>> :0149228: fct_port_shutdown: port-ffffff1157ff1278, fct_process_logo:
>>>> unable to
>>>> clean up I/O. iport-ffffff1157ff1378, icmd-ffffff1195463110
>>>> FROM STMF:0149229: abort_task_offline called for LPORT: lport abort
>>>> timed out
>>>> FROM STMF:0149229: abort_task_offline called for LPORT: lport abort
>>>> timed out
>>>> FROM STMF:0149229: abort_task_offline called for LPORT: lport abort
>>>> timed out
>>>>
>>>>
>>>> And then the following as the port recovers.
>>>>
>>>> emlxs1:0150128: port state change from 11 to 11
>>>> emlxs1:0150128: port state change from 11 to 0
>>>> emlxs1:0150128: port state change from 0 to 11
>>>> emlxs1:0150128: port state change from 11 to 0
>>>> :0150850: fct_port_initialize: port-ffffff1157ff1278, emlxs initialize
>>>> emlxs1:0150950: port state change from 0 to e
>>>> emlxs1:0150953: Posting sol ELS 3 (PLOGI) rp_id=fffffd lp_id=22000
>>>> emlxs1:0150953: Processing sol ELS 3 (PLOGI) rp_id=fffffd
>>>> emlxs1:0150953: Sol ELS 3 (PLOGI) completed with status 0, did/fffffd
>>>> emlxs1:0150953: Posting sol ELS 62 (SCR) rp_id=fffffd lp_id=22000
>>>> emlxs1:0150953: Processing sol ELS 62 (SCR) rp_id=fffffd
>>>> emlxs1:0150953: Sol ELS 62 (SCR) completed with status 0, did/fffffd
>>>> emlxs1:0151053: Posting sol ELS 3 (PLOGI) rp_id=fffffc lp_id=22000
>>>> emlxs1:0151053: Processing sol ELS 3 (PLOGI) rp_id=fffffc
>>>> emlxs1:0151053: Sol ELS 3 (PLOGI) completed with status 0, did/fffffc
>>>> emlxs1:0151054: Posting unsol ELS 3 (PLOGI) rp_id=fffc02 lp_id=22000
>>>> emlxs1:0151054: Processing unsol ELS 3 (PLOGI) rp_id=fffc02
>>>> emlxs1:0151054: Posting unsol ELS 20 (PRLI) rp_id=fffc02 lp_id=22000
>>>> emlxs1:0151054: Processing unsol ELS 20 (PRLI) rp_id=fffc02
>>>> emlxs1:0151055: Posting unsol ELS 5 (LOGO) rp_id=fffc02 lp_id=22000
>>>> emlxs1:0151055: Processing unsol ELS 5 (LOGO) rp_id=fffc02
>>>> emlxs1:0151146: Posting unsol ELS 3 (PLOGI) rp_id=21500 lp_id=22000
>>>> emlxs1:0151146: Processing unsol ELS 3 (PLOGI) rp_id=21500
>>>> emlxs1:0151146: Posting unsol ELS 20 (PRLI) rp_id=21500 lp_id=22000
>>>>  emlxs1:0151146: Processing unsol ELS 20 (PRLI) rp_id=21500
>>>> emlxs1:0151146: Posting unsol ELS 3 (PLOGI) rp_id=21600 lp_id=22000
>>>> emlxs1:0151146: Processing unsol ELS 3 (PLOGI) rp_id=21600
>>>> emlxs1:0151146: Posting unsol ELS 20 (PRLI) rp_id=21600 lp_id=22000
>>>> emlxs1:0151146: Processing unsol ELS 20 (PRLI) rp_id=21600
>>>> emlxs1:0151338: Posting unsol ELS 3 (PLOGI) rp_id=21500 lp_id=22000
>>>> emlxs1:0151338: Processing unsol ELS 3 (PLOGI) rp_id=21500
>>>> emlxs1:0151338: Posting unsol ELS 20 (PRLI) rp_id=21500 lp_id=22000
>>>> emlxs1:0151338: Processing unsol ELS 20 (PRLI) rp_id=21500
>>>> emlxs1:0151338: Posting unsol ELS 3 (PLOGI) rp_id=21600 lp_id=22000
>>>>  emlxs1:0151338: Processing unsol ELS 3 (PLOGI) rp_id=21600
>>>> emlxs1:0151338: Posting unsol ELS 20 (PRLI) rp_id=21600 lp_id=22000
>>>> emlxs1:0151338: Processing unsol ELS 20 (PRLI) rp_id=21600
>>>> emlxs1:0151428: Posting unsol ELS 3 (PLOGI) rp_id=21500 lp_id=22000
>>>> emlxs1:0151428: Processing unsol ELS 3 (PLOGI) rp_id=21500
>>>> emlxs1:0151428: port state change from e to 4
>>>> emlxs1:0151428: Posting unsol ELS 20 (PRLI) rp_id=21500 lp_id=22000
>>>> emlxs1:0151428: Processing unsol ELS 20 (PRLI) rp_id=21500
>>>> emlxs1:0151428: Posting unsol ELS 3 (PLOGI) rp_id=21600 lp_id=22000
>>>> emlxs1:0151428: Processing unsol ELS 3 (PLOGI) rp_id=21600
>>>> emlxs1:0151428: Posting unsol ELS 20 (PRLI) rp_id=21600 lp_id=22000
>>>> emlxs1:0151428: Processing unsol ELS 20 (PRLI) rp_id=21600
>>>>
>>>> To be honest it does not really tell me much, since I do not understand
>>>> COMSTAR to these depths. It would appear that the link fails, so it is
>>>> either a driver problem or a hardware issue? I will replace the LPe11002
>>>> with a brand new unopened one, and failing that, give up on FC on OI.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jun 7, 2013 at 4:54 PM, Heinrich van Riel <
>>>> heinrich.vanriel at gmail.com> wrote:
>>>>
>>>>> I did find this in my inbox from 2009. I have been using FC with ZFS
>>>>> for quite some time, and only recently retired an install with OI a5
>>>>> that was upgraded from OpenSolaris. It did not do real heavy-duty stuff,
>>>>> but I had a similar problem where we were stuck on build 99 for quite
>>>>> some time.
>>>>>
>>>>> To Jean-Yves Chevallier at Emulex:
>>>>> Any comments on the future of Emulex with regards to the COMSTAR
>>>>> project?
>>>>> It seems I am not the only one that has problems using Emulex in
>>>>> later builds. For now I am stuck with build 99.
>>>>> As always, any feedback would be greatly appreciated, since we have to
>>>>> decide between sticking with OpenSolaris & COMSTAR or starting to
>>>>> migrate to another solution, since we cannot stay on build 99 forever.
>>>>> What I am really trying to find out is whether there is a
>>>>> roadmap/decision to ultimately only support QLogic HBAs in target mode.
>>>>>
>>>>> Response:
>>>>>
>>>>>
>>>>> Sorry for the delay in answering you. I do have news for you.
>>>>> First off, the interface used by COMSTAR has changed in recent Nevada
>>>>> releases (NV120 and up I believe). Since it is not a public interface we
>>>>> had no prior indication on this.
>>>>> We know of a number of issues, some on our driver, some on the COMSTAR
>>>>> stack. Based on the information we have from you and other community
>>>>> members, we have addressed all these issues in our next driver version – we
>>>>> will know for sure after we run our DVT (driver verification testing) next
>>>>> week. Depending on progress, this driver will be part of NV 128 or else NV
>>>>> 130.
>>>>> I believe it is worth taking another look based on these upcoming
>>>>> builds, which I imagine might also include fixes to the rest of the COMSTAR
>>>>> stack.
>>>>>
>>>>> Best regards.
>>>>>
>>>>>
>>>>> I can confirm that this was fixed in 128: all I did was update from
>>>>> 99 to 128 and there were no problems.
>>>>> Seems like the same problem has now returned, and Emulex does not
>>>>> appear to be a good fit, since Sun mostly used QLogic.
>>>>>
>>>>> Guess it is back to iSCSI only for now.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jun 7, 2013 at 4:40 PM, Heinrich van Riel <
>>>>> heinrich.vanriel at gmail.com> wrote:
>>>>>
>>>>>> I changed the settings. I do see it writing all the time now, but the
>>>>>> link still dies after a few minutes:
>>>>>>
>>>>>> Jun  7 16:30:57  emlxs: [ID 349649 kern.info] [ 5.0608]emlxs1:
>>>>>> NOTICE: 730: Link reset. (Disabling link...)
>>>>>> Jun  7 16:30:57 emlxs: [ID 349649 kern.info] [ 5.0333]emlxs1:
>>>>>> NOTICE: 710: Link down.
>>>>>> Jun  7 16:33:16 emlxs: [ID 349649 kern.info] [ 5.055D]emlxs1:
>>>>>> NOTICE: 720: Link up. (4Gb, fabric, target)
>>>>>> Jun  7 16:33:16 fct: [ID 132490 kern.notice] NOTICE: emlxs1 LINK UP,
>>>>>> portid 22000, topology Fabric Pt-to-Pt,speed 4G
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 7, 2013 at 3:06 PM, Jim Klimov <jimklimov at cos.ru> wrote:
>>>>>>
>>>>>>> Comment below
>>>>>>>
>>>>>>>
>>>>>>> On 2013-06-07 20:42, Heinrich van Riel wrote:
>>>>>>>
>>>>>>>> One sec apart, cloning a 150GB VM from a datastore on EMC to OI.
>>>>>>>>
>>>>>>>> capacity operations bandwidth
>>>>>>>> alloc free read write read write
>>>>>>>> ----- ----- ----- ----- ----- -----
>>>>>>>> 309G 54.2T 81 48 452K 1.34M
>>>>>>>> 309G 54.2T 0 8.17K 0 258M
>>>>>>>> 310G 54.2T 0 16.3K 0 510M
>>>>>>>> 310G 54.2T 0 0 0 0
>>>>>>>> 310G 54.2T 0 0 0 0
>>>>>>>> 310G 54.2T 0 0 0 0
>>>>>>>> 310G 54.2T 0 10.1K 0 320M
>>>>>>>> 311G 54.2T 0 26.1K 0 820M
>>>>>>>> 311G 54.2T 0 0 0 0
>>>>>>>> 311G 54.2T 0 0 0 0
>>>>>>>> 311G 54.2T 0 0 0 0
>>>>>>>> 311G 54.2T 0 10.6K 0 333M
>>>>>>>> 313G 54.2T 0 27.4K 0 860M
>>>>>>>> 313G 54.2T 0 0 0 0
>>>>>>>> 313G 54.2T 0 0 0 0
>>>>>>>> 313G 54.2T 0 0 0 0
>>>>>>>> 313G 54.2T 0 9.69K 0 305M
>>>>>>>> 314G 54.2T 0 10.8K 0 337M
>>>>>>>>
>>>>>>> ...
>>>>>>> Were it not for your complaints about link resets and "unusable"
>>>>>>> connections, I'd say this looks like normal behavior for async
>>>>>>> writes: they get cached up, and every 5 sec you have a transaction
>>>>>>> group (TXG) sync which flushes the writes from cache to disks.
>>>>>>>
>>>>>>> In fact, the picture still looks like that, and that is possibly the
>>>>>>> reason for the hiccups.
>>>>>>>
>>>>>>> The TXG sync may be an IO-intensive process, which may block or
>>>>>>> delay many other system tasks; previously, when the interval
>>>>>>> defaulted to 30 sec, we got unusable SSH connections and temporarily
>>>>>>> "hung" disk requests on the storage server every half a minute when
>>>>>>> it was really busy (i.e. during the initial filling up with data from
>>>>>>> older boxes). It cached up about 10 seconds worth of writes, then
>>>>>>> spewed them out and could do nothing else. I don't think I ever saw
>>>>>>> network connections timing out or NICs reporting resets due to this,
>>>>>>> but I wouldn't be surprised if this were the cause in your case
>>>>>>> (i.e. disk IO threads preempting HBA/NIC threads for too long
>>>>>>> somehow, making the driver very puzzled about the staleness of its
>>>>>>> card).
>>>>>>>
>>>>>>> At the very least, TXG syncs can be tuned by two knobs: the time
>>>>>>> limit (5 sec default) and the size limit (when the cache is "this"
>>>>>>> full, begin the sync to disk). The latter is a realistic figure that
>>>>>>> can allow you to sync in shorter bursts - with fewer interruptions
>>>>>>> to smooth IO and process work.
>>>>>>>
>>>>>>> A somewhat related tunable is the number of requests that ZFS would
>>>>>>> queue up for a disk. Depending on its NCQ/TCQ abilities and random
>>>>>>> IO abilities (HDD vs. SSD), long or short queues may be preferable.
>>>>>>> See also: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29
>>>>>>>
>>>>>>> These tunables can be set at runtime with "mdb -K", as well as in
>>>>>>> the /etc/system file to survive reboots. One of our storage boxes
>>>>>>> has these example values in /etc/system:
>>>>>>>
>>>>>>> *# default: flush txg every 5sec (may be max 30sec, optimize
>>>>>>> *# for 5 sec writing)
>>>>>>> set zfs:zfs_txg_synctime = 5
>>>>>>>
>>>>>>> *# Spool to disk when the ZFS cache is 0x18000000 (384Mb) full
>>>>>>> set zfs:zfs_write_limit_override = 0x18000000
>>>>>>> *# ...for realtime changes use mdb.
>>>>>>> *# Example sets 0x18000000 (384Mb, 402653184 b):
>>>>>>> *# echo zfs_write_limit_override/W0t402653184 | mdb -kw
>>>>>>>
>>>>>>> *# ZFS queue depth per disk
>>>>>>> set zfs:zfs_vdev_max_pending = 3
>>>>>>>
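>>>>>>> *# (A sketch for checking the current values at runtime; /D prints
>>>>>>> *# a 32-bit decimal value, /E a 64-bit one:)
>>>>>>> *# echo zfs_txg_synctime/D | mdb -k
>>>>>>> *# echo zfs_write_limit_override/E | mdb -k
>>>>>>> *# echo zfs_vdev_max_pending/D | mdb -k
>>>>>>>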
>>>>>>> HTH,
>>>>>>> //Jim Klimov
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> OpenIndiana-discuss mailing list
>>>>>>> OpenIndiana-discuss at openindiana.org
>>>>>>> http://openindiana.org/mailman/listinfo/openindiana-discuss
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

