[storage-discuss] Re: Re: Re: iscsi failure from Windows with b61

Nigel Smith nwsmith at wilusa.freeserve.co.uk
Mon Apr 16 12:52:28 PDT 2007


I've been doing some further testing with the Microsoft iScsi initiator,
and the Solaris iScsi target. Everything seemed to be going ok,...until it crashed!

My test hardware has improved since I last reported.
I'm now using GigaBit Ethernet.
The Solaris snv_60 is running on a Pentium-D dual-core 3GHz, with 1GB RAM.
The initiator is running Windows Server 2003 R2 on a Pentium III, 930MHz, 512MB RAM.

I tried copying files using Windows Explorer, and did not have any problem.
So I tried again with my old favourite, the SQLIOStress utility from Microsoft.
I've gone back to using the old version 4.00.050, (non-GUI command line).
I did one run of 16 iterations overnight for about 8 hours, with no problems.
I set it going again as I went to work this morning, & when I arrive home
tonight it had crashed, on iteration 12 of 16.

Windows displays a message box titled "Windows - Delayed Write Failed" saying:
"Windows was unable to save all the data for the file S:\$Mft.  The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere."

Over on the Solaris side, we see a core dump similar to the one
that Nicholas Lee reported:

# cat /etc/release
                            Solaris Nevada snv_60 X86
           Copyright 2007 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                             Assembled 12 March 2007
# tail -20l /var/svc/log/system-iscsitgt\:default.log
[ Apr  5 22:31:56 Executing start method ("/lib/svc/method/svc-iscsitgt start") ]
[ Apr  5 22:31:57 Method "start" exited with status 0 ]
[ Apr 16 13:31:06 Stopping because process dumped core. ]
[ Apr 16 13:31:08 Executing stop method ("/lib/svc/method/svc-iscsitgt stop 72") ]
[ Apr 16 13:31:08 Method "stop" exited with status 0 ]
[ Apr 16 13:31:09 Executing start method ("/lib/svc/method/svc-iscsitgt start") ]
[ Apr 16 13:31:10 Method "start" exited with status 0 ]
[ Apr 16 13:32:21 Stopping because process dumped core. ]
[ Apr 16 13:32:21 Executing stop method ("/lib/svc/method/svc-iscsitgt stop 128") ]
[ Apr 16 13:32:21 Method "stop" exited with status 0 ]
[ Apr 16 13:32:21 Executing start method ("/lib/svc/method/svc-iscsitgt start") ]
[ Apr 16 13:32:21 Method "start" exited with status 0 ]
[ Apr 16 13:33:12 Stopping because process dumped core. ]
[ Apr 16 13:33:12 Executing stop method ("/lib/svc/method/svc-iscsitgt stop 130") ]
[ Apr 16 13:33:12 Method "stop" exited with status 0 ]
[ Apr 16 13:33:12 Executing start method ("/lib/svc/method/svc-iscsitgt start") ]
[ Apr 16 13:33:12 Method "start" exited with status 0 ]
# cd /var/crash/solaris/
# ls -l
total 324416
-rw-------   1 root     root     18352886831 Apr 16 13:32 core.iscsitgtd.10030
-rw-------   1 root     root     18353079287 Apr 16 13:33 core.iscsitgtd.10040
-rw-------   1 root     root     15627199 Apr 16 13:31 core.iscsitgtd.471
# mdb core.iscsitgtd.471
Loading modules: [ libumem.so.1 libavl.so.1 libc.so.1 libnvpair.so.1 libuutil.so.1 ld.so.1 ]
> $c
libc.so.1`_lwp_kill+0xa()
libc.so.1`raise+0x19()
libc.so.1`abort+0xd1()
libc.so.1`_assert+0x74()
t10_cmd_state_machine+0x15b()
t10_handle_destroy+0xd1()
sess_from_t10+0xad()
libc.so.1`_thr_setup+0x67()
libc.so.1`_lwp_start()
>
# mdb core.iscsitgtd.10030
mdb: core file data for mapping at fffffd7bb8400000 not saved: I/O error
Loading modules: [ libumem.so.1 libavl.so.1 libc.so.1 libnvpair.so.1 libuutil.so.1 ld.so.1 ]
> $c
libc.so.1`memcpy+0x199()
trans_rqst_dataout+0x127()
sbc_write+0x3cf()
sbc_cmd+0x21()
lu_runner+0x7bc()
libc.so.1`_thr_setup+0x67()
libc.so.1`_lwp_start()
>
# mdb core.iscsitgtd.10040
mdb: core file data for mapping at fffffd7bb8600000 not saved: I/O error
Loading modules: [ libumem.so.1 libavl.so.1 libc.so.1 libnvpair.so.1 libuutil.so.1 ld.so.1 ]
> $c
libc.so.1`memcpy+0x199()
trans_rqst_dataout+0x127()
sbc_write+0x3cf()
sbc_cmd+0x21()
lu_runner+0x7bc()
libc.so.1`_thr_setup+0x67()
libc.so.1`_lwp_start()
>

I am still using the 'debug' version of the Microsoft initiator, and it logged alot of output 
in the window of 'DebugView'.  I've included this below, in case it gives any clues,
but unfortunately it may only make sense to the author of the initiator code!

00001711	485882.09375000	ResetLU : Session - F9358000, Request Count - 56	
00001712	485882.09375000	Will replace CmdSN e3549d with ExpCmdSN e35465	
00001713	485882.09375000	Resets TR = 0 LR = 1 for F9358000 Lun = 0	
00001714	485882.12500000	Session F9358000, completed zombie request F9D43000 with srb status 0xe	
00001715	485882.12500000	Session F9358000, completed zombie request F9D430D8 with srb status 0xe	
00001716	485882.12500000	Session F9358000, completed zombie request F9D431B0 with srb status 0xe	
<--snip-Many similar lines-->
00001767	485882.12500000	Session F9358000, completed zombie request F9D462A0 with srb status 0xe	
00001768	485882.12500000	Session F9358000, completed zombie request F9D46378 with srb status 0xe	
00001769	485882.12500000	Session F9358000, completed zombie request F9D46450 with srb status 0xe	
00001770	485907.09375000	ResetLU : Session - F9358000, Request Count - 63	
00001771	485907.09375000	Resets TR = 0 LR = 2 for F9358000 Lun = 0	
00001772	485927.09375000	Will free F8CBFC30 after send completes. Function 7	
00001773	485927.09375000	Session F9358000, completed zombie request F9D43000 with srb status 0xe	
00001774	485927.09375000	Session F9358000, completed zombie request F9D430D8 with srb status 0xe	
<--snip-Many similar lines-->
00001834	485927.09375000	Session F9358000, completed zombie request F9D46378 with srb status 0xe	
00001835	485927.09375000	Session F9358000, completed zombie request F9D46450 with srb status 0xe	
00001836	485927.09375000	At TMF timeout completing reset srb FF53F360 for Session F9358000	
00001837	485927.75000000	Processing unexpected PDU for F949E6E0	
00001838	485952.15625000	ResetLU : Session - F9358000, Request Count - 1	
00001839	485952.15625000	Will replace CmdSN e354a5 with ExpCmdSN e3549d	
00001840	485952.15625000	Resets TR = 0 LR = 3 for F9358000 Lun = 0	
00001841	485952.15625000	Invalid header F949E914	
00001842	485952.15625000	Ignoring receive. Connection state 2, ProtocolState 14	
00001843	485952.15625000	Unable to process data given. Available 17, Indicated 17	
00001844	485952.15625000	ResetLU : Session - F9358000, Request Count - 1	
00001845	485952.15625000	Free All Entries for F949E6E0: Found entry F8C90E30 (7)	
00001846	485952.15625000	AR FA020950 for timeout. Function 7	
00001847	485952.15625000	At TMF timeout completing reset srb FFB968A0 for Session F9358000	
00001848	485952.15625000	Premature Termination : Okay to remove connection F949E6E0	
00001849	485952.15625000	Premature Termination : Okay to remove session F9358000	
00001850	485953.62500000	[1320] 04/16/2007 12:32: 1.00000377:[     528.     574] iscsidsc: Did not find target portal 172.16.8.38/3260 0xefff000b	
00001851	485967.12500000	Disconnect time NOT exceeded. Srb F8C2F1D4 [0x2a]	
00001852	485977.15625000	Disconnect time NOT exceeded. Srb F8FD63E4 [0x2a]	
00001853	485983.12500000	FF64D3C8 timed out waiting for relogin response	
00001854	485983.12500000	Will queue login worker on timeout for FF64D3C8	
00001855	485984.18750000	[1320] 04/16/2007 12:32:32.00000001:[     528.     574] iscsidsc: Did not find target portal 172.16.8.38/3260 0xefff000b	
00001856	485984.18750000	Relogin. Reconnect count - 1	
00001857	486017.18750000	ResetLU : Session - F9358000, Request Count - 1	
00001858	486017.18750000	Resets TR = 0 LR = 1 for F9358000 Lun = 0	
00001859	486022.31250000	RaidSendIrpSynchronous (FA478C78) irp FF5AB340 did not complete within 1e seconds	
00001860	486035.12500000	Will retry lost connection F9206B70 (9). ERL 0	
00001861	486035.12500000	ResetLU : Session - F9358000, Request Count - 1	
00001862	486035.12500000	Free All Entries for F9206B70: Found entry 8177AE20 (7)	
00001863	486035.12500000	AR F8CBFC30 for timeout. Function 7	
00001864	486035.12500000	At TMF timeout completing reset srb FF0D44B0 for Session F9358000	
00001865	486035.12500000	Premature Termination : Okay to remove connection F9206B70	
00001866	486035.12500000	Premature Termination : Okay to remove session F9358000	
00001867	486035.12500000	[1320] 04/16/2007 12:33:22.00000944:[     528.     574] iscsidsc: Did not find target portal 172.16.8.38/3260 0xefff000b	
00001868	486035.56250000	Relogin. Reconnect count - 1	
00001869	486065.56250000	RaidSendIrpSynchronous (FA478C78) irp FF5AB340 did not complete within 1e seconds	
00001870	486068.18750000	ResetLU : Session - F9358000, Request Count - 63	
00001871	486068.18750000	Will replace CmdSN e354e1 with ExpCmdSN e354a4	
00001872	486068.18750000	Resets TR = 0 LR = 1 for F9358000 Lun = 0	
00001873	486068.18750000	Failed to match unit FA478D30	
00001874	486086.46875000	Will retry lost connection F9D2EB70 (9). ERL 0	
00001875	486086.46875000	ResetLU : Session - F9358000, Request Count - 63	
00001876	486086.46875000	Free All Entries for F9D2EB70: Found entry FF5E11F8 (7)	
00001877	486086.46875000	Will free F9115780 after send completes. Function 7	
00001878	486086.46875000	At TMF timeout completing reset srb FF0D44B0 for Session F9358000	
00001879	486086.46875000	Will Not process request recovery in progress for F9358000	
00001880	486086.46875000	Disconnect time NOT exceeded. Srb FA2ED2C4 [0x2a]	
00001881	486086.46875000	Will Not process request recovery in progress for F9358000	
00001882	486086.46875000	Disconnect time NOT exceeded. Srb FA2ED2C4 [0x2a]	
<--snip - 100's of similar lines-->
00053756	486087.65625000	Disconnect time NOT exceeded. Srb FA2ED2C4 [0x2a]	
00053757	486087.65625000	Will Not process request recovery in progress for F9358000	
00053758	486087.65625000	Disconnect time NOT exceeded. Srb FA2ED2C4 [0x2a]	
00053759	486087.65625000	Will Not process request recovery in progress for F9358000	
00053760	486087.65625000	Disconnect time NOT exceeded. Srb FA2ED2C4 [0x2a]	
02603103	486146.75000000	Relogin. Reconnect count - 1	
02603104	486147.34375000	[1320] 04/16/2007 12:35:14.00000564:[     528.     574] iscsidsc: Did not find target portal 172.16.8.38/3260 0xefff000b	
<--Followed by repeating attempts to re-login, that fail-->

Rick - have you any thoughts on what may be happening here?
Is there anything further I can do to assist?
I will try to see if I can trigger the problem again while capturing packets,
but that will be difficult!
Thanks
Nigel Smith
 
 
This message posted from opensolaris.org



More information about the storage-discuss mailing list