[driver-discuss] Driver trouble on Nevada...

Kyle McDonald KMcDonald at Egenera.COM
Thu Nov 12 12:22:51 PST 2009


Carson Tan wrote:
> Hi Kyle,
>
> Thanks for your great effort on this. It really makes sense, and I 
> really appreciate it.
>
> I have checked your disassembly of the disconnection point, and it 
> looks the same as mine. But
> it's hard to say what's the root cause right now. I am still waiting 
> for the document from IBM.
> Meanwhile, I am trying to find out which previous build of Nevada 
> works well, as that will be
> much easier for me to find the differences.
>
> Any update, I will let you know.
>
Just for curiosity's sake, I stepped through the same section of code on 
S10u8 (output below) and, wouldn't you know it, it did disconnect. So, 
while I'm not sure what it means, I think it's telling us something. Why 
would S10 work when running freely outside the debugger, yet disconnect 
when stepping through the code, while NV disconnects both inside and 
outside the debugger?

All I can think of is a timing issue. Something in the timing of S10 
allows it to avoid disconnecting when running full tilt?

Anyway, it's food for thought.

  -Kyle



S10u8 booting:
ucode0 is /pseudo/ucode at 0
pseudo-device: fssnap0
fssnap0 is /pseudo/fssnap at 0
pseudo-device: winlock0
winlock0 is /pseudo/winlock at 0
pseudo-device: vol0
vol0 is /pseudo/vol at 0
pseudo-device: pm0
pm0 is /pseudo/pm at 0
pseudo-device: rsm0
rsm0 is /pseudo/rsm at 0
pseudo-device: pool0
pool0 is /pseudo/pool at 0
Hostname: Einstein03
dump on /dev/zvol/dsk/zroot0/dump size 4096 MB
NIS domain name is Engineering.NIS
Loaded modules: [ crypto cpc ptm ufs sppp lofs logindmux md random ]
kmdb: stop at bge`bge_attach
kmdb: target stopped at:
bge`bge_attach: pushq  %rbp
[3]> bge_asf_pre_reset_operations:b
[3]> :c
kmdb: stop at bge`bge_asf_pre_reset_operations
kmdb: stop at bge`bge_asf_pre_reset_operations
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations:       pushq  %rbp
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+1:     movl   $0x2,%edx
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+6:     movq   %rsp,%rbp
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+9:     pushq  %r13
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0xb:   movl   %esi,%r13d
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0xe:   movl   $0xb78,%esi
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x13:  pushq  %r12
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x15:  movq   %rdi,%r12
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x18:  pushq  %rbx
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x19:  xorl   %ebx,%ebx
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x1b:  subq   $0x8,%rsp
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x1f:  call   -0x448f  <bge`bge_nic_put32>
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x24:  movl   $0x6810,%esi
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x29:  movq   %r12,%rdi
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x2c:  call   -0x480c  <bge`bge_reg_get32>
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x31:  movl   %eax,%edx
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x33:  movl   $0x6810,%esi
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x38:  movq   %r12,%rdi
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x3b:  orb    $0x40,%dh
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x3e:  call   -0x47fe  <bge`bge_reg_put32>
[3]>
system> console -T blade[3]       
SOL is not ready
system>
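
For anyone reading the trace against the C line Carson quoted, here is a rough sketch of how I read the last few instructions. It assumes, from the disassembly alone and not from the bge headers, that 0x6810 is RX_RISC_EVENT_REG and that "orb $0x40,%dh" is the compiler's encoding of "event | RRER_ASF_EVENT". Since %dh is bits 8..15 of %edx, that OR sets bit 14, i.e. 0x4000. Both helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Emulate "orb $imm8,%dh": OR an 8-bit immediate into bits 8..15 of a
 * 32-bit value and leave every other bit alone. Hypothetical helper for
 * reading the trace; it is not part of the bge driver.
 */
static uint32_t
or_into_dh(uint32_t edx, uint8_t imm8)
{
	return (edx | ((uint32_t)imm8 << 8));
}

/* The mask implied by "orb $0x40,%dh" at +0x3b in the trace. */
static uint32_t
asf_event_mask_from_trace(void)
{
	uint32_t mask = or_into_dh(0, 0x40);

	/* A single-bit event flag is what we expect to see here. */
	assert(mask != 0 && (mask & (mask - 1)) == 0);
	return (mask);
}
```

So the call at +0x3e would be writing (old value | 0x4000) back to register 0x6810, which matches the put that Carson identified as the point where the SOL session drops.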

> Thanks again,
> Carson
>
>
> Kyle McDonald wrote:
>> Carson Tan wrote:
>>> Hi Minskey and Kyle,
>>>
>>> Thanks for all your discussion on this.
>>>
>>> I found that the SOL session is gone after executing the following 
>>> code in bge_asf_pre_reset_operations:
>>> bge_reg_put32(bgep, RX_RISC_EVENT_REG, event | RRER_ASF_EVENT);
>>>
>> Hi,
>>
>> I've never done driver development, so if I'm way off base just say 
>> so....
>>
>> The code line you quote above is part of:
>>
>>     event = bge_reg_get32(bgep, RX_RISC_EVENT_REG);
>>     bge_reg_put32(bgep, RX_RISC_EVENT_REG, event | RRER_ASF_EVENT);
>>
>>
>>
>> Is this section of code atomic?
>> Can the HW change the register on its own?
>>
>> The failure is 100% reproducible, and not intermittent, so I normally 
>> wouldn't consider a race condition right away, but it occurred to me 
>> that any changes to the register between the get and the put would be 
>> lost by this code.
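
To make the lost-update worry concrete, here is a minimal sketch against a simulated register. It assumes a plain read/write register where the hardware can set bits on its own; I don't know whether RX_RISC_EVENT_REG actually behaves that way (it could, for instance, have write-to-clear bits), and all of the names below are hypothetical stand-ins, not driver code.

```c
#include <stdint.h>

/* Simulated device register; stands in for the real MMIO location. */
static uint32_t fake_reg;

static uint32_t
fake_get32(void)
{
	return (fake_reg);
}

static void
fake_put32(uint32_t val)
{
	fake_reg = val;
}

/*
 * Open-coded read-modify-write, same shape as the get/put pair quoted
 * above. hw_bits models the hardware setting bits by itself between the
 * driver's read and its write (pass 0 for "no hardware activity").
 * Returns the final register contents.
 */
static uint32_t
rmw_with_hw_update(uint32_t initial, uint32_t set_bits, uint32_t hw_bits)
{
	uint32_t event;

	fake_reg = initial;
	event = fake_get32();		/* driver reads the register */
	fake_reg |= hw_bits;		/* hardware changes it meanwhile */
	fake_put32(event | set_bits);	/* driver writes back a stale copy */
	return (fake_reg);
}
```

With initial = 0x1 and set_bits = 0x4000, the result is 0x4001 whether or not the simulated hardware set 0x2 in between, so the hardware's bit is silently dropped. Whether the real chip ever does that in this window is exactly what I can't tell.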
>>
>> Poking around, I also noticed this function:
>>
>> void
>> bge_reg_set32(bge_t *bgep, bge_regno_t regno, uint32_t bits)
>> {
>>     uint32_t regval;
>>
>>     BGE_TRACE(("bge_reg_set32($%p, 0x%lx, 0x%x)",
>>         (void *)bgep, regno, bits));
>>
>>     regval = bge_reg_get32(bgep, regno);
>>     regval |= bits;
>>     bge_reg_put32(bgep, regno, regval);
>> }
>>
>> I don't know if it would be any better protected than the existing 
>> code above, but it seems like the code above could have been 
>> re-written as:
>>
>> bge_reg_set32(bgep, RX_RISC_EVENT_REG, RRER_ASF_EVENT);
>>
>>
>>
>> Am I missing something?
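
On the "better protected" question, a small sketch suggests the answer is probably no. It mirrors the body of bge_reg_set32() and the open-coded get/put against a simulated register and counts accesses. The sim_* names are hypothetical, and the real functions go through the driver's register access layer, which this does not model.

```c
#include <stdint.h>

/* Simulated register plus access counters (illustration only). */
static uint32_t sim_reg;
static unsigned int sim_reads;
static unsigned int sim_writes;

static void
sim_reset(uint32_t initial)
{
	sim_reg = initial;
	sim_reads = 0;
	sim_writes = 0;
}

static uint32_t
sim_get32(void)
{
	sim_reads++;
	return (sim_reg);
}

static void
sim_put32(uint32_t val)
{
	sim_writes++;
	sim_reg = val;
}

/* Same shape as the bge_reg_set32() body: read, OR in bits, write. */
static void
sim_set32(uint32_t bits)
{
	uint32_t regval;

	regval = sim_get32();
	regval |= bits;
	sim_put32(regval);
}

/* Same shape as the open-coded pair in bge_asf_pre_reset_operations. */
static void
sim_open_coded(uint32_t bits)
{
	uint32_t event;

	event = sim_get32();
	sim_put32(event | bits);
}
```

Both forms do one read followed by one write and leave the same value behind, so switching to the helper would tidy the code but would not close any window between the get and the put.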
>>
>>
>> Also, I noticed several parts of bge_main2.c (line 634) and 
>> bge_chip2.c (lines 4367, 4714) that specifically mention problems with 
>> the IBM BladeCenter HS20 blade. Nothing discussed there seemed 
>> immediately obvious to me, but since you said the code in the area 
>> that triggers the disconnect hasn't changed since S10, I'm wondering 
>> if any of these areas that mention the HS20 have changed since S10?
>>
>> Maybe a problem created by a change in one of them doesn't rear its 
>> head until we get to the code we're all looking at?
>>
>>  -Kyle
>>
>
>


