[fm-discuss] Re: [osol-discuss] NV57 fails on bootup on Athlon 64 3500

Gavin Maltby Gavin.Maltby at Sun.COM
Tue Feb 20 05:23:47 PST 2007


[BCC'd to internal fma interest alias previously on CC list;
CC'd fm-discuss at opensolaris.org instead]

Hi,

What type of system is this?  Is it a rev F system?
You said you had build 48 working on this system before.  From the fma
point of view there was nothing new in builds 47-50, but in build 51
I added rev F support and improved/extended a bunch of the existing
rev B-E support.  If yours is a rev F system (e.g. socket AM2) then
prior to build 51 there would have been no real FMA support, and
in 51 and later you'd have the full support.  That would include
enabling the NorthBridge watchdog if the BIOS has not already
enabled it.

On 02/17/07 04:39, James C. McPherson wrote:
> Jeremy Teo wrote:
>> a successful install of NV57 results in a panic after bootup: error
>> messages as shown below (painstakingly captured by hand)
>>
>> ereport.cpu.amd.nb.wdog ena=204cae318e00001 detector=[ version=0
>> scheme="hc" hc-list=[...] ] bank_status=b200000000070f0f bank-number=4
>> addr=80180c080 add-valid=0 ip=0 privileged=1

This is an NB watchdog timeout error.  The watchdog monitors HyperTransport
requests for which a response is expected - if they take too long
then it aborts the transaction.  We must panic because we don't know
who has lost state because of the error.
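As an aside, the bank_status value in the ereport can be picked apart by
hand.  The sketch below uses the architectural MCi_STATUS flag bits; the
position of the NB extended error code (bits 19:16) and the value 0x7 for
a watchdog timeout are quoted from memory of the BKDG and should be treated
as assumptions.  The struct and function names are made up for illustration
and are not from ao_mca.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural MCi_STATUS flag bits. */
#define MCA_STATUS_VAL   (1ULL << 63)  /* status register holds a valid error */
#define MCA_STATUS_UC    (1ULL << 61)  /* error was not corrected */
#define MCA_STATUS_ADDRV (1ULL << 58)  /* address register is valid */
#define MCA_STATUS_PCC   (1ULL << 57)  /* processor context may be corrupt */

/* Assumption: NB extended error code lives in bits 19:16. */
#define NB_EXT_ERRCODE(s)  ((unsigned)(((s) >> 16) & 0xf))
#define NB_EXT_WDOG        0x7         /* assumption: watchdog timeout */

/* Hypothetical holder for the decoded fields. */
struct nb_status {
        bool valid;
        bool uncorrected;
        bool addr_valid;
        bool ctx_corrupt;
        unsigned ext_code;   /* NB extended error code */
        unsigned err_code;   /* low 16 bits: MCA error code */
};

/* Split a raw bank status value into its flag bits and error codes. */
struct nb_status
nb_status_decode(uint64_t s)
{
        struct nb_status d;

        d.valid = (s & MCA_STATUS_VAL) != 0;
        d.uncorrected = (s & MCA_STATUS_UC) != 0;
        d.addr_valid = (s & MCA_STATUS_ADDRV) != 0;
        d.ctx_corrupt = (s & MCA_STATUS_PCC) != 0;
        d.ext_code = NB_EXT_ERRCODE(s);
        d.err_code = (unsigned)(s & 0xffff);
        return (d);
}
```

Feeding it b200000000070f0f gives valid, uncorrected and context-corrupt set,
address-valid clear (matching add-valid=0 in the ereport), and an extended
code of 7.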

The address 80180c080 is not a physical address in this case but instead
gives us some detail of the transaction.  See BKDG 3.6.4.7:

  - operation type "normal" (not bus lock, apic access, etc)
  - next action would have been to send final response to requestor
  - source node: 0, source unitid: 0, SrcPtr: 1 (CPU on local node)
  - destination node: 0, dest unitid: 3 (hostbridge)
  - not waiting for a posted write
  - waitcode: no waiting condition

... which just tells us that this was an IO access from node 0 core 0
to the IO hostbridge on the same node, and there was no response
(ultimately from whatever further downstream was supposed to
respond).

>> panic[cpu0]/thread=d2dcbde0: Unrecoverable Machine Check Exception
>>
>> d2dcbc9c unix:cmi_mca_trap+46 (d2dcbca8)
>> d2dcbca8 unix:mcetrap+5a (fec301b0, d96f0000,)
>> d2dcbd00 unix:ddi_io_get8+13 (d5baac00)
>> d2dcbd14 ata:ata_get_status+6f (d5baac00, d2dcbd5c)
>> d2dcbd40 ata:ghd_intr+47 (d5baac94, d2dcbd5c)
>> d2dcbd60 ata:ata_intr+22 (d5baac00,0)
>> d2dcbdac unix:av_dispatch_autovect+69 (f)
>> d2dcbdcc unix:dispatch_hardint+1a (f,0)
>>
>> Any ideas? I've tried force-booting into 32 bit mode, disabling the CPU
>> specific module (mentioned by gavinm some time back), as well as disabling
>> HAL, (suspecting that it may be due to bug 6491248

If you have disabled the AMD CPU module (set cmi_force_generic=1
in /etc/system after snv_51) then it looks like your
BIOS is enabling the watchdog (as it probably should).  You could
maybe find a BIOS menu option to disable it - but you may find
that you are then rewarded with a hang instead of a panic (no watchdog
to abort the transaction).  You can also lengthen the watchdog
timeout interval.

Solaris (snv_51 and later) gives some control over our watchdog
enabling policy.  See ao_mca.c:

/*
  * Bits to be used if we configure the NorthBridge (NB) Watchdog.  The watchdog
  * triggers a machine check exception when no response to an NB system access
  * occurs within a specified time interval.
  */
uint32_t ao_nb_cfg_wdog =
     AMD_NB_CFG_WDOGTMRCNTSEL_4095 |
     AMD_NB_CFG_WDOGTMRBASESEL_1MS;

/*
  * The default watchdog policy is to enable it (at the above rate) if it
  * is disabled;  if it is enabled then we leave it enabled at the rate
  * chosen by the BIOS.
  */
enum {
         AO_NB_WDOG_LEAVEALONE,          /* Don't touch watchdog config */
         AO_NB_WDOG_DISABLE,             /* Always disable watchdog */
         AO_NB_WDOG_ENABLE_IF_DISABLED,  /* If disabled, enable at our rate */
         AO_NB_WDOG_ENABLE_FORCE_RATE    /* Enable and set our rate */
} ao_nb_watchdog_policy = AO_NB_WDOG_ENABLE_IF_DISABLED;
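To make the four policy values concrete, here is a minimal sketch of how
such a policy could be applied to an NB configuration value.  It is not
the code from ao_mca.c: the function and enum names are invented, and the
bit positions (watchdog-disable in bit 8, rate selectors in bits 13:9) are
an assumption from memory of the BKDG.

```c
#include <stdint.h>

/* Assumed layout of the watchdog fields in the NB config register. */
#define NB_CFG_WDOG_DIS       (1u << 8)     /* 1 = watchdog disabled */
#define NB_CFG_WDOG_RATEMASK  (0x1fu << 9)  /* count + time base selectors */

/* Same ordering as the enum quoted above; names are illustrative. */
enum wdog_policy {
        WDOG_LEAVEALONE,
        WDOG_DISABLE,
        WDOG_ENABLE_IF_DISABLED,
        WDOG_ENABLE_FORCE_RATE
};

/* Clear the disable bit and install our own rate selectors. */
static uint32_t
wdog_enable_at(uint32_t cfg, uint32_t our_rate)
{
        cfg &= ~(NB_CFG_WDOG_DIS | NB_CFG_WDOG_RATEMASK);
        return (cfg | (our_rate & NB_CFG_WDOG_RATEMASK));
}

/* Return the config value that the given policy would leave in place. */
uint32_t
wdog_apply_policy(uint32_t cfg, enum wdog_policy policy, uint32_t our_rate)
{
        switch (policy) {
        case WDOG_DISABLE:
                return (cfg | NB_CFG_WDOG_DIS);
        case WDOG_ENABLE_IF_DISABLED:
                /* If the BIOS already enabled it, keep the BIOS rate. */
                if (cfg & NB_CFG_WDOG_DIS)
                        return (wdog_enable_at(cfg, our_rate));
                return (cfg);
        case WDOG_ENABLE_FORCE_RATE:
                return (wdog_enable_at(cfg, our_rate));
        case WDOG_LEAVEALONE:
        default:
                return (cfg);
        }
}
```

The point of the default (value 2) is visible in the middle case: a
BIOS-chosen rate survives untouched, and only a disabled watchdog gets
switched on at our rate.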

You could play with those values in /etc/system to see if any of them help.
For example, to force the watchdog off even if the BIOS has enabled it:

set cpu\.AuthenticAMD\.15:ao_nb_watchdog_policy = 1

You could also set ao_nb_cfg_wdog to give a longer timeout period (if disabling
it proves effective, try stretching the interval instead).
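For a feel of the numbers, and assuming the timeout is simply the count
selector multiplied by the time base selector, the default of a 4095 count
on a 1ms base works out to roughly four seconds.  The helper below is
illustrative only and its name is made up.

```c
#include <stdint.h>

/*
 * Watchdog timeout in nanoseconds, assuming timeout = count * time base.
 * 'count' is the selected count (e.g. 4095), 'base_ns' the selected
 * time base expressed in nanoseconds (e.g. 1000000 for 1ms).
 */
uint64_t
wdog_timeout_ns(uint32_t count, uint64_t base_ns)
{
        return ((uint64_t)count * base_ns);
}
```

With the default selectors that is wdog_timeout_ns(4095, 1000000), i.e.
4.095 seconds before the transaction is aborted.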

Gavin


