[fm-discuss] extracting fm ereport and other fm data from crashdumps

Gavin Maltby Gavin.Maltby at Sun.COM
Wed Sep 27 03:56:57 PDT 2006


On 09/25/06 22:38, Gavin Maltby wrote:
> On 09/25/06 22:04, Victor Latushkin wrote:
>> Hello All,
>>
>> I have a crashdump from S10 6/06 box which experienced Fatal System 
>> Bus error and generated crash dump. After recovery I cannot find any 
>> signs of ereports saved into persistent storage. I supposed that after 
>> a panic fault manager should extract such data but it looks like this 
>> is not the case.
> [cut]
> 
> It should do.  Could you post the output of ::errorq and ::errorq -v
> run on the crash dump.

A look at the dump suggests all was set to be preserved on the dump
device and replayed on reboot:

 > ::errorq
ADDR        NAME             S V N  ACCEPT    DROP     LOG
300001ea040 pci_target_queue + !         0       0       0
300001ea2c0 pci_ecc_queue    + !         0       0       0
300001ea7c0 ce_queue         +           0       0       0
300001eaa40 ue_queue         + !         0       0       0
300001eb440 fm_ereport_queue + ! *       2       0       0

So the fm_ereport_queue has accepted 2 events.  ::errorq -v
(an option not available on Solaris 10) shows:

 > 300001eb440::errorq -v
ADDR        NAME             S V N KSTAT   QLEN   SIZE IPL             FUNC
300001eb440 fm_ereport_queue + ! *   |     3072   3096   2 fm_drain
                                      |
                                      +->   DISPATCHED 0
                                               DROPPED 0
                                                LOGGED 0
                                              RESERVED 2
                                          RESERVE FAIL 0
                                             COMMITTED 2
                                           COMMIT FAIL 0
                                             CANCELLED 0

This is the nvlist errorq (N in the ::errorq output) which will have
its members preserved on the dump device during panic.  We can look
at the members, but you have to know that the data type on an
nvlist errorq is errorq_nvelem_t:

 > 300001eb440::walk errorq_data | ::print errorq_nvelem_t eqn_nvl | ::nvlist
class='ereport.io.xmits.saf.parb'
ena=8052d6daf8080001
detector
     version=00
     scheme='dev'
     device-path='/ssm@0,0/pci@1a'
safari-csr=0155555401a00006
safari-err=f8000000000003e0
safari-intr=80000000000fc017
safari-elog=0000000000040000
safari-pcr=0000000000000000
class='ereport.io.xmits.saf.parb'
ena=8052d7be05080001
detector
     version=00
     scheme='dev'
     device-path='/ssm@0,0/pci@1a'
safari-csr=0155555401a00006
safari-err=f8000000000003e0
safari-intr=80000000000fc017
safari-elog=0000000000040000
safari-pcr=0000000000000000

Looking at the errorq_t structure itself, and ignoring the lengthy
embedded kstats:

 > 300001eb440::print errorq_t
{
     eq_name = [ "fm_ereport_queue" ]
     eq_kstat = {
	...
     }
     eq_ksp = kstat_initial+0xf1c8
     eq_func = fm_drain
     eq_private = 0
     eq_data = 0x30001800000
     eq_qlen = 0xc00
     eq_size = 0xc18
     eq_ipl = 0x2
     eq_flags = 0x30001
     eq_id = 0x60004494de8
     eq_lock = {
         _opaque = [ 0 ]
     }
     eq_elems = 0x30000ab6000
     eq_phead = 0
     eq_ptail = 0
     eq_pend = 0x30000ab6020
     eq_free = 0x30000ab6040
     eq_dump = 0
     eq_next = 0
}
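
As a cross-check, the hexadecimal fields printed above can be compared
with the decimal columns of ::errorq -v, and the eq_pend and eq_free
pointers can be turned into offsets within the eq_elems array.  The
sketch below does only that arithmetic.  The 0x20 element size is an
assumption (it is not shown anywhere in the dump output), and
elem_index is a hypothetical helper written for this illustration.

```python
# Values copied from the ::print errorq_t output.
eq_qlen = 0xc00
eq_size = 0xc18
eq_elems = 0x30000ab6000
eq_pend = 0x30000ab6020
eq_free = 0x30000ab6040

# ASSUMPTION: size of one queue element; not shown in the dump.
ASSUMED_ELEM_SIZE = 0x20


def elem_index(addr, base=eq_elems, size=ASSUMED_ELEM_SIZE):
    """Return the index of the element at addr within the eq_elems array."""
    offset = addr - base
    if offset < 0 or offset % size:
        raise ValueError("address is not an element boundary")
    return offset // size


# Decimal values to compare with the QLEN and SIZE columns.
print(eq_qlen, eq_size)

# Positions of the pending and free list heads in the element array.
print(elem_index(eq_pend), elem_index(eq_free))
```

The first line printed should match QLEN 3072 and SIZE 3096.  If the
assumed element size is right, eq_pend and eq_free sit at adjacent
slots near the start of the array, which would be consistent with only
two elements ever having been taken from the free list.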

What is odd is that eq_dump is NULL.  What is supposed to happen is:

  - panicsys() calls errorq_panic, unconditionally
  - errorq_panic calls errorq_panic_drain; errorq_panic_drain or
    errorq_drain which it calls should notice that we're in panic
    and should link the event onto the eq_dump list
  - later in dumpsys() processing we dig up all these elements on the
    eq_dump lists of the errorqs and write them out to the dump device
  - on subsequent fmd startup at boot it replays these events
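
The sequence above can be sketched as a toy model.  This is not the
Solaris source: ErrorQueue, drain and panic_and_dump are hypothetical
names, and the model keeps only a pending list and a dump list per
queue to show where events are expected to end up.

```python
class ErrorQueue:
    """Toy stand-in for an errorq: only the pending and dump lists."""

    def __init__(self, name):
        self.name = name
        self.pend = []  # committed events awaiting drain (cf. eq_pend)
        self.dump = []  # events to write to the dump device (cf. eq_dump)

    def commit(self, event):
        """Queue an event for a later drain."""
        self.pend.append(event)

    def drain(self, panicking, func=None):
        """Drain pending events; during panic, link them onto the dump list."""
        while self.pend:
            event = self.pend.pop(0)
            if panicking:
                self.dump.append(event)
            elif func is not None:
                func(event)


def panic_and_dump(queues):
    """Hypothetical model of the panic-time drain followed by the dump step."""
    for q in queues:
        q.drain(panicking=True)
    # Everything on the dump lists is what would be replayed at next boot.
    return [event for q in queues for event in q.dump]


q = ErrorQueue("fm_ereport_queue")
q.commit("ereport.io.xmits.saf.parb")
q.commit("ereport.io.xmits.saf.parb")
replayed = panic_and_dump([q])
print(len(replayed), len(q.pend))
```

In this model a panic-time drain leaves the pending list empty and the
dump list populated, which is the state one would expect to find in
the crash dump.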

With eq_pend != NULL it looks like errorq_drain was not called, or that
the events arrived after that drain, which seems unlikely since
it is this interrupt thread which enqueued them and called fm_panic.
That could perhaps be explained if two threads chose to panic
at much the same time and the other thread
went through panicsys and errorq_drain before we enqueued the events;
but this interrupt thread is the one which performed the panic,
and nobody else appears to have initiated a panic anyway.

::msgbuf shows

panic[cpu512]/thread=2a10767dcc0:
Fatal System Bus Error has occurred

syncing file systems...
...
  2
  done (not all i/o completed)
dumping to /dev/md/dsk/d106, offset 11019419648, content: kernel


So I can't quite explain what has happened.
Could you log a bug on this in category kernel/ras and attach the
core dump to it (not just a pointer to it, but a full attachment)?

Thanks

Gavin


