[fm-discuss] extracting fm ereport and other fm data from crashdumps
Gavin Maltby
Gavin.Maltby at Sun.COM
Wed Sep 27 03:56:57 PDT 2006
On 09/25/06 22:38, Gavin Maltby wrote:
> On 09/25/06 22:04, Victor Latushkin wrote:
>> Hello All,
>>
>> I have a crashdump from S10 6/06 box which experienced Fatal System
>> Bus error and generated crash dump. After recovery I cannot find any
>> signs of ereports saved into persistent storage. I supposed that after
>> a panic fault manager should extract such data but it looks like this
>> is not the case.
> [cut]
>
> It should do. Could you post the output of ::errorq and ::errorq -v
> run on the crash dump.
A look at the dump suggests all was set to be preserved in the dump
device and replayed on reboot:
> ::errorq
ADDR         NAME              S V N  ACCEPT  DROP  LOG
300001ea040  pci_target_queue  + !         0     0    0
300001ea2c0  pci_ecc_queue     + !         0     0    0
300001ea7c0  ce_queue          +           0     0    0
300001eaa40  ue_queue          + !         0     0    0
300001eb440  fm_ereport_queue  + ! *       2     0    0
So the fm_ereport_queue has accepted 2 events. ::errorq -v
(not on Solaris 10) shows:
> 300001eb440::errorq -v
ADDR         NAME              S V N  KSTAT  QLEN  SIZE  IPL  FUNC
300001eb440  fm_ereport_queue  + ! *  |      3072  3096  2    fm_drain
                                      |
                                      +-> DISPATCHED    0
                                          DROPPED       0
                                          LOGGED        0
                                          RESERVED      2
                                          RESERVE FAIL  0
                                          COMMITTED     2
                                          COMMIT FAIL   0
                                          CANCELLED     0
This is the nvlist errorq (N in the ::errorq output) which will have
its members preserved on the dump device during panic. We can look
at the members, but you have to know that the data type on an
nvlist errorq is errorq_nvelem_t:
> 300001eb440::walk errorq_data | ::print errorq_nvelem_t eqn_nvl | ::nvlist
class='ereport.io.xmits.saf.parb'
ena=8052d6daf8080001
detector
    version=00
    scheme='dev'
    device-path='/ssm@0,0/pci@1a'
safari-csr=0155555401a00006
safari-err=f8000000000003e0
safari-intr=80000000000fc017
safari-elog=0000000000040000
safari-pcr=0000000000000000

class='ereport.io.xmits.saf.parb'
ena=8052d7be05080001
detector
    version=00
    scheme='dev'
    device-path='/ssm@0,0/pci@1a'
safari-csr=0155555401a00006
safari-err=f8000000000003e0
safari-intr=80000000000fc017
safari-elog=0000000000040000
safari-pcr=0000000000000000
Looking at the errorq_t structure itself, and ignoring the lengthy
embedded kstats:
> 300001eb440::print errorq_t
{
    eq_name = [ "fm_ereport_queue" ]
    eq_kstat = {
        ...
    }
    eq_ksp = kstat_initial+0xf1c8
    eq_func = fm_drain
    eq_private = 0
    eq_data = 0x30001800000
    eq_qlen = 0xc00
    eq_size = 0xc18
    eq_ipl = 0x2
    eq_flags = 0x30001
    eq_id = 0x60004494de8
    eq_lock = {
        _opaque = [ 0 ]
    }
    eq_elems = 0x30000ab6000
    eq_phead = 0
    eq_ptail = 0
    eq_pend = 0x30000ab6020
    eq_free = 0x30000ab6040
    eq_dump = 0
    eq_next = 0
}
What is odd is that eq_dump is NULL. What is supposed to happen is:
- panicsys() calls errorq_panic, unconditionally
- errorq_panic calls errorq_panic_drain; either errorq_panic_drain or
the errorq_drain that it calls should notice that we're in panic
and should link each event onto the eq_dump list
- later in dumpsys() processing we dig up all these elements on the
eq_dump lists of the errorqs and write them out to the dump device
- on subsequent fmd startup at boot it replays these events
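To make the expected sequence concrete, here is a toy model in C of the
pend-to-dump handoff described above. It is only a sketch and not the
Solaris errorq code: every toy_ name is hypothetical, the lists are plain
singly linked lists, and no locking or interrupt handling is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's types. */
typedef struct toy_elem {
    struct toy_elem *next;
    int id;
} toy_elem_t;

typedef struct toy_errorq {
    toy_elem_t *pend;   /* models eq_pend: committed, not yet drained (newest first) */
    toy_elem_t *dump;   /* models eq_dump: held back for the dump device */
    int logged;         /* events handed to the normal consumer */
} toy_errorq_t;

int toy_panicking;      /* models "we are in panic" */

/* Commit an event: push it onto the pending list. */
void toy_commit(toy_errorq_t *eq, toy_elem_t *e)
{
    assert(eq != NULL && e != NULL);
    e->next = eq->pend;
    eq->pend = e;
}

/*
 * Drain the pending list. Outside panic the events are consumed;
 * during panic they are appended to the dump list instead, oldest first.
 */
void toy_drain(toy_errorq_t *eq)
{
    toy_elem_t *e, *next, *rev = NULL;
    toy_elem_t **tailp = &eq->dump;

    /* Reverse the newest-first pending list into oldest-first order. */
    for (e = eq->pend; e != NULL; e = next) {
        next = e->next;
        e->next = rev;
        rev = e;
    }
    eq->pend = NULL;

    /* Find the end of the dump list so new events are appended. */
    while (*tailp != NULL)
        tailp = &(*tailp)->next;

    for (e = rev; e != NULL; e = next) {
        next = e->next;
        e->next = NULL;
        if (toy_panicking) {
            *tailp = e;
            tailp = &e->next;
        } else {
            eq->logged++;
        }
    }
}

/* Count the elements on a list. */
size_t toy_count(const toy_elem_t *e)
{
    size_t n = 0;

    for (; e != NULL; e = e->next)
        n++;
    return (n);
}
```

In this model, committing two events, setting toy_panicking and then
calling toy_drain leaves pend empty and two elements on dump. Calling
toy_drain before the commits leaves pend non-NULL and dump empty, which
is the shape of the state seen in this crash dump.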
With eq_pend != NULL it looks like errorq_drain was not called, or that
the event arrived after that drain - which seems unlikely since
it is this interrupt thread which enqueued it and called fm_panic.
That could maybe be explained if two threads chose to panic
at much the same time and the other thread (to this interrupt)
went through panicsys and errorq_drain before we enqueued the events;
but this interrupt thread is the one which performed the panic,
and nobody else appears to have initiated a panic, anyway.
::msgbuf shows
panic[cpu512]/thread=2a10767dcc0:
Fatal System Bus Error has occurred
syncing file systems...
...
2
done (not all i/o completed)
dumping to /dev/md/dsk/d106, offset 11019419648, content: kernel
So I can't quite explain what has happened.
Could you log a bug on this in category kernel/ras and attach the
core dump to it (not just a pointer to it, but a full attachment).
Thanks
Gavin