[fm-discuss] help locating problems...

Gavin Maltby gavin.maltby at sun.com
Thu Aug 10 13:50:35 PDT 2006


Garrett D'Amore wrote:

> In the old days, code that just did cmn_err() was a bit easier because
> we could grep for specific strings.  Now, with the FMA stuff, that
> doesn't seem to work anymore.

... or you got nothing at all, perhaps because the error detector was not
enabled, or because the blanket response to an error interrupt was to
panic/complain with some generic message.

But the point is that the code which catches the error often/typically
is not the code which aggravated the error to begin with (if indeed
it is a driver defect as opposed to a hardware defect).  It is typically
an error interrupt handler or trap handler.  So even if it vomits
a cmn_err directly you're none the wiser.

In your case you have a fault.io.pci.device-interr.  Faults are what get
diagnosed from the incoming ereport telemetry flow.  So you
won't find that string in the kernel source but in the diagnosis
rules.  PCI is diagnosed using the eversholt language, and the
rules for this "internal device fault" are in
usr/src/cmd/fm/eversholt/files/common/pci.esc.  This describes
a fault propagation tree - faults propagate (->) to
errors (aberrant condition present, possibly not yet detected)
which propagate to ereports.  When an ereport is received
eversholt works backwards from the ereport - what the detector
saw - towards a fault, testing various conditions along the way.
There are a bunch of ways to get a device internal fault from
the tree.  But you can look at the ereports associated with
your fault (via fmdump and fmdump -e) and see what particular
class of ereport you did experience.
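
To make the "working backwards" idea concrete, here is a minimal toy
sketch in Python.  It is not eversholt and does not reflect the real
contents of pci.esc: apart from fault.io.pci.device-interr, every class
name below is invented for illustration, and the condition/constraint
testing that eversholt does along the way is left out entirely.  It only
shows how one fault can be reached from several different ereports, and
why knowing which ereport you actually saw narrows things down.

```python
from collections import defaultdict

# Hypothetical propagation rules, written as (cause, effect) pairs.
# Only fault.io.pci.device-interr comes from the discussion; all other
# names are made up and do NOT appear in pci.esc.
PROPAGATIONS = [
    ("fault.io.pci.device-interr", "error.example.internal-a"),
    ("fault.io.pci.device-interr", "error.example.internal-b"),
    ("fault.example.other", "error.example.internal-b"),
    ("error.example.internal-a", "ereport.example.x"),
    ("error.example.internal-b", "ereport.example.y"),
]


def build_reverse_index(propagations):
    """Map each effect to the set of causes that can propagate to it."""
    causes = defaultdict(set)
    for cause, effect in propagations:
        causes[effect].add(cause)
    return causes


def candidate_faults(ereport, propagations=PROPAGATIONS):
    """Walk backwards from an ereport to every fault that could explain it."""
    causes = build_reverse_index(propagations)
    found, seen, stack = set(), set(), [ereport]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node.startswith("fault."):
            # Reached a fault: record it and stop walking this branch.
            found.add(node)
            continue
        stack.extend(causes.get(node, ()))
    return sorted(found)


if __name__ == "__main__":
    for ereport in ("ereport.example.x", "ereport.example.y"):
        print(ereport, "<-", candidate_faults(ereport))
```

In this toy tree the made-up ereport.example.x points at only one fault,
while ereport.example.y leaves two candidates.  That is the reason to
check the ereport class that fmdump -e reports rather than starting from
the fault name alone.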

Hope that helps

Cheers

Gavin


