[fm-discuss] help locating problems... (fwd)

Cynthia McGuire cindi at sun.com
Thu Aug 10 18:06:52 PDT 2006


Thanks for forwarding this, Tim.  I seem to have been dropped from the 
alias.


>>> ---------- Forwarded message ----------
>>> Date: Wed, 09 Aug 2006 21:06:43 -0700
>>> From: Garrett D'Amore <garrett_damore at tadpole.com>
>>> To: Tim Haley <timh at central.sun.com>
>>> Cc: fm-discuss at opensolaris.org
>>> Subject: Re: [fm-discuss] help locating problems...
>>>
>>> Tim Haley wrote:
>>>>
>>>> And regarding the "code that generates the fault", there isn't one
>>>> particular call in the kernel that generates that fault.  Instead code
>>>> in the kernel has generated one or more ereports (short for error
>>>> reports).  Those error reports have come from the kernel to the
>>>> user-land fmd.  The fmd has recorded them in its error log.  An fmd
>>>> plugin, eft, has then coalesced those ereports and decided what fault
>>>> can cause the observed symptoms.  Eft has then published something
>>>> called a suspect list, which in this case has one entry, the fault you
>>>> are seeing.  The suspect list was broadcast to another plugin, the
>>>> syslog messaging agent, which should have put a summary of this "case"
>>>> into syslog and onto the console.
>>>
>>> I already more-or-less gathered this high level description.
>>>
>>> What I'm having a hard time with is figuring out which code is sending the
>>> named fault.  The current fault framework makes it very hard (at least
>>> to me) to identify this by, for example, grepping for
>>> fault.io.pci.device-interr in the framework.
>>>
>>> Is there some easy correlation where I can find things like these
>>> strings in a call from (for example) the PCI nexus driver?
>>>
>>> It would be really, really helpful to be able to go from the logged
>>> message to a line of code somewhere.   The notion "well, you had a
>>> random PCI fault on this particular piece of hardware" is really useful
>>> if I am a sysadmin and need to replace the hardware.  But as an engineer
>>> developing hardware platforms or writing device driver code, this is a
>>> lot less useful.

As Tim and Gavin have already mentioned, the code that generates the 
fault event (fault.io.pci.device-interr) is part of the fault manager: 
in this particular case, the eft diagnosis engine, working from the 
rules described in usr/src/cmd/fm/eversholt/files/common/pci.esc.

The fault manager and its diagnosis engines take as input ereport 
events that are typically generated from the kernel, chew on the event 
content a bit, and publish a diagnosis.  The result is what we call a 
list of suspected faults (or list.suspect).  In this particular case, 
our diagnosis software has determined with 100% certainty that there is 
a problem with the PCI device at pci@1f,700000/network@3 located on the 
motherboard.  This is likely an embedded NIC.

As a platform or driver developer, you want to understand how error 
detection and reporting work.  Many of the cmn_err() messages were 
converted to generate ereport events for diagnosis.  There are two new 
functions to do this: pci_ereport_post() and ddi_fm_ereport_post().  
pci_ereport_post() captures and reports errors in the PCI config status 
register and, for PCI Express, the Advanced Error Reporting registers.  
ddi_fm_ereport_post() allows drivers to generate device-specific 
ereports.

If you run the command fmdump -V -u 
35fc7dee-e5b9-6028-d333-cbcd5a272c35, you should see a listing of the 
ereport events that the diagnosis software used to determine the type of 
fault that occurred.  fmdump -eV will display even more ereport payload 
information.  So instead of trudging through /var/adm/messages for 
haphazard error messages from drivers, you can use these new commands to 
get information using a well-defined protocol for describing errors.

>>>
>>> (In this particular case, we have seen some other platform-wide problems
>>> with ethernet on this particular device -- too many packet drops in
>>> sunvts for example.  We don't know why, but I'd really like to be able
>>> to correlate this to what some driver thinks is going wrong, because
>>> then I stand a much better chance of correlating it to a problem that
>>> may exist with, for example, the design of the platform (maybe some kind
>>> of electrical problem or a mis-connected trace or somesuch.)

And I think that is exactly what you see here.  The diagnosis software 
has determined that device pci@1f,700000/network@3 is broken and in need 
of replacement.  Looking at the knowledge article content for 
PCI-8000-7J, I don't think we do a very good job of describing the 
problem, and that should be fixed.  What I would like to see is a better 
description of the fault type and of any side-effect symptoms that may 
be observed (e.g. dropped packets).  Feel free to add feedback at 
http://sun.com/msg/PCI-8000-7J, keeping in mind that the primary reader 
of the content is an administrator or field service.

As for correlating other types of errors, such as too many dropped 
packets, you can do that too.  The NIC driver may be instrumented to 
generate its own ereports, which can be used in diagnosis rules that you 
develop for your platform or device.  If you want a sneak peek at what 
an FMA-aware NIC driver might look like, see bge.  It is now available 
in OpenSolaris.

We don't quite have all the documentation in place to tell you how to 
develop FMA-aware drivers of your own, but it's coming very soon.

>>>
>>> In the old days, code that just did cmn_err() was a bit easier because
>>> we could grep for specific strings.  Now, with the FMA stuff, that
>>> doesn't seem to work anymore.

You can still do that.  Use fmdump -V or fmdump -e to get the ereport 
class names.  You can then grep the source code for those strings. 
Hint: just use the leaf of the ereport class (i.e. the string after the 
last dot).
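
To make the hint concrete, here is a minimal sketch in C of what I mean 
by the "leaf" of a class name.  leaf_class() is a hypothetical helper 
written only for illustration; it is not part of fmd or the DDI.  It 
simply returns the text after the last dot.  The leaf is the handiest 
thing to grep for because the full dotted class is not guaranteed to 
appear in the source as a single literal string.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.
 * Return the leaf of a dotted FMA class name: the text after the last
 * dot, or the whole string when it contains no dot at all.
 * The returned pointer refers into class_name; nothing is allocated.
 */
const char *
leaf_class(const char *class_name)
{
    const char *dot = strrchr(class_name, '.');

    return (dot != NULL ? dot + 1 : class_name);
}
```

For the fault class discussed in this thread, 
leaf_class("fault.io.pci.device-interr") yields "device-interr".  The 
same approach applies to an ereport class name taken from fmdump -e 
output.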


Cindi


