[fm-discuss] help locating problems... (fwd)
Garrett D'Amore
garrett_damore at tadpole.com
Thu Aug 10 19:19:16 PDT 2006
Cynthia McGuire wrote:
> Thanks for forwarding this, Tim. I seem to have been dropped from the
> alias.
Thanks Cindi! This is helpful; maybe now I can figure out what exactly
is wrong. :-)
-- Garrett
>
>
>>>> ---------- Forwarded message ----------
>>>> Date: Wed, 09 Aug 2006 21:06:43 -0700
>>>> From: Garrett D'Amore <garrett_damore at tadpole.com>
>>>> To: Tim Haley <timh at central.sun.com>
>>>> Cc: fm-discuss at opensolaris.org
>>>> Subject: Re: [fm-discuss] help locating problems...
>>>>
>>>> Tim Haley wrote:
>>>>>
>>>>> And regarding the "code that generates the fault", there isn't one
>>>>> particular call in the kernel that generates that fault. Instead
>>>>> code in the kernel has generated one or more ereports (short for
>>>>> error reports). Those error reports have come from the kernel to
>>>>> the user-land fmd. The fmd has recorded them in its error log. An
>>>>> fmd plugin, eft, has then coalesced those ereports and decided what
>>>>> fault can cause the observed symptoms. Eft has then published
>>>>> something called a suspect list, which in this case has one entry,
>>>>> the fault you are seeing. The suspect list was broadcast to another
>>>>> plugin, the syslog messaging agent, which should have put a summary
>>>>> of this "case" into syslog and onto the console.
>>>>
>>>> I already more-or-less gathered this high level description.
>>>>
>>>> What I'm having a hard time with is figuring out which code is
>>>> sending the named fault. The current fault framework makes it very
>>>> hard (at least for me) to identify this by, for example, grepping for
>>>> fault.io.pci.device-interr in the framework.
>>>>
>>>> Is there some easy correlation where I can find things like these
>>>> strings in a call from (for example) the PCI nexus driver?
>>>>
>>>> It would be really, really helpful to be able to go from the logged
>>>> message to a line of code somewhere. The notion "well, you had a
>>>> random PCI fault on this particular piece of hardware" is really
>>>> useful if I am a sysadmin and need to replace the hardware. But as
>>>> an engineer developing hardware platforms or writing device driver
>>>> code, this is a lot less useful.
>
> As Tim and Gavin have already mentioned, the code that generates the
> fault event (fault.io.pci.device-interr) is part of the fault manager:
> in this particular case, the eft diagnosis engine, acting according to
> the rules described in usr/src/cmd/fm/eversholt/files/common/pci.esc.
>
> The fault manager and its diagnosis engines take as input ereport
> events that are typically generated by the kernel, chew on the
> event content a bit, and publish a diagnosis, the result of which is
> what we call a list of suspected faults (or list.suspect). In this
> particular case, our diagnosis software has determined with 100%
> certainty that there is a problem with the PCI device at
> pci@1f,700000/network@3 located on the motherboard. This is likely an
> embedded NIC.
>
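The flow described above (ereports in, suspect list out) can be sketched
as a toy model. This is only an illustration, not how eft works
internally: the rule table, the "example-a"/"example-b" ereport class
names, and the even split of certainty are all assumptions made up for
the sketch. The real rules are written in the Eversholt language in
pci.esc.

```python
from collections import namedtuple

# An ereport: an error class name plus the device it was observed on.
Ereport = namedtuple("Ereport", ["cls", "device"])
# One entry of a suspect list: a fault, its device, and a certainty in %.
Suspect = namedtuple("Suspect", ["fault", "device", "certainty"])

# HYPOTHETICAL rule table mapping a fault to the ereport classes that
# implicate it. The ereport class names here are invented placeholders.
RULES = {
    "fault.io.pci.device-interr": {
        "ereport.io.pci.example-a",
        "ereport.io.pci.example-b",
    },
}


def diagnose(ereports):
    """Coalesce ereports per device and return a list of suspects."""
    seen = {}
    for report in ereports:
        seen.setdefault(report.device, set()).add(report.cls)

    suspects = []
    for device, classes in sorted(seen.items()):
        # Every fault whose trigger set overlaps the observed classes.
        matched = [f for f, trig in sorted(RULES.items()) if classes & trig]
        for fault in matched:
            # Assumption: certainty is split evenly among candidates.
            suspects.append(Suspect(fault, device, 100 // len(matched)))
    return suspects


if __name__ == "__main__":
    reports = [Ereport("ereport.io.pci.example-a", "pci@1f,700000/network@3")]
    for suspect in diagnose(reports):
        print(suspect)
```

With a single matching rule the sketch yields one suspect at 100%
certainty, which mirrors the one-entry suspect list in this case.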
> As a platform or driver developer, you want to understand how error
> detection and reporting work. Many of the cmn_err() messages were
> converted to generate ereport events for diagnosis. There are two new
> functions to do this: pci_ereport_post() and ddi_fm_ereport_post().
> pci_ereport_post() captures and reports errors in the PCI config
> status register and, for PCI Express, the Advanced Error Reporting
> registers. ddi_fm_ereport_post() allows drivers to generate
> device-specific ereports.
>
> If you run the command fmdump -V -u
> 35fc7dee-e5b9-6028-d333-cbcd5a272c35, you should see a listing of the
> ereport events that the diagnosis software used to determine the type
> of fault that occurred. fmdump -eV will display even more ereport
> payload information. So instead of trudging through /var/adm/messages
> for haphazard error messages from drivers, you can use these new
> commands to get information using a well-defined protocol for
> describing errors.
>
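For anyone who wants to script the two fmdump invocations mentioned
above, here is a minimal sketch. It uses only the flags named in this
message (-V, -u, -eV); the helper names are hypothetical, and the output
is returned as raw text because no output format is assumed here.

```python
import shutil
import subprocess


def fault_cmd(uuid):
    """Command line listing the ereports behind one diagnosed fault."""
    return ["fmdump", "-V", "-u", uuid]


def ereport_cmd():
    """Command line dumping the error log with full ereport payloads."""
    return ["fmdump", "-eV"]


def run(cmd):
    """Run cmd and return its stdout, or None if the tool is missing."""
    if shutil.which(cmd[0]) is None:
        return None  # fmdump is not installed on this machine
    done = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return done.stdout


if __name__ == "__main__":
    text = run(fault_cmd("35fc7dee-e5b9-6028-d333-cbcd5a272c35"))
    print(text if text is not None else "fmdump not found")
```

Reading the fault and error logs normally requires sufficient
privileges, so run() may return empty output for an ordinary user.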
>>>>
>>>> (In this particular case, we have seen some other platform-wide
>>>> problems with ethernet on this particular device -- too many packet
>>>> drops in sunvts for example. We don't know why, but I'd really like
>>>> to be able to correlate this to what some driver thinks is going
>>>> wrong, because then I stand a much better chance of correlating it
>>>> to a problem that may exist with, for example, the design of the
>>>> platform (maybe some kind of electrical problem or a mis-connected
>>>> trace or somesuch).)
>
> And I think that is exactly what you see here. The diagnosis software
> has determined that device pci@1f,700000/network@3 is broken and in
> need of replacement. In looking at the knowledge article content for
> PCI-8000-7J, I don't think we do a very good job of describing the
> problem and that should be fixed. What I would like to see is a
> better description of the fault type and of other side-effect symptoms
> that may be observed (e.g. dropped packets). Feel free to add
> feedback at http://sun.com/msg/PCI-8000-7J, keeping in mind that the
> primary reader of the content is an administrator or field service.
>
> As for correlating other types of errors, such as too many dropped
> packets: you can do that too. The NIC driver may be instrumented to
> generate its own ereports, which can be used in diagnosis rules that
> you can develop for your platform or device. If you want a sneak peek
> at what an FMA-aware NIC driver might look like, see bge. It is now
> available in OpenSolaris.
>
> We don't quite have all the documentation in place to tell you how to
> develop FMA-aware drivers of your own, but it's coming very soon.
>
>>>>
>>>> In the old days, code that just did cmn_err() was a bit easier because
>>>> we could grep for specific strings. Now, with the FMA stuff, that
>>>> doesn't seem to work anymore.
>
> You can still do that. Use fmdump -V or fmdump -e to get the ereport
> class names. You can then grep the source code for those strings.
> Hint: just use the leaf of the ereport class (i.e. the string after
> the last dot).
>
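The hint above (grep for the leaf of the ereport class) can be sketched
as a small script. The function names, the file suffixes searched, and
the "example-a" class name in the usage note are assumptions for
illustration; a plain grep -r over the source tree does the same job.

```python
import os


def leaf_class(ereport_class):
    """Return the part of a dotted class name after the last dot."""
    return ereport_class.rsplit(".", 1)[-1]


def grep_tree(root, needle, suffixes=(".c", ".h", ".esc")):
    """Return (path, line number, line) for each line containing needle.

    Only files whose names end in one of `suffixes` are searched; the
    default suffix list is an assumption about where class names live.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            # Replace undecodable bytes rather than fail on odd files.
            with open(path, errors="replace") as handle:
                for lineno, line in enumerate(handle, 1):
                    if needle in line:
                        hits.append((path, lineno, line.rstrip("\n")))
    return hits
```

For example, grep_tree("usr/src", leaf_class("ereport.io.pci.example-a"))
would search the tree for the string "example-a". Very short leaf names
will match many unrelated lines, so expect to filter the hits by hand.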
>
> Cindi
> _______________________________________________
> fm-discuss mailing list
> fm-discuss at opensolaris.org
--
Garrett D'Amore, Principal Software Engineer
Tadpole Computer / Computing Technologies Division,
General Dynamics C4 Systems
http://www.tadpolecomputer.com/
Phone: 951 325-2134 Fax: 951 325-2191