[fm-discuss] help locating problems...
Garrett D'Amore
garrett_damore at tadpole.com
Wed Aug 9 21:06:43 PDT 2006
Tim Haley wrote:
>
> And regarding the "code that generates the fault", there isn't one
> particular call in the kernel that generates that fault. Instead code
> in the kernel has generated one or more ereports (short for error
> reports). Those error reports have come from the kernel to the
> user-land fmd. The fmd has recorded them in its error log. An fmd
> plugin, eft, has then coalesced those ereports and decided what fault
> can cause the observed symptoms. Eft has then published something
> called a suspect list, which in this case has one entry, the fault you
> are seeing. The suspect list was broadcast to another plugin, the
> syslog messaging agent, which should have put a summary of this "case"
> into syslog and onto the console.
I already more or less gathered this high-level description.
What I'm having a hard time with is figuring out which code is sending the
named fault. The current fault framework makes it very hard (at least
for me) to identify this by, for example, grepping for
fault.io.pci.device-interr in the framework.
Is there some easy correlation where I can find things like these
strings in a call from (for example) the PCI nexus driver?
It would be really, really helpful to be able to go from the logged
message to a line of code somewhere. The notion "well, you had a
random PCI fault on this particular piece of hardware" is really useful
if I am a sysadmin and need to replace the hardware. But as an engineer
developing hardware platforms or writing device driver code, this is a
lot less useful.
(In this particular case, we have seen some other platform-wide problems
with Ethernet on this particular device: too many packet drops in
SunVTS, for example. We don't know why, but I'd really like to be able
to correlate this to what some driver thinks is going wrong, because
then I stand a much better chance of correlating it to a problem that
may exist with, for example, the design of the platform, maybe some kind
of electrical problem or a mis-connected trace or somesuch.)
In the old days, code that just did cmn_err() was a bit easier because
we could grep for specific strings. Now, with the FMA stuff, that
doesn't seem to work anymore.
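
A toy sketch of that indirection, in Python, may make the problem
concrete. It only models the flow described above (drivers post ereports,
a diagnosis engine turns a set of ereports into a fault). Every ereport
class, the rule table, the pretend driver source, and the helper
functions are hypothetical and not taken from the real framework; only
the fault name comes from the report quoted below.

```python
# Toy model only: every ereport class, rule, and source line below is
# hypothetical and not taken from the real Solaris sources.

# What a driver might contain: calls that post ereports, never the fault name.
DRIVER_SOURCE = [
    'post_ereport("ereport.io.example.parity-error")',
    'post_ereport("ereport.io.example.target-abort")',
]

# What a diagnosis engine's rules might contain: ereports in, fault out.
DIAGNOSIS_RULES = {
    "fault.io.pci.device-interr": {
        "ereport.io.example.parity-error",
        "ereport.io.example.target-abort",
    },
}


def grep(needle, lines):
    """Return the lines that contain the given substring."""
    return [line for line in lines if needle in line]


def diagnose(ereports, rules):
    """Return every fault whose required ereports were all observed."""
    seen = set(ereports)
    return [fault for fault, needed in rules.items() if needed <= seen]


def ereports_behind(fault, rules):
    """Walk a rule backwards: which ereport classes can lead to this fault?"""
    return sorted(rules.get(fault, set()))


if __name__ == "__main__":
    fault = "fault.io.pci.device-interr"
    # Searching the driver for the fault string finds nothing.
    print(grep(fault, DRIVER_SOURCE))
    # Searching for the ereports behind the fault does find driver lines.
    for ereport in ereports_behind(fault, DIAGNOSIS_RULES):
        print(ereport, grep(ereport, DRIVER_SOURCE))
```

If the real framework behaves like this model, the string to grep for
in driver code would be the ereport class rather than the fault class,
and what is missing is an easy way to get from one to the other.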
-- Garrett
>
> That's a really brief description of the fault management architecture.
>
> -tim
>
> On Wed, 9 Aug 2006, Garrett D'Amore wrote:
>
>> Okay, I'm new to FMA. Solaris b44 is reporting the following error:
>>
>> TIME                 UUID                                 SUNW-MSG-ID
>> Aug 09 10:47:31.0395 35fc7dee-e5b9-6028-d333-cbcd5a272c35 PCI-8000-7J
>>   100%  fault.io.pci.device-interr
>>
>> Problem in:
>> hc:///motherboard=0/hostbridge=0/pcibus=1/pcidev=3/pcifn=0
>> Affects: dev:////pci@1f,700000/network@3
>> FRU: hc:///component=MB
>>
>>
>>
>> What I can't figure out easily is how to determine where the code that
>> is generating this fault resides. I'm guessing it is in pcisch, but
>> honestly, I'm a bit at a loss.
>>
>> Also, are there any design documents for FMA available anywhere?
>> Ideally I'd like to have something that both helps me figure out problems
>> like this (down to tracking it down to a line of code), and also gives
>> me information so that I know how to start adding code to inject my own
>> errors from the code that I've written. (E.g. how do I play with FMA in
>> an unbundled NIC driver, etc.)
>>
>> --
>> Garrett D'Amore, Principal Software Engineer
>> Tadpole Computer / Computing Technologies Division,
>> General Dynamics C4 Systems
>> http://www.tadpolecomputer.com/
>> Phone: 951 325-2134 Fax: 951 325-2191
>>