[fm-discuss] FMA stuff for nexus/framework drivers?
cindi
cindi at sun.com
Fri Dec 14 18:40:33 PST 2007
Hi Garret,
It's nice to see someone tackling another IO subsystem.
You should separate 'faults' from 'errors'. In the FMA, a fault is
defined as something that is broken (and associated with a piece of
hardware) or defective (and associated with a piece of code). What you
have described below are errors. Errors are symptoms produced by
faults. We can use the information captured at the time the error is
detected to work out what is broken or defective.
It's kinda like when you go to the doctor with a bunch of symptoms that
you've noticed and ask for a diagnosis. You wouldn't want the doctor to
just reiterate your symptoms back to you. You want the doctor to tell
you what's wrong with you. That's what we do with FMA: error information
is captured by the error detectors and fed to a diagnosis engine that
tells us what's broken. For example, a PCI parity error results in a
diagnosis that tells us that a PCI card may be busted and needs to be
replaced.
Sorry for the diatribe but it's important to make sure we're on the same
page.
First thing to do is describe the different types of faulty or defective
components in your subsystem. Something like:
- controller
- sdcard
- firmware (?)
- target
We call these ASRUs (or sometimes resources).
Now think about how each of the error symptoms below can be explained by
one or more faults in your ASRU list. What algorithm would you use
given each possible error or set of errors to diagnose the problem and
answer the question: what's broken?
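To make that exercise concrete, here is a minimal sketch of a symptom-to-suspect table in plain C. It is not FMA code. The symptom names and, in particular, which ASRUs each symptom implicates are assumptions made up for illustration; the real mapping has to come from knowledge of the hardware. The diagnose() helper intersects the suspect sets of all observed symptoms, which assumes that a single fault explains everything that was seen.

```c
#include <stddef.h>

/* Candidate ASRUs from the list above, as bit flags. */
enum asru {
	ASRU_CONTROLLER = 1 << 0,
	ASRU_SDCARD     = 1 << 1,
	ASRU_FIRMWARE   = 1 << 2,
	ASRU_TARGET     = 1 << 3
};

#define	ASRU_ALL \
	(ASRU_CONTROLLER | ASRU_SDCARD | ASRU_FIRMWARE | ASRU_TARGET)

/* Hypothetical symptom names, loosely based on the cases quoted below. */
enum symptom {
	SYM_PCI_PARITY,
	SYM_OVER_CURRENT,
	SYM_DATA_CRC,
	SYM_ILLEGAL_VOLTAGE,
	SYM_RCA_FAILED,
	SYM_COUNT
};

/* ASSUMPTION: illustrative mapping only, not a validated diagnosis. */
static const unsigned suspects[SYM_COUNT] = {
	[SYM_PCI_PARITY]      = ASRU_CONTROLLER,
	[SYM_OVER_CURRENT]    = ASRU_CONTROLLER | ASRU_SDCARD,
	[SYM_DATA_CRC]        = ASRU_CONTROLLER | ASRU_SDCARD,
	[SYM_ILLEGAL_VOLTAGE] = ASRU_SDCARD,
	[SYM_RCA_FAILED]      = ASRU_SDCARD
};

/*
 * Return the set of ASRUs that could explain every observed symptom,
 * assuming one fault is responsible for all of them. Returns 0 when
 * no symptoms were given or when no single ASRU explains them all.
 */
unsigned
diagnose(const enum symptom *seen, size_t n)
{
	unsigned set = ASRU_ALL;
	size_t i;

	if (n == 0)
		return (0);
	for (i = 0; i < n; i++)
		set &= suspects[seen[i]];
	return (set);
}
```

With this table, a data CRC error alone leaves both the controller and the card as suspects, while a CRC error together with a failed relative card address narrows it to the card.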
Garrett D'Amore wrote:
> First a bit of background. I've developed a framework for SDcard
> drivers called "sda". This supports both host drivers (e.g. "sdhost")
> and target drivers (e.g. "sdcard"). Actually, "sdcard" itself is a
> pseudo-nexus driver like scsa2usb... it allows "sd(7d)" to act as the
> ultimate target for these kinds of memory cards. The full details are
> in PSARC 2007/659 (SDcard Stack Phase I.)
>
> So what I'm trying to figure out is how to "enable" this stuff for FMA.
> (Or, alternatively, get an appropriate waiver. That might not be as bad
> as it sounds... it's probably pretty unlikely that anyone will care
> too much if their SDcard goes south... just remove and reinsert in most
> cases.)
>
> There are several classes of fault that I can imagine occurring:
>
> 1) errors coming from the host's parent. E.g. PCI parity errors, etc.
> I think I understand the docs on how to do this.
Here, I think your nexus or framework need simply call
pci_ereport_post() and the generic PCI diagnosis algorithms should work
out the faulty ASRU (controller).
>
> 2) errors that are specific to the host controller. E.g. an
> over-current error, or a CRC error interrupt on the SD data pins.
These errors sound hardware specific and you may need to define special
diagnosis algorithms but perhaps there are certain classes of errors
that can be diagnosed by a general-purpose algorithm.
>
> 3) errors that only the framework can tell. E.g. the card is requesting
> an illegal voltage change, or the card has failed to generate a
> "relative card address" properly after several attempts. Clearly it
> would be nice if the framework could participate here.
Absolutely. This is where the framework can detect and report errors
(ereport events) and diagnose problems that are common for all
components under its control w/o having to involve your consumers.
Typically, what happens is you develop an error reporting interface (ala
pci_fm_ereport_post()) for errors detected by the framework. You can
use fm_ereport_post() (uts/common/os/fm.c) or ddi_fm_ereport_post()
(uts/common/os/ddifm.c) as the underlying implementation.
ddi_fm_ereport_post() is evolving whereas the interfaces in fm.c are
project private. Think about the ereport classes and event payload
your diagnosis software will need to work out what's wrong and design
the interfaces accordingly.
And just like for 2), you'll need to come up with the algorithms to
diagnose these errors and determine which ASRUs (resources) are faulty.
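As a rough illustration of designing the class and payload first, here is a user-space sketch of what a framework-level ereport record might carry. Everything in it is hypothetical: the struct, the sda_ereport_init() name, the "sda." class prefix, and the payload members are assumptions, not existing interfaces. In the kernel, the point where the record is filled in is where the framework would call whichever underlying posting routine it settles on.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define	SDA_CLASS_MAX	64

/* Hypothetical payload a diagnosis engine might want for framework errors. */
struct sda_ereport {
	char     eclass[SDA_CLASS_MAX];	/* e.g. "sda.card.rca-failed" (made up) */
	uint32_t slot;			/* slot on the host controller */
	uint32_t attempts;		/* retries made before reporting */
	uint32_t detail;		/* error-specific value, e.g. voltage */
};

/*
 * Fill in an ereport record for the given subclass. Returns 0 on
 * success, or -1 if the resulting class name would not fit.
 */
int
sda_ereport_init(struct sda_ereport *er, const char *subclass,
    uint32_t slot, uint32_t attempts, uint32_t detail)
{
	int n = snprintf(er->eclass, sizeof (er->eclass), "sda.%s", subclass);

	if (n < 0 || (size_t)n >= sizeof (er->eclass))
		return (-1);
	er->slot = slot;
	er->attempts = attempts;
	er->detail = detail;
	return (0);
}
```

The useful part of the exercise is the member list: if the diagnosis needs to know how many attempts were made or which slot was involved, that has to be in the payload from the start.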
>
> 4) errors that the target driver can tell. E.g. a target-specific error
> in response to a block transfer. (E.g. an attempt to write a block to a
> protected sector.)
I think you can punt here to the common sd FMA project.
So now, you need to think about how you want to deliver your diagnosis
software. The algorithms can range from simple (map an error to a
fault) to complex. Some errors you may want to feed through serd
engines such that a certain number of errors have to occur before a
fault diagnosis is issued. Other diagnoses may rely upon the occurrence
of a particular combination of errors.
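To show the idea behind a serd engine (N errors within time T before a fault is diagnosed), here is a small stand-alone model in C. It is not the fmd implementation or its API. The struct and function names are made up, timestamps are assumed to be non-decreasing, and the model does not reset itself after firing.

```c
#include <stdint.h>

#define	SERD_MAX_N	16

/* Hypothetical model of a serd engine: fire on n events within t. */
struct serd {
	unsigned n;			/* events needed to fire */
	uint64_t t;			/* window length, in timestamp units */
	uint64_t stamp[SERD_MAX_N];	/* ring of the last n timestamps */
	unsigned count;			/* total events recorded so far */
};

/* Returns 0 on success, -1 if n is zero or larger than the ring. */
int
serd_init(struct serd *s, unsigned n, uint64_t t)
{
	if (n == 0 || n > SERD_MAX_N)
		return (-1);
	s->n = n;
	s->t = t;
	s->count = 0;
	return (0);
}

/*
 * Record one event at time 'now'. Returns 1 if the most recent n
 * events (including this one) all fall within t of each other,
 * otherwise 0. Timestamps must not go backwards.
 */
int
serd_record(struct serd *s, uint64_t now)
{
	uint64_t oldest;

	s->stamp[s->count % s->n] = now;
	s->count++;
	if (s->count < s->n)
		return (0);
	/* The slot that will be overwritten next holds the oldest of the n. */
	oldest = s->stamp[s->count % s->n];
	return (now - oldest <= s->t);
}
```

For example, with n = 3 and t = 10, events at times 0, 5 and 20 do not fire the engine, but further events at 22 and 25 do, since 20, 22 and 25 fall within a window of 10.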
In any case, there are two ways to code your diagnosis software. The
first is to write a set of eft diagnosis rules like you see for PCI.
The second is to write a C-based fmd diagnosis plugin that subscribes
to your particular error reports (ereports).
If most of your diagnoses are simple 1-to-1 mappings of errors to
faults, eft is probably your best bet. On the other hand, complicated
algorithms can be tricky to express in an eft rule set.
>
> What I would like to do is have some help/guidance in figuring out how
> to architect FMA for this kind of solution. I did see PCI support, but
> I'm not finding any other good examples of my kind of framework with FMA
> support. (Notably neither USB nor 1394 frameworks have FMA support.)
> Can anyone offer specific advice or documentation to read? I've read
> the published documentation that I could find, but it seemed pretty
> specific to leaf-drivers, and I'm not sure how to get something like
> cases #2 and #3 handled properly.
This should be as clear as mud by now. Instructions on how to develop a
diagnosis plugin are given in the fmd PRM (see
http://opensolaris.org/os/community/fm). For samples in developing
ereport generation interfaces for your framework, search the OpenSolaris
code for calls to fm_ereport_post(). The final thing you'll need to do
is write a libtopo enumerator to tack on the SD topology (list of ASRU
and resource instances controlled by the sdcard framework). The latest
PRM describes libtopo and how to write an enumerator. There are also
plenty of examples in the source (lib/fm/topo/modules).
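Before writing the enumerator, it can help to sketch the topology you intend to publish. The following is a plain C model only and does not use libtopo. The node names (sdhost, sdslot, sdcard), the struct, and the node_path() helper are all hypothetical; the name=instance path form is only meant to suggest how each ASRU or resource instance could be named.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical model of one node in the SD topology. */
struct sd_topo_node {
	const char *name;			/* node type, e.g. "sdcard" */
	int instance;				/* instance number under parent */
	const struct sd_topo_node *parent;	/* NULL for the top node */
};

/*
 * Write the path "/name=inst/name=inst..." from the top node down to
 * 'n' into buf. Returns the path length, or -1 if buf is too small.
 */
int
node_path(const struct sd_topo_node *n, char *buf, size_t len)
{
	int off = 0;
	int w;

	if (n->parent != NULL) {
		off = node_path(n->parent, buf, len);
		if (off < 0)
			return (-1);
	}
	w = snprintf(buf + off, len - (size_t)off, "/%s=%d",
	    n->name, n->instance);
	if (w < 0 || (size_t)w >= len - (size_t)off)
		return (-1);
	return (off + w);
}
```

A card in slot 1 of the first host controller would come out as /sdhost=0/sdslot=1/sdcard=0 in this model.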
As far as your list of deliverables goes, it will look something like:
- specification of ereport events for sdcard framework for 3)
- optional specification of ereport events for controllers for 2)
- ereport generation routine for sdcard framework for 3)
- optional ereport generation routine for controller drivers for 2)
- diagnosis plugin or eft rules for 3) and optionally 2)
- libtopo enumerator for the sdcard topology
Cindi