[osol-discuss] Project Proposal: Fault Management Event Registry
Eric Boutilier
Eric.Boutilier at Sun.COM
Thu Mar 1 07:41:56 PST 2007
Thanks, Cindi. You have seconds. I'll contact you offline to get you
set up.
Eric
On Tue, 27 Feb 2007, cindi wrote:
>
>
> This project will export the fault management registry of event specifications
> and diagnosis article content. The initial delivery to the OpenSolaris
> community will include the registry contents and a set of CLIs
> and web-based browser tools to access event class and payload specifications,
> diagnosis messages and article details.
>
> The initial target audience of this project is system administrators and
> developers who want a listing of the possible fault diagnosis messages
> and the event class and payload specifications.
>
> For example, the ermsg command may be used to list the diagnosis message IDs
> for all AMD Opteron and Athlon 64 processor diagnoses:
>
> # ermsg -a AMD
> Dictionary  Entry No.  ID
> AMD         1          AMD-8000-1W
> AMD         2          AMD-8000-2F
> AMD         3          AMD-8000-3K
> AMD         4          AMD-8000-48
> AMD         5          AMD-8000-5M
> AMD         6          AMD-8000-67
> AMD         7          AMD-8000-7U
> AMD         8          AMD-8000-8L
> AMD         9          AMD-8000-9G
> AMD         10         AMD-8000-AV
> AMD         11         AMD-8000-C0
> AMD         12         AMD-8000-DT
> AMD         13         AMD-8000-E6
> AMD         14         AMD-8000-FN
> AMD         15         AMD-8000-G9
> ...
>
> Specific message and article detail content may also be displayed:
>
> # ermsg -a AMD-8000-G9
> Dictionary  Entry No.  ID
> AMD         15         AMD-8000-G9
>
> CPU errors exceeded acceptable levels
>
> Type
> Fault
>
> Severity
> Major
>
> Description
> The number of errors associated with this CPU has exceeded acceptable levels.
>
> Automated Response
> An attempt will be made to remove this CPU from service.
>
> Impact
> Performance of this system may be affected.
>
> Suggested Action for System Administrator
> Schedule a repair procedure to replace the affected CPU. Use fmdump -v -u
> <EVENT_ID> to identify the module.
>
> Details
> This message indicates that the Solaris Fault Manager has received a report
> from a CPU indicating that an uncorrectable Level 1 Data Translation
> Look-aside Buffer error has occurred, and a CPU fault has been diagnosed.
> System performance may have been affected. Faults of this nature typically
> result in a system reset and reboot.
> ...
>
> Similarly, FMA event class and payload specifications may also be displayed.
>
> # erevent -L "ereport.io.pci.*"
> ereport.io.pci.dpe -- Detected data parity error
> ereport.io.pci.dto -- Master never reissued read
> ...
>
> # erevent -a "ereport.io.pci.dpe"
> ereport.io.pci.dpe -- Detected data parity error
>
> Event Payload
> Name          Type      Description
> ENA           uint64_t  Error Numeric Association
> class         string    The event class
> detector      fmri      The resource that detected the error
> version       uint8_t   The major version of this event class
> pci-bdg-cntl  uint16_t  PCI bridge control register
> pci-command   uint16_t  PCI Local Bus configuration command register
> pci-pa        uint64_t  PCI errant physical address
> pci-status    uint16_t  PCI Local Bus configuration status register
>
> The OpenSolaris event registry source will be regularly updated to coincide
> with updates to message IDs at http://sun.com/msg. Community contributions
> to the event registry source will be permitted and sponsored for developers
> contributing fault management error handling and diagnosis software for
> hardware and software components that are FMA capable and aware.
>
>
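The pattern-based listing in the erevent example can be illustrated with a
minimal Python sketch. It assumes shell-style wildcard matching on event class
names, which the proposal does not specify. The list_events function and the
EVENT_CLASSES table are hypothetical names used only for illustration; they
are not part of the registry tools, and the two entries are copied from the
sample output quoted above.

```python
from fnmatch import fnmatchcase

# Event classes and summaries copied from the quoted erevent sample output.
# A real registry would hold many more entries; this table is illustrative.
EVENT_CLASSES = {
    "ereport.io.pci.dpe": "Detected data parity error",
    "ereport.io.pci.dto": "Master never reissued read",
}


def list_events(pattern, registry=EVENT_CLASSES):
    """Return sorted (class, summary) pairs whose class matches the pattern.

    Matching is case-sensitive and shell-style (an assumption), so "*"
    matches any run of characters, including dots.
    """
    return sorted(
        (name, summary)
        for name, summary in registry.items()
        if fnmatchcase(name, pattern)
    )


if __name__ == "__main__":
    # Print matches in the "class -- summary" layout used in the sample output.
    for name, summary in list_events("ereport.io.pci.*"):
        print(f"{name} -- {summary}")
```

A pattern that matches nothing, such as "ereport.cpu.*" against this
two-entry table, simply yields an empty list.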