[fm-discuss] [Fwd: [osol-discuss] Project Proposal: Sensor Abstraction Layer for the Solaris Fault Manager]
Cynthia McGuire
cindi at sun.com
Fri Apr 27 13:54:31 PDT 2007
Matty wrote:
> On 4/27/07, cindi <cindi at sun.com> wrote:
>>
>>
>> Bruce Shaw wrote:
>> >>> The Project
>> >>>
>> >>> This project proposes extensions to the fault management architecture
>> >
>> >>> (FMA) to support a sensor abstraction layer for the collection and
>> >>> analysis of sensor based telemetry that can be used in fault and
>> >>> resource management.
>> >>>
>> >
>> > Have you looked at some of the current sensor code in net-snmp?
>>
>> Yes. But maybe you're asking about how this project might complement
>> net-snmp? This project provides the underpinnings for collection and
>> analysis of raw telemetry from various sources (IPMI, SMBus, kstats,
>> SMART, etc...) for which you wouldn't necessarily want to issue an
>> immediate 'alert' or notification. The output of the analysis of the
>> telemetry could send an alert that is messaged to syslog and sent on as
>> an snmp trap much like we do with fault events today and the fmd(1M)
>> snmp agent. The output could also be an ereport sent on to the fault
>> manager for diagnosis of a predicted failure. The fault manager would
>> issue a diagnosis that would also be passed on as a SNMP trap.
>
> Hi Cindi,
>
> Will there be a way for admins to get at the sensor values? I would
> love to have a native "smartctl"-like utility for opensolaris (and if
> the SMART attributes are presented in an easily consumable form, this
> should be easy to write).
Hi Ryan,

The vision is not to expose admins to the raw sensor values at all, but
rather to an analysis of those values. For example, we could
write a SMART sensor provider that understands the basic access methods
for collecting SMART sensor values at some configurable periodic rate
for drives of interest.

That data is collected, recorded in a 'black box', and dispatched to a
SMART disk analyzer engine that understands, for example, how to
correlate read error rates, flying height and torque amplification counts
(I'm making this up for the sake of example) for a particular manufacturer's
disk drive. Upon determining that a failure is imminent, the analyzer
engine would generate an FMA ereport that is sent to the fault manager.
The fault manager, in turn, generates a fault diagnosis for that disk
drive. The administrator sees the fault diagnosis or resulting SNMP trap.

On the other hand, sensor values of 'pre-analyzed' data for predictive
failures handled by the disk firmware could also be collected via our
SMART provider. These values are also recorded in the black box but
there is no need for further analysis before we generate an ereport to
the fault manager, which issues a diagnosis.
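
To make the two paths above a little more concrete, here is a rough Python
sketch. It is only an illustration under assumptions: every name in it
(Reading, Ereport, Pipeline, toy_analyzer, the ereport class strings and the
threshold of 100) is invented for this example and is not part of any
existing FMA interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Reading:
    """One sample from a sensor provider (hypothetical structure)."""
    device: str
    attribute: str
    value: float
    pre_analyzed: bool = False  # True when disk firmware already predicts failure


@dataclass
class Ereport:
    """Stand-in for an FMA ereport; the class strings used here are invented."""
    device: str
    ereport_class: str


@dataclass
class Pipeline:
    analyzer: Callable[[List[Reading]], Optional[Ereport]]
    black_box: List[Reading] = field(default_factory=list)
    sent: List[Ereport] = field(default_factory=list)

    def collect(self, readings: List[Reading]) -> None:
        # Every reading is recorded in the black box, whichever path it takes.
        self.black_box.extend(readings)
        raw = []
        for r in readings:
            if r.pre_analyzed:
                # Firmware did the analysis already: go straight to an ereport.
                self.sent.append(
                    Ereport(r.device, "ereport.sketch.disk.firmware-predicted"))
            else:
                raw.append(r)
        if raw:
            # Raw telemetry is dispatched to the analyzer engine first.
            report = self.analyzer(raw)
            if report is not None:
                self.sent.append(report)


def toy_analyzer(readings: List[Reading]) -> Optional[Ereport]:
    # Made-up rule: a read error rate above 100 means failure is imminent.
    for r in readings:
        if r.attribute == "read_error_rate" and r.value > 100:
            return Ereport(r.device, "ereport.sketch.disk.predicted-failure")
    return None
```

In this sketch the `sent` list plays the role of the channel to the fault
manager; the diagnosis step itself is left out.
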
The black box recording of raw telemetry will probably be available as
an extension of fmdump(1M), but you will likely be better served by an
API that allows you to write your own analyzer engine to consume SMART
attribute values. This same API could be used by a disk manufacturer or
anyone with an understanding of how to correlate SMART attributes to
failures or performance impact. This approach permits new algorithms to
be deployed without a system reboot, a disk firmware upgrade, or taking
disks offline. Using the same model we use for the fault manager
diagnosis engines, analyzer engines are configured, loaded or unloaded
dynamically.
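
As a rough idea of what dynamically loading and unloading analyzer engines
could look like, here is a minimal Python sketch. The API does not exist
yet, so everything here is an assumption made for illustration: the
AnalyzerRegistry class, its method names, the analyzer signature, and the
vendor rule are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical analyzer signature: takes (attribute, value) samples and
# returns a verdict string, or None when nothing looks wrong.
Sample = Tuple[str, float]
Analyzer = Callable[[List[Sample]], Optional[str]]


class AnalyzerRegistry:
    """Holds the analyzer engines that are currently loaded."""

    def __init__(self) -> None:
        self._engines: Dict[str, Analyzer] = {}

    def load(self, name: str, engine: Analyzer) -> None:
        # Loading under an existing name replaces the old engine in place.
        self._engines[name] = engine

    def unload(self, name: str) -> None:
        # Unloading an engine that is not loaded is a no-op.
        self._engines.pop(name, None)

    def loaded(self) -> List[str]:
        return sorted(self._engines)

    def dispatch(self, samples: List[Sample]) -> List[str]:
        # Hand the same samples to every loaded engine, in name order.
        results = []
        for name in sorted(self._engines):
            verdict = self._engines[name](samples)
            if verdict is not None:
                results.append(f"{name}: {verdict}")
        return results


def vendor_a_analyzer(samples: List[Sample]) -> Optional[str]:
    # Invented rule, for illustration only.
    for attribute, value in samples:
        if attribute == "reallocated_sectors" and value > 50:
            return "failure predicted"
    return None
```

Deploying a new algorithm in this sketch is just another call to `load`
while the process keeps running, which is the property described above.
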
I hope that answered your question; it was a bit long-winded.
Cindi