[fm-discuss] [Fwd: [osol-discuss] Project Proposal: Sensor Abstraction Layer for the Solaris Fault Manager]

Garrett D'Amore garrett at damore.org
Thu Apr 26 14:18:50 PDT 2007


I like the idea of this project.  But I'd like it even better if it 
could also broaden its scope just a little by adding some kind of 
control features.

In particular, a lot of platforms have fans for cooling, and the 
relationship between fans and sensors is often close.  (For 
example, the fans need to be turned on or have their speed adjusted 
based upon the temperature reported by thermal sensors.  Or, you really, 
really want to let the system administrator know if there is a fault in 
one of the fans that could ultimately lead to a thermal crisis.)

As another example, you might want to throttle a CPU (instead of 
shutting it off entirely) if thermal sensors peak beyond a certain 
threshold.  At a higher threshold, you might shut it off entirely or 
power off the system.
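
To make the two-threshold idea concrete, here is a minimal sketch in C.  
Everything in it is hypothetical: the names thermal_policy_t, 
thermal_action_t and thermal_decide() are made up for illustration and 
are not part of PICL, FMA or any existing Solaris interface, and real 
threshold values would have to come from the platform.

```c
/* Hypothetical actions a control layer might take; illustrative only. */
typedef enum {
	THERMAL_OK,		/* below both thresholds: do nothing */
	THERMAL_THROTTLE,	/* slow the CPU down but keep it running */
	THERMAL_SHUTDOWN	/* shut the CPU off or power off the system */
} thermal_action_t;

/* Hypothetical per-sensor policy: two thresholds in degrees Celsius. */
typedef struct {
	double throttle_c;	/* at or above this, throttle */
	double shutdown_c;	/* at or above this, shut down */
} thermal_policy_t;

/*
 * Map one temperature reading to an action.  The higher threshold is
 * checked first so that it always takes precedence.
 */
thermal_action_t
thermal_decide(const thermal_policy_t *p, double temp_c)
{
	if (temp_c >= p->shutdown_c)
		return (THERMAL_SHUTDOWN);
	if (temp_c >= p->throttle_c)
		return (THERMAL_THROTTLE);
	return (THERMAL_OK);
}
```

With a policy of { 85.0, 100.0 }, a reading of 90 degrees would map to 
THERMAL_THROTTLE and a reading of 100 or more to THERMAL_SHUTDOWN.  A 
real framework would presumably also want some hysteresis, so that a 
reading hovering around a threshold does not flip the action back and 
forth.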

Right now the picl library and other platform-specific hacks that are 
used to solve this are a little unsatisfactory.  I'd like to see an 
attempt to provide some kind of common central framework for this sort 
of thing.

I'd also like to think about ways to integrate this kind of work with 
the work being done by the battery and power management folks.  For 
example, an overtemp alert on a Li-ion battery is certainly a situation 
you want to know about!  You might also like to be aware of 
power-related sensors (voltage levels, current flow, etc.).  This is 
important information for the administrator.  While the battery team is 
focused on the mobile market, I think these issues have scope beyond 
it.  For example, if one of the redundant power supplies on a server in 
your data center is offline, you really don't want it to go unnoticed.

Just a thought.

    -- Garrett

cindi wrote:
> FYI
>
> ------------------------------------------------------------------------
>
> Subject:
> [osol-discuss] Project Proposal: Sensor Abstraction Layer for the 
> Solaris Fault Manager
> From:
> cindi <cindi at sun.com>
> Date:
> Thu, 26 Apr 2007 14:07:54 -0700
> To:
> opensolaris-discuss at opensolaris.org
>
>
> The Project
>
> This project proposes extensions to the fault management architecture 
> (FMA) to support a sensor abstraction layer for the collection and 
> analysis of sensor based telemetry that can be used in fault and 
> resource management.
>
> The Problem
>
> How do we manage raw telemetry data kept, maintained and exported by 
> disparate sources for the purposes of fault management, resource 
> management and budgeting?  Today, there are a number of sensor 
> collection mechanisms exported by the hardware and software.  For the 
> most part, the information they export is haphazardly presented and 
> accessed according to ad-hoc operating system interfaces, per-platform 
> methods or per-subsystem industry standards (SMBus, SMART and IPMI).  
> Using this data for fault or resource management is clumsy and 
> typically requires low-level system knowledge baked into higher-level 
> management applications.
>
> Key Objectives
>
> As part of an overall sensor abstraction layer based on our current 
> fault management architecture, we can solve the problem described 
> above and provide a better understanding of the overall health 
> and usage of a system through more sophisticated diagnosis 
> technologies and fine-grained observability of sensor data via common 
> access methods. A sensor abstraction layer must possess:
>
> 1. the ability to alert the administrator to conditions observed by
>    platform sensors that may impact the operational state of the
>    platform.
>
> 2. the ability to alert the administrator to conditions that resolve
>    themselves as observed by platform sensors.
>
> 3. the ability to watch one or more sensors and correlate the data for
>    predictive fault analysis or resource management.
>
> 4. the ability to continuously record sensor data and retrieve it from
>    systems for offline analysis, future system design or development of
>    more advanced diagnosis algorithms.
>
> 5. the ability for administrators and service personnel to manually
>    inspect sensor values without having to understand the exact
>    implementation (e.g. IPMI or SMBus).
>
> 6. the ability to connect sensor data to higher-level diagnosis (e.g.
>    SMART disk data to SCSI and ZFS diagnosis engines).
>
> 7. the ability to understand and observe performance and power budgets
>    based on raw sensor data.
>
> Cindi



