[fm-discuss] FMA/networking post-UV

Peter Memishian peter.memishian at sun.com
Tue Feb 19 21:35:06 PST 2008


Mike/Cindi,

As you may recall, one of the problems with "style 2" DLPI datalinks is
that opening them bypasses the FMA I/O retire checks in spec_open() (since
the kernel doesn't know what piece of hardware is actually being accessed
until the DL_ATTACH_REQ is done).

However, the /dev/net directory introduced by the recent Clearview UV
putback consists only of "style 1" DLPI links, which the spec_open()
checks correctly catch, causing ENXIO to be returned.  Since all libdlpi
applications check /dev/net first, these style-1 links are now preferred.
From a RAS standpoint, this is a marked improvement.  However, we've
already encountered a handful of systems where a network device that
apparently mostly worked (even though FMA had retired it) failed to open
with ENXIO after upgrading to the UV bits.  Of course, once the user runs
"fmadm faulty", everything falls into place -- but most users may not
connect the ENXIO error with FMA (especially since FMA may have done the
retire months ago).  I fear this will lead to support calls and
frustration.
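
To illustrate the diagnosability gap, here is a minimal sketch of what an
application could do today when opening a style-1 node fails.  It uses
plain open(2) rather than libdlpi, and the function names try_open_link()
and format_open_error() are hypothetical, not part of any existing
library.  Since ENXIO cannot distinguish a retired device from a missing
one, the best the caller can do is attach a hint.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Try to open a style-1 link node (for example, a path under /dev/net).
 * Returns 0 on success, or the errno value from open(2) on failure.
 */
int
try_open_link(const char *path)
{
	int fd = open(path, O_RDWR);

	if (fd == -1)
		return (errno);
	(void) close(fd);
	return (0);
}

/*
 * Format a diagnostic for a failed open.  ENXIO alone cannot tell a
 * retired device from an absent one, so the FMA hint is only a
 * suggestion to the user, not a diagnosis.
 */
void
format_open_error(const char *path, int err, char *buf, size_t buflen)
{
	assert(buf != NULL && buflen > 0);

	if (err == ENXIO) {
		(void) snprintf(buf, buflen, "%s: %s (if FMA retired the "
		    "device, \"fmadm faulty\" will list it)",
		    path, strerror(err));
	} else {
		(void) snprintf(buf, buflen, "%s: %s", path, strerror(err));
	}
}
```

A caller would pass the value returned by try_open_link() straight to
format_open_error() and print the resulting buffer.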

As such, I had a few points I wanted your input on:

	1. Has there been any discussion of a new errno for this case?
	   If we had a new errno, such as ERETIRED or EFAULTED, API
	   consumers could differentiate this case if appropriate, and
	   moreover strerror() could say something more helpful than "No
	   such device or address".

	2. It seems uneven to have retired networking hardware but not
	   have anything reported by dladm -- minimally, I'd think it
	   appropriate for show-phys to report this, and (given the
	   severity) maybe show-link as well.  (However, I don't want
	   dladm to impinge on fmadm's duties.)

	3. It worries me that in all the cases we've seen thus far, the
	   fault was "repaired" and never seen again.  Is this common, or
	   is this indicative of bugs in our fault detection code?
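
To make point 1 concrete, here is a sketch of how an API consumer might
branch if a dedicated errno existed.  ERETIRED is purely hypothetical:
no such errno is defined today, and the numeric value, the enum, and the
function names below are placeholders chosen only so the example
compiles.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* HYPOTHETICAL errno; the value is an arbitrary placeholder. */
#ifndef ERETIRED
#define	ERETIRED	10001
#endif

enum link_state {
	LINK_STATE_OK,		/* open succeeded */
	LINK_STATE_NO_DEVICE,	/* ENXIO: absent, or (today) retired */
	LINK_STATE_RETIRED,	/* hypothetical ERETIRED */
	LINK_STATE_OTHER	/* any other failure */
};

/* Map the errno from a failed open to a state the consumer can act on. */
enum link_state
classify_open_errno(int err)
{
	assert(err >= 0);

	if (err == 0)
		return (LINK_STATE_OK);
	if (err == ERETIRED)
		return (LINK_STATE_RETIRED);
	if (err == ENXIO)
		return (LINK_STATE_NO_DEVICE);
	return (LINK_STATE_OTHER);
}

/* Like strerror(), but with a specific message for the new errno. */
const char *
link_strerror(int err)
{
	if (err == ERETIRED)
		return ("Device retired by fault management");
	return (strerror(err));
}
```

With only ENXIO available, the LINK_STATE_RETIRED branch is unreachable,
which is exactly the ambiguity a new errno would remove.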

If these things have been discussed in the past, pointers are welcome.

Thanks,
-- 
meem

