Multiplexed I/O Enhancements to Support FMA [PSARC/2008/077 FastTrack timeout 02/13/2008]

Christopher Horne cth at sac.sfbay.sun.com
Fri Feb 1 13:55:27 PST 2008


I am sponsoring the following fasttrack for myself, requesting micro/patch
binding and a timeout of 2/13/2008.

-Chris


Template Version: @(#)sac_nextcase 1.64 07/13/07 SMI
This information is Copyright 2008 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
	 Multiplexed I/O Enhancements to Support FMA

    1.2. Name of Document Author/Supplier:
	 Author:  Chris Horne

    1.3  Date of This Document:
	01 February, 2008

4. Technical Description

   4.1	Problem

	This fasttrack covers new Multiplexed I/O [1] (MPXIO) related
	interfaces to support storage FMA efforts. FMA needs the
	following:

	A)  When a command associated with a scsi_pkt(9S) completes, it
	    must contain information about what physical hardware path
	    was involved in processing the request.

	B)  If an ereport is generated, it must contain a dev scheme
	    FMRI representation of the physical hardware path that
	    persists across reboot, and is independent of mpxio
	    enable/disable.

	C)  If an ereport class is associated with a storage device
	    acting as an error detector, the ereport must contain
	    path-independent device identity information (devid). For
	    such device-as-detector errors, a diagnostic engine
	    (eversholt) is expected to map the ereport to the fmd(1M)
	    libtopo storage topology [11] using the devid.

	D)  If an ereport class is associated with a transport failure,
	    fma code must be able to model the configuration and have
	    APIs available to further explore the fault boundaries
	    related to specific paths to storage.

	Addressing these problems involves changes to the following
	areas of the ON code base: libdevinfo(3LIB), scsi_pkt(9S),
	uscsi(7i), mdi(9F), scsi_vhci(7d), fmd(1M),
	ddi_fm_ereport_post(9F), and fmdump(1M).

   4.2	Proposal

	The topology of a given storage configuration is the same
	independent of whether mpxio is enabled or disabled.
	Supporting a device under mpxio affects the interfaces used to
	discover, report, and explore the topology, but the fundamental
	topology itself does not change. With this in mind, it is
	important to be able to 'name' things independent of mpxio.
	For paths, this can be done by recognizing that an mdi pathinfo
	node, and its libdevinfo path node counterpart, can be
	represented by the same path string that would be used if the
	device were enumerated using a devinfo node.

	This proposal defines the string representation of a path to a
	pathinfo node, introduces the concept of "path_instance"
	associated with these paths, and provides new APIs to expose
	and utilize these two concepts.

	The proposal also addresses some inconsistencies in the
	libdevinfo(3LIB) path node interfaces, and seeks to promote the
	'path' node interfaces defined by [1] and 'devlink' libdevinfo
	interfaces defined by [6,13,14] from consolidation private to
	evolving.

	The proposal also covers a minor enhancement to the eversholt
	language and run-time environment as well as a new output
	filter option for fmdump(1M).

      4.2.1 Path String and Path Instance

	Each devinfo node in the device tree can be uniquely
	represented by a path_string, with a separate
	"/node_name[@unit_addr]" path_string component for each level
	in the tree. For each devinfo node that succeeds attach(9E),
	the I/O framework persists a unique "instance to path_string"
	mapping (/etc/path_to_inst file, includes driver name too).
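
	As an illustration of the "/node_name[@unit_addr]" composition
	described above, the following C sketch joins per-level
	components into a path_string. The struct and function names
	are hypothetical and are not interfaces of this case; the
	sketch only shows the shape of the string.

```c
#include <stdio.h>

/* One level of the device tree: "/node_name[@unit_addr]". */
struct path_component {
	const char *node_name;
	const char *unit_addr;	/* NULL or "" if there is no unit address */
};

/*
 * Join n components into a path_string. Returns 0 on success, or -1
 * if the result does not fit in buf (buf contents are then
 * unspecified).
 */
int
build_path_string(const struct path_component *pc, size_t n,
    char *buf, size_t buflen)
{
	size_t used = 0;

	if (buflen == 0)
		return (-1);
	buf[0] = '\0';
	for (size_t i = 0; i < n; i++) {
		int w;

		if (pc[i].unit_addr != NULL && pc[i].unit_addr[0] != '\0')
			w = snprintf(buf + used, buflen - used, "/%s@%s",
			    pc[i].node_name, pc[i].unit_addr);
		else
			w = snprintf(buf + used, buflen - used, "/%s",
			    pc[i].node_name);
		if (w < 0 || (size_t)w >= buflen - used)
			return (-1);
		used += (size_t)w;
	}
	return (0);
}
```

	Given the components (pci, 8,600000), (pci, 1), (SUNW,qlc, 5),
	(fp, 0,0) and (ssd, w50020f230000826a,4), the sketch produces
	/pci@8,600000/pci@1/SUNW,qlc@5/fp@0,0/ssd@w50020f230000826a,4.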

	A pathinfo node is a peer of a devinfo node, but currently
	lacks the formal definition and supporting interfaces to make
	this obvious. Just like a devinfo node, a pathinfo node has a
	"node_name" (inherited from the mpxio 'client' node) and a
	"unit_address" (_bus_addr). Given this, just like a devinfo
	node, a pathinfo node can have a path_string representation and
	a unique path_instance associated with the path_string.

	For a given device, the path_string representation of a
	pathinfo node is identical in value to the devinfo path_string
	had the device *not* been enumerated under MPXIO.

	The path_instance mechanism proposed will persist the
	"path_instance to path_string" mapping across the
	destruction/recreation of a pathinfo node (and across
	detach(9E)/re-attach(9E) of any parent) - keeping the
	path_instance value valid beyond any locking scope. However,
	unlike a devinfo node instance, a path_instance does not
	persist across reboot. The path_instance of a path node is
	available in a di_init(3DEVINFO) snapshot, and can be used to
	direct device access down a specific path using the proposed
	uscsi(7I) extensions.
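
	The persistence property can be pictured with a small
	in-memory table. The C sketch below is a model only: the names,
	the fixed table size, and the sequential assignment of values
	are assumptions made for illustration, not the proposed
	implementation. It shows a path_string keeping its
	path_instance when the same path is registered again, and the
	reverse lookup that mdi_pi_pathname_by_instance() is described
	as providing.

```c
#include <string.h>

#define	PI_TABLE_MAX	64
#define	PI_PATH_MAX	256

/* In-memory only: the table starts empty on every boot. */
static char pi_table[PI_TABLE_MAX][PI_PATH_MAX];
static int pi_count;

/*
 * Return the path_instance for path_string, assigning the next free
 * value on first use. Values start at 1 so that 0 stays free to mean
 * "no specific path". Returns 0 if the table is full or the string
 * is too long.
 */
int
path_instance_get(const char *path_string)
{
	for (int i = 0; i < pi_count; i++) {
		if (strcmp(pi_table[i], path_string) == 0)
			return (i + 1);
	}
	if (pi_count == PI_TABLE_MAX || strlen(path_string) >= PI_PATH_MAX)
		return (0);
	strcpy(pi_table[pi_count], path_string);
	return (++pi_count);
}

/* Reverse mapping; NULL for an unknown path_instance. */
const char *
path_instance_to_string(int path_instance)
{
	if (path_instance < 1 || path_instance > pi_count)
		return (NULL);
	return (pi_table[path_instance - 1]);
}
```

	Because entries are never removed, a value handed out earlier
	still resolves to its path_string after the pathinfo node it
	described has been destroyed and recreated.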

	The devinfo/pathinfo path-string representation remains the
	same independent of MPXIO enable/disable and provides a
	physical hardware orientation, which makes it an ideal
	enhancement to the current dev scheme FMRI representation used
	in ereports.

	NOTE: While a pathinfo node is similar to a devinfo node
	relative to path-string representation, there are also many
	distinguishing features such as:  a driver does not bind to a
	pathinfo node, and a pathinfo node does not have minor_nodes
	associated with it.

	NOTE: Past issues around different unit-address representation
	for devinfo.vs.pathinfo have been resolved (4953227,6274205).

	NOTE: If needed, the path_instance mechanism could be extended
	in the future to apply to devinfo nodes too - providing a way
	to get the ddi_devpathname() of a device independent of the
	state of the devinfo node. One motivation for doing this would
	be if the 1K of stack space currently used by
	ddi_fm_ereport_post(9F) in the nosleep code path ever poses
	a stack-overflow problem. A devinfo path_instance would remove
	the need for that 1K (MAXPATHLEN) stack space. In addition the
	path_instance could be extended to keep both device path and
	device identity - so that different devices seen at the same
	path end up with a different device_instance. This would be
	useful for device paths that don't include identity information
	in the unit-address of the final path component. A change in
	the path_instance of a node might indicate the need to fence
	the access until the identity change is properly coordinated.

      4.2.2 Problems with scsi_pkt(9S)

	The scsi_pkt(9S) structure is the primary mechanism used for
	communication between a SCSA target driver and a SCSA HBA
	driver. This proposal extends the scsi_pkt(9S) structure to
	include path_instance.

	From the target driver perspective: at command issue time a
	non-zero pkt_path_instance requests a command to be sent down a
	specific path (zero indicates vHCI selects a path), and at
	pkt_comp() time the pkt_path_instance communicates which path
	was in fact selected. This means that the pkt_path_instance at
	pkt_comp() time can be used to generate the FMRI of the actual
	hardware path used. At command issue time the pkt_path_instance
	field allows the target driver to select a specific path: one
	of its own choosing, or the specific path indicated by a user
	level application using the uscsi(7I) extensions proposed.
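
	The issue-time and completion-time roles of pkt_path_instance
	can be sketched in C as follows. This is a model under stated
	assumptions: the struct, the function name, and the "first
	online path" policy are hypothetical, and the behavior when a
	requested path is unusable (the model fails the selection) is
	not specified by this case.

```c
#include <stddef.h>

/* Hypothetical model of a path known to the vHCI. */
struct model_path {
	int path_instance;	/* non-zero identifier */
	int online;		/* 1 if the path is usable */
};

/*
 * Pick the path for a command. requested == 0 lets the model vHCI
 * choose (here: the first online path); a non-zero value selects
 * exactly that path. The return value plays the role of the
 * pkt_path_instance seen at completion time: the path actually used,
 * or 0 if no usable path was found.
 */
int
model_select_path(const struct model_path *paths, size_t n, int requested)
{
	for (size_t i = 0; i < n; i++) {
		if (!paths[i].online)
			continue;
		if (requested == 0 || paths[i].path_instance == requested)
			return (paths[i].path_instance);
	}
	return (0);
}
```

	With paths 3 (offline), 10 and 17 (online), a request of 0
	completes on path 10, a request of 17 completes on path 17,
	and a request of 3 fails in this model.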

	The DDI does not allow a driver to allocate its own
	scsi_pkt(9S), and a driver should not have *any* compiled-in
	dependencies on "sizeof (struct scsi_pkt)": a driver that
	violates these rules limits SCSA's ability to evolve. The
	scsi_pkt allocation rules have been in place for many (>10)
	years; unfortunately, a significant number of drivers are still
	broken, making the scsi_pkt structure difficult to extend.

	As part of this work SCSA will be enhanced to detect HBA
	drivers with scsi_pkt(9S) allocation violations - printing a
	message like

	  WARNING: mpt: violates DDI scsi_pkt(9S) allocation rules

	for each driver found in violation (once per boot).

	CRs were filed against some broken HBA drivers, a long time ago
	(based on code inspection). Many of these CRs were just closed
	without ever addressing the problem (and some of the drivers
	may be EOLed at this point).

	http://monaco.sfbay.sun.com/detail.jsf?cr=5039931,5039932,5039934,5039935,5039936,5039937,5039938,5039941,5039942

	For initial nevada putback, the message above will be displayed
	(once per driver) for debug kernels. If we are unable to get
	enough drivers fixed, we may need to disable the messages
	completely.

	The best way of fixing scsi_pkt allocation violations is to
	change an HBA driver to use the tran_setup_pkt(9E) interfaces
	defined by [3,4]. If this proves difficult, we may need to
	implement a scsi_pkt_size(9F) peer of the buf(9S) biosize(9F)
	interface.

	To implement this a new scsi_pkt_allocated_correctly()
	interface is provided. While HBA drivers are being fixed,
	access to pkt_path_instance must be conditioned by calls to
	scsi_pkt_allocated_correctly(). For maximum flexibility, HBA
	drivers that enumerate under scsi_vhci (fcp, iscsi, mpt, ibsrp)
	should be fixed first.
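
	The guard pattern described above can be modeled in portable
	C. Everything in this sketch is hypothetical: the model_*
	names, the two-field packet, and in particular the tracking
	table used to recognize framework allocations. How SCSA
	actually detects a violation is not described in this document.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical packet; not the real scsi_pkt(9S) layout. */
struct model_pkt {
	unsigned int pkt_flags;
	unsigned int pkt_path_instance;	/* field added by the extension */
};

#define	MODEL_MAX_TRACKED	16

/* Packets handed out by the model framework (never untracked here). */
static const struct model_pkt *model_tracked[MODEL_MAX_TRACKED];
static size_t model_ntracked;

/* The only sanctioned way to obtain a packet in this model. */
struct model_pkt *
model_framework_alloc(void)
{
	struct model_pkt *p;

	if (model_ntracked == MODEL_MAX_TRACKED)
		return (NULL);
	p = calloc(1, sizeof (*p));
	if (p != NULL)
		model_tracked[model_ntracked++] = p;
	return (p);
}

/* Return 1 if the packet came from model_framework_alloc(). */
int
model_pkt_allocated_correctly(const struct model_pkt *p)
{
	for (size_t i = 0; i < model_ntracked; i++) {
		if (model_tracked[i] == p)
			return (1);
	}
	return (0);
}

/* Touch the new field only when the allocation is known to be good. */
int
model_set_path_instance(struct model_pkt *p, unsigned int pi)
{
	if (!model_pkt_allocated_correctly(p))
		return (0);
	p->pkt_path_instance = pi;
	return (1);
}
```

	A packet that a driver sized and allocated on its own is
	rejected by model_set_path_instance(), so the new field is
	never written into storage that may be too small to hold it.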


      4.2.3 Uscsi enhancements allow path selection by path_instance

	This proposal enhances uscsi(7I) to support path_instance based
	path steering via a new uscsi_path_instance "struct uscsi_cmd"
	field and a new USCSI_PATH_INSTANCE uscsi_flags bit. To
	preserve the last remaining uscsi_cmd field (uscsi_reserved_5)
	for future expansion, the input-only uscsi_path_instance field
	overlays the current output-only uscsi_resid field. The common
	scsi_uscsi_alloc_and_copyin() interface (6451061) is enhanced
	to only allow USCSI_PATH_INSTANCE operation to the
	scsi_vhci(7D) HBA, and a new scsi_uscsi_initpkt() is defined to
	isolate scsi_pkt_allocated_correctly() use. This is a compatible
	enhancement to uscsi(7I): the uscsi_path_instance field is only
	considered valid if the new USCSI_PATH_INSTANCE bit is set.
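
	The compatibility argument can be seen in a small C model of
	the overlay. The struct below is not "struct uscsi_cmd": the
	model_* names, the union, the field types, and the flag value
	are all assumptions chosen for illustration.

```c
#include <stddef.h>

#define	MODEL_USCSI_PATH_INSTANCE	0x1000	/* hypothetical bit value */

/* Hypothetical command block with an input field overlaying an output. */
struct model_uscsi_cmd {
	int flags;
	union {
		size_t resid;		/* output: residual byte count */
		int path_instance;	/* input: valid only with the flag */
	} u;
};

/*
 * Return the path_instance requested by the caller, or 0 for "no
 * preference". Without the flag the shared storage is ignored, so a
 * caller that never sets the new bit sees the old behavior even if
 * the field holds a stale residual.
 */
int
model_uscsi_requested_path(const struct model_uscsi_cmd *cmd)
{
	if ((cmd->flags & MODEL_USCSI_PATH_INSTANCE) == 0)
		return (0);
	return (cmd->u.path_instance);
}
```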

	The MPAPI implementation of path steering provided by [2] makes
	sense for MPAPI applications, but its existence does not
	preclude the implementation of other path steering mechanisms.
	The path steering mechanism provided by this case will be
	easier to use for di_init(3DEVINFO) snapshot consumers.


      4.2.4 ddi_fm_ereport_post/fm_dev_ereport_postv() interface enhancements

	The device tree is not represented by a single data structure
	with embedded type information - the way vnodes are. Instead,
	the device tree is composed of a number of different data
	structures linked together: devinfo nodes (directories VDIR),
	minor nodes (files VBLK/VCHR), and pathinfo nodes (a bit like a
	hardlink VLNK).

	The current ddi_fm_ereport_post(9F) interface is limited to
	just the concept of a devinfo node in conjunction with an
	fma_capable driver bound to the node. The proposal extends the
	set of dev scheme *_fm_ereport_post interfaces to cover the
	other types of nodes in the device tree and to better support
	nexus-child relationships.

	Nexus driver interfaces (ndi_*) are all private. When exposure
	is necessary, nexus concepts are abstracted via DDI
	interfaces. For example the scsi_device(9S) data structure is
	an abstraction of a leaf target-driver devinfo child below a
	SCSA HBA nexus driver.

	For fma, we need basic building-block ddi_fm_*() interfaces
	that can be leveraged when adding fma support to abstractions.
	This proposal delivers a private fm_dev_ereport_postv()
	'va_list' building-block interface, and converts
	ddi_fm_ereport_post() to use fm_dev_ereport_postv(). The
	fm_dev_ereport_postv() interface can support ddi_, ndi_,
	and abstracted callers. The abstracted callers may understand
	pathinfo node operations and device identity.

      4.2.5 Eversholt

	This fasttrack also adds a minor feature to the "Eversholt"
	language used to describe fault trees [8,9]. The change is
	required for disk FMA work [11] and will not cause any
	compatibility issues. The stability of the Eversholt language
	remains Sun Private and the release binding for the changes
	described here is micro/patch.

	On approval, a new version of the "Eversholt Language Manual"
	will be available [9]. The change adds a new optional property
	to Section 2.3.1.5 "Error Report Events" called
	'discard_if_config_unknown', with the following additions to
	the associated table and text:

	  Property		    Required or		Allowed
				    Optional		Types

	  discard_if_config_unknown Optional		Integer


	  The 'discard_if_config_unknown' property, when given a
	  non-zero value, tells the run-time environment that a failure
	  to associate an event with the current configuration should
	  result in the event being silently discarded.

	In addition to the Eversholt language change, the fmd eversholt
	run-time environment is enhanced to support the new property,
	and to implement ereport-configuration match based on devid.

	These changes allow kernel generated ereports that include
	devid information to map to the configuration topology based on
	devid. An ereport defined with 'discard_if_config_unknown' that
	fails to find a configuration mapping will silently be
	discarded. This allows us to support structured error logs
	(fmdump -e) on machine configurations where the full storage
	topology is not yet represented.
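
	The effect of the property on an incoming ereport reduces to
	a small decision, sketched here in C. The enum and function
	names are hypothetical; what the run-time does with an
	unmatched ereport that lacks the property is outside this
	case, so the model only labels that outcome.

```c
/* Hypothetical outcomes for an incoming ereport. */
enum model_disposition {
	MODEL_DIAGNOSE,		/* matched the configuration */
	MODEL_DISCARD,		/* unmatched, silently dropped */
	MODEL_UNMATCHED		/* unmatched, pre-existing handling applies */
};

/*
 * config_found: non-zero if the ereport mapped to the configuration
 * (for example by devid). discard_if_config_unknown: value of the
 * property on the ereport declaration, 0 if absent.
 */
enum model_disposition
model_ereport_disposition(int config_found, int discard_if_config_unknown)
{
	if (config_found)
		return (MODEL_DIAGNOSE);
	if (discard_if_config_unknown != 0)
		return (MODEL_DISCARD);
	return (MODEL_UNMATCHED);
}
```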

      4.2.6 Prtconf output

	Prtconf output is changed to show path_instance and path_string
	(the lines marked '+' in the output below).

	    Paths from multipath bus adapters:
          +   Path 3: /pci@8,600000/pci@1/SUNW,qlc@5/fp@0,0/ssd@w50020f230000826a,4
              fp#6 (online)
		...
          +   Path 10: /pci@8,600000/pci@1/SUNW,qlc@4/fp@0,0/ssd@w50020f230000826a,4
              fp#4 (online)
		...
          +   Path 17: /pci@8,700000/SUNW,emlxs@3,1/fp@0,0/ssd@w50020f230000826a,4
              fp#2 (online)
		...
          +   Path 24: /pci@8,700000/SUNW,emlxs@3/fp@0,0/ssd@w50020f230000826a,4
              fp#0 (online)
		...


      4.2.7 fmdump filter enhancement

	The fmdump(1M) CLI introduced by [12] is enhanced to provide a
	new '[-n name[.name]*[=value]]' output filter option. This
	filter option works on both fault logs (fmdump) and error logs
	(fmdump -e). It filters based on ereports having properties
	with the specified name as well as that property having the
	specified value. For string properties the value can be a
	regular expression. Support for embedded nvlist property
	filtering is provided by using a name that crosses multiple
	levels, with each level separated by a '.'. Examples:

	  # fmdump -e | wc -l
	  5
	  # fmdump -en detector.devid | wc -l
	  4
	  # fmdump -en \
	    'detector.devid=id1,sd@THITACHI_DK32EJ-36NC_____433H8282' | wc -l
	  2
	  # fmdump -en 'detector.devid=.*HITACHI.*' | wc -l
	  2
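
	The dotted-name lookup and value match can be sketched in C
	with POSIX regex(3). This is a model, not fmdump source: the
	model_nv structure is a hypothetical stand-in for a nested
	name/value list, and requiring the pattern to match the whole
	string value is an assumption suggested by the '.*HITACHI.*'
	example rather than something this case states.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a nested name/value list entry. */
struct model_nv {
	const char *name;
	const char *value;		/* string value, or NULL */
	const struct model_nv *child;	/* embedded list, or NULL */
	size_t nchild;
};

/* Resolve a dotted name such as "detector.devid"; NULL if absent. */
const struct model_nv *
model_nv_lookup(const struct model_nv *list, size_t n, const char *name)
{
	const char *dot = strchr(name, '.');
	size_t len = (dot != NULL) ? (size_t)(dot - name) : strlen(name);

	for (size_t i = 0; i < n; i++) {
		if (strlen(list[i].name) != len ||
		    strncmp(list[i].name, name, len) != 0)
			continue;
		if (dot == NULL)
			return (&list[i]);
		if (list[i].child == NULL)
			return (NULL);
		return (model_nv_lookup(list[i].child, list[i].nchild,
		    dot + 1));
	}
	return (NULL);
}

/*
 * Return 1 if the record passes a "name[=value]" style filter.
 * pattern == NULL tests only for presence of the property; otherwise
 * pattern is a POSIX extended regular expression that must match the
 * entire string value.
 */
int
model_filter_match(const struct model_nv *list, size_t n,
    const char *name, const char *pattern)
{
	const struct model_nv *nv = model_nv_lookup(list, n, name);
	regex_t re;
	regmatch_t m;
	int ok;

	if (nv == NULL)
		return (0);
	if (pattern == NULL)
		return (1);
	if (nv->value == NULL || regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (0);
	ok = regexec(&re, nv->value, 1, &m, 0) == 0 &&
	    m.rm_so == 0 && (size_t)m.rm_eo == strlen(nv->value);
	regfree(&re);
	return (ok);
}
```

	In this model a record whose "detector" list holds a "devid"
	string passes the filters "detector.devid" and
	"detector.devid=.*HITACHI.*" when the devid contains HITACHI,
	and fails "detector.devid=HITACHI" because that pattern does
	not cover the whole value.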

   4.4	Interface Tables

        ------------------------------------------------------------------------
	INTERFACES BEING REMOVED     Old
        Interface                    Level	Comments
        ------------------------------------------------------------------------
	libdevinfo(3LIB):
	  di_path_phci_path          Cons.Priv. ARCed and defined in
						libdevinfo.h, but never
						implemented. It is
						better to match
						di_devfs_path()
						structure with new
						di_path_devfs_path()
						below

	  di_path_client_path        Cons.Priv. same, see
						di_path_client_devfs_path()
						below.

	  di_path_addr		     Cons.Priv. Currently unused.
						Mismatch with di_bus_addr()
						peer. See di_path_bus_addr()
						below.

        ------------------------------------------------------------------------
	INTERFACES BEING RENAMED     Old Level	Comments
	AND PROMOTED                 New Level
	Interface                     
        ------------------------------------------------------------------------
	libdevinfo(3LIB):
	  (old)di_path_next_phci->   Cons.Priv. Confusing name: caller
	  (new)di_path_client_next_path         starts with a *client*
				     Evol.	dip (not a phci dip)
						and iterates through
						paths associated with
						*client*. Also, a
						client can have
						multiple paths
						associated with the
						same phci.

	  (old)di_path_next_client-> Cons.Priv. Confusing name: caller
	  (new)di_path_phci_next_path           starts with a *phci*
				     Evol.      dip (not a client) and
						iterates through paths
						associated with phci.

        ------------------------------------------------------------------------
	INTERFACES BEING PROMOTED 
        Interface			Level	Comments
        ------------------------------------------------------------------------
	libdevinfo(3LIB): (from PSARC/1999/647 [1]: Cons.Priv. -> Evol.):

	   DINFOPATH			Evol.	di_init() flag to snapshot path

	   					get . associated with path
	   di_path_client_node		Evol.	.client
	   di_path_phci_node		Evol.	.phci

	   					get path property as . array
	   di_path_prop_bytes		Evol.	.byte
	   di_path_prop_int64s		Evol.	.int64
	   di_path_prop_ints		Evol.	.integer
	   di_path_prop_strings		Evol.	.string

	   					look for property as . array
	   di_path_prop_lookup_bytes	Evol.	.byte
	   di_path_prop_lookup_int64s	Evol.	.int64
	   di_path_prop_lookup_ints	Evol.	.integer
	   di_path_prop_lookup_strings	Evol.	.string

	   di_path_prop_name		Evol.	get name of path property
	   di_path_prop_type		Evol.	get type of path property

	   di_path_prop_next		Evol.	walk path properties

	   di_path_state		Evol.	get path state
	   DI_PATH_STATE_FAULT		Evol.	failed, not currently in use
	   DI_PATH_STATE_OFFLINE	Evol.	not available to rcv/xmit data
	   DI_PATH_STATE_ONLINE		Evol.	available to rcv/xmit data
	   DI_PATH_STATE_STANDBY	Evol.	up, but not currently in use

        ------------------------------------------------------------------------
	INTERFACES BEING PROMOTED 
        Interface			Level	Comments
        ------------------------------------------------------------------------
	libdevinfo(3LIB): (PSARC/2000/310 [13]: Cons.Priv. -> Evol.):
			  (PSARC/2002/239 [14]: Cons.Priv. -> Evol.):

	   di_devlink_init()            Evol.   Obtain a snapshot of
						the devlink database.
	     di_devlink_handle_t	Evol.   opaque handle to snapshot.
	     DI_MAKE_LINK               Evol.   Update /dev
	   di_devlink_fini()            Evol.   Destroy snapshot.

	   di_devlink_walk()            Evol.   Walk links in snapshot.
	     di_devlink_t	        Evol.	opaque handle to devlink.

	   di_devlink_path()		Evol.	Get devlink path.
	   di_devlink_content()		Evol.	Get devlink contents.
	   di_devlink_type()		Evol.	Get devlink type.
	     DI_PRIMARY_LINK            Evol.   devlink to /devices
	     DI_SECONDARY_LINK          Evol.   devlink to devlink

	   di_devlink_dup()		Evol.	Copy a devlink object
	   di_devlink_free()		Evol.	Free a devlink object

        ------------------------------------------------------------------------
	NEW INTERFACES
        Interface                       Level	Comments
        ------------------------------------------------------------------------
	libdevinfo(3LIB):
	   di_path_node_name()          Evol.	di_path_t peer of
						di_node_t oriented
						di_node_name().

	   di_path_bus_addr()	        Evol.	di_path_t peer of
						di_node_t oriented
						di_bus_addr().  Also fixing
						CR6284426 "di_path_addr
						should have its second
						argument removed".

	   di_path_instance()           Evol.	di_path_t peer of
						di_node_t
						di_instance().

	   di_path_devfs_path()         Evol.	di_path_t peer of
						di_node_t
						di_devfs_path().

	   di_path_client_devfs_path()
				        Evol.	di_path_t peer of
						di_node_t
						di_devfs_path().

	   di_path_private_get()        Evol.	di_path_t peer of
						di_node_t
						di_node_private_get().

	   di_path_private_set()        Evol.	di_path_t peer of
						di_node_t
						di_node_private_set().


	   di_lookup_node()          Cons.Priv. find di_node_t in
						snapshot given a
						/devices path.

	   di_lookup_path()          Cons.Priv. di_path_t peer of
						di_node_t
						di_lookup_node().

	   di_devfs_path_match()     Cons.Priv. check to see if two
						/devices paths are the
						same, ignoring any
						generic.vs.non-generic
						node name mismatches.

	uscsi(7I):
	   .uscsi_path_instance      Cons.Priv. new "struct uscsi_cmd" field
						path_instance to send
						command down if
						USCSI_PATH_INSTANCE
						set.

	   USCSI_PATH_INSTANCE       Cons.Priv. new uscsi_flags bit,
						send command down
						specific path

	   scsi_uscsi_initpkt()      Cons.Priv. New uscsi setup
						interface for target
						driver implementing
						uscsi path_instance.

	scsi_pkt(9S):
	   .pkt_path_instance        Cons.Priv. new scsi_pkt(9S) field holding
						path_instance.

	   FLAG_PATH_INSTANCE        Cons.Priv. new pkt_flags bit, send
						command down specific
						path.

	   FLAG_PATH_INSTANCE_RPT    Cons.Priv. new pkt_flags bit,
						path_instance reports
						path used.

	   scsi_pkt_pathname()       Cons.Priv. Given scsi_pkt and
						scsi_device, return
						/devices path_string
						used to send command.

	mdi:
	   .pi_path_instance         Cons.Priv. pathinfo node's path_instance

	   MDI_SELECT_PATH_INSTANCE  Cons.Priv. New mdi_path_select()
	   					method.

	   mdi_pi_get_path_instance()Cons.Priv. Return path_instance
						given pathinfo.

	   mdi_pi_pathname()         Cons.Priv. pathinfo peer of
						devinfo
						ddi_pathname().

	   mdi_pi_pathname_by_instance()	
				     Cons.Priv. Return path_string
						given path_instance.

	fma:
	   ndi_fm_ereport_post()     Cons.Priv. ndi form of
						ddi_fm_ereport_post(9F).

	   fm_dev_ereport_postv()    Cons.Priv. Common implementation
						code (and used by
						implementation of
						scsi_fm_ereport_post()).

	eversholt:

	   discard_if_config_unknown Private    optional property of
						.esc 'ereport'
						declaration.


	prtconf(1M):

	   prtconf -v (output)       Not.an.    show path_string
				     Interface

	fmdump(1M):

	   fmdump -n name[.name]*[=value]
				        Evol.	-n filter option.

	   fmdump -n (output)        Not.an.
				     Interface


    4.5 References

        [1] Multiplexed I/O Framework
            http://sac.sfbay/PSARC/1999/647
            http://www.opensolaris.org/os/community/arc/caselog/1999/647

	[2] mpxio path steering (MPAPI)
            http://sac.sfbay/PSARC/2006/621
            http://www.opensolaris.org/os/community/arc/caselog/2006/621

	[3] new scsi_hba_tran entry points (scsi_pkt)
	    http://sac.eng.sun.com/PSARC/2005/680/mail
            http://www.opensolaris.org/os/community/arc/caselog/2005/680

	[4] scsa dma enhancement (scsi_pkt)
	    http://sac.eng.sun.com/PSARC/2006/240/mail
            http://www.opensolaris.org/os/community/arc/caselog/2006/240

  	[5] Dev scheme specification - Section 8.4.3
  	    http://fma.eng/documents/engineering/protocol_whtppr.pdf

	[6] libdevinfo reimplementation
	    http://sac.sfbay/PSARC/1997/127/commit.materials/devinfo.pdf

	[7] MDI/pHCI/libdevinfo Extensions for SNIA MPAPI support
	    http://sac.sfbay/PSARC/2005/646

	[8] Eversholt Diagnosis Technology
	    http://sac.eng/PSARC/2003/428

	[9] Eversholt Language Manual (Version 1.5 10/04/06)
	    http://eversholt.central/docs/language/

	[10]Generic Topology for Internal Disks
	    http://sac.sfbay/PSARC/2007/388/mail
	    http://wikihome.sfbay/fma-portfolio/Wiki.jsp?page=2007.016.DiskTopology

	[11]Unified Disk FMA
	    http://wikihome.sfbay/fma-portfolio/Wiki.jsp?page=2007.015.UnifiedDisk

	[12]Solaris Fault Management Daemon
	    http://sac.sfbay/PSARC/2003/089/

	[13]libdevinfo devlinks interfaces
	    http://sac.sfbay/PSARC/2000/310

	[14]Devlink Creation Enhancements
	    http://sac.sfbay/PSARC/2002/239

    4.6 Man page changes

	The materials directory in the case directory has information
	for the following new/changed man pages.

	Deliver (Evolving):
	    di_devfs_path.3devinfo.diff
	    di_devlink_dup.3devinfo
	    di_devlink_init.3devinfo
	    di_devlink_path.3devinfo
	    di_devlink_walk.3devinfo
	    di_init.3devinfo.diff
	    di_lnode_private_set.3devinfo.diff
	    di_path_info.3devinfo
	    di_path_next.3devinfo
	    di_path_prop_access.3devinfo
	    di_path_prop_lookup.3devinfo
	    di_path_prop_next.3devinfo
	    fmdump.1m.diff
	    libdevinfo.3lib.diff

	    Eversholt.index.html.diff.txt

	Information in Case directory (Cons.Priv.)	
	    uscsi.7i.diff


6. Resources and Schedule
    6.4. Steering Committee requested information
   	6.4.1. Consolidation C-team Name:
		ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open



