2009/387 [Pathname Reparse Points]

Glenn Skinner glenn.skinner at sun.com
Fri Jul 10 09:34:26 PDT 2009


I'm sponsoring the following fast track for Afshin Salek and the CIFS
i-team.  It times out on Friday, July 17th.

A copy of the specification below appears in the case directory under
the name "specification".

I've pre-reviewed it and will give it a +1 up front.

		-- Glenn

----------------

Template Version: @(#)onepager.txt 1.35 07/11/07 SMI
Copyright 2007 Sun Microsystems

1. Introduction
   1.1. Project/Component Working Name:
        Support for Reparse Points

   1.2. Name of Document Author/Supplier:
        Author: Afshin Salek

   1.3. Date of This Document:
        07/08/09
	
   1.4. Name of Major Document Customer(s)/Consumer(s):
        PSARC
	CIFS team

   1.5. Email Aliases:
    	1.5.1. Responsible Manager: Barry.Greenberg at Sun.COM
    	1.5.2. Responsible Engineer: Afshin.Ardakani at Sun.COM
    	1.5.3. Marketing Manager:
	1.5.4. Interest List: cifs-team at sun.com

   A patch binding is requested for this change.	

4. Technical Description:
    4.1. Details:

       INTRODUCTION
	  
	 There are situations where a mechanism is needed to reflect
	 the concept that data is not present at a particular path, but
	 can be found in some alternate location(s).  Examples include
	 "referrals" used to build unified name spaces in NFSv4.x and
	 SMB, and data relocation in HSM systems.  A "reparse point" is
	 defined as the marker for a namespace redirection and a
	 container for the metadata to specify where the target of this
	 redirection is.
	  
	 Reparse points are intended to be a general mechanism for
	 location redirection and as such the file system that contains
	 them is not cognizant of the reparse point format or content.
	 Services that use reparse points know how to interpret and use
	 the stored data.
	  
       REPARSE POINT OBJECT
	  
	 After a lot of discussion, the consensus is that the best way
	 to represent reparse points in the file system, while
	 minimizing the effect on existing applications and utilities,
	 is to use symbolic links.  One of the main goals here has
	 been the ability to use existing utilities for backup/restore
	 and also ZFS send/receive without having to modify them to
	 know how to deal with reparse points.

	 Some of what is envisioned here could be done with extensions
	 to the Solaris automounter capability.  Part of the
	 motivation, though, is to create centrally-administrated
	 namespaces served by a group of fileservers to near-zero-admin
	 clients.  It is expected to be easier to keep the namespaces
	 uniform if only a small number of servers need to participate.
	 HSM solutions would also normally be tied closely to a storage
	 server by this mechanism.  Also, for both NFS and SMB
	 referrals, it is the client that chooses the target and not
	 the server.  The server only provides the targets' information
	 and it is up to the client to pick the desirable target to
	 access the data.

	 To distinguish a regular symlink from a reparse point, an
	 extensible system attribute will be set on the symlink.  This
	 system attribute is only one bit which indicates whether or
	 not a symlink contains reparse data.
	  
	 The reparse data will be stored as the link target.  The
	 reparse data is not in file system path format, which is the
	 typical format of a link target.  In order to avoid coming up
	 with a totally new format for reparse data as the link target
	 we decided to adopt the format used by magic links in BSD:
	 (http://www.daemon-systems.org/man/symlink.7.html)
	  
	 @{REPARSE@{service-type1:data} [@{service-type2:data}]...}
	  
	 Where some examples of service-type are:
       
	 #define REPARSE_SVC_SMB	"SMB"
	 #define REPARSE_SVC_NFS	"NFS"
	 #define REPARSE_SVC_HSM	"HSM"
	  
	 The data for each service will be in string format, which is
	 expected to be typically a UUID string.

	 The pattern above starts with "REPARSE" to distinguish it from
	 other magic links, such as those supported by BSD.  Note that
	 this case is not a proposal to support BSD magic links; the
	 intent is only to avoid precluding the future addition of
	 full BSD magic link support.
	  
	 Multiple service entries can co-exist within the symlink
	 data.  It is expected that normally, all entries would resolve
	 to the same logical location, e.g.  NFS and CIFS clients would
	 find the same files.
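
	 As an illustration of the token syntax above, here is a
	 minimal sketch of how a consumer might pull the data for one
	 service type out of a reparse string.  The function name
	 reparse_get_data() and the macro REPARSE_TAG_STR are
	 hypothetical, and the sketch assumes that the data itself
	 never contains "@{" or "}"; this case does not specify any
	 escaping rules.

```c
#include <stddef.h>
#include <string.h>

#define	REPARSE_TAG_STR	"@{REPARSE"	/* hypothetical name */

/*
 * Copy the data stored for service type `svc` into `buf`.
 * Returns 0 on success, or -1 if `target` is not a reparse string,
 * an entry is malformed, the service is absent, or `buf` is too small.
 * Assumes the data never contains "@{" or "}".
 */
int
reparse_get_data(const char *target, const char *svc, char *buf,
    size_t buflen)
{
	size_t taglen = strlen(REPARSE_TAG_STR);
	size_t svclen = strlen(svc);
	const char *p;

	if (strncmp(target, REPARSE_TAG_STR, taglen) != 0)
		return (-1);

	/* Walk the "@{type:data}" entries that follow the tag. */
	p = target + taglen;
	while ((p = strstr(p, "@{")) != NULL) {
		const char *type = p + 2;
		const char *colon = strchr(type, ':');
		const char *end = strchr(type, '}');

		if (colon == NULL || end == NULL || colon > end)
			return (-1);	/* malformed entry */

		if ((size_t)(colon - type) == svclen &&
		    strncmp(type, svc, svclen) == 0) {
			size_t datalen = (size_t)(end - colon - 1);

			if (datalen >= buflen)
				return (-1);
			memcpy(buf, colon + 1, datalen);
			buf[datalen] = '\0';
			return (0);
		}
		p = end + 1;
	}
	return (-1);
}
```

	 With a target of "@{REPARSE@{SMB:abc} @{NFS:def}}", a request
	 for "NFS" yields "def", while a request for "HSM" fails
	 because no such entry is present.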
	  
       BASIC INTERFACES
	  
	 There is a need for both userspace and kernel APIs to work
	 with reparse points.
	  
       Userspace API
	  
	 In userspace the symlink(2) system call will be used to set a
	 reparse point.  The readlink(2) system call will be used in
	 turn to read the reparse data.
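
	 A minimal userspace sketch of this usage follows.  The helper
	 names reparse_create() and reparse_read() are hypothetical
	 wrappers, not proposed interfaces; only symlink(2) and
	 readlink(2) are part of this case.  On a system without
	 reparse point support the same calls simply create and read
	 an ordinary (dangling) symbolic link.

```c
#include <sys/types.h>
#include <unistd.h>

/* Create a reparse point at `path` by storing `data` as the link target. */
int
reparse_create(const char *path, const char *data)
{
	return (symlink(data, path));
}

/*
 * Read the reparse data stored at `path` into `buf` and NUL-terminate
 * it, since readlink(2) does not.  Returns the number of bytes read or
 * -1 on error.  Data longer than buflen - 1 bytes is silently
 * truncated, as with readlink(2) itself.
 */
ssize_t
reparse_read(const char *path, char *buf, size_t buflen)
{
	ssize_t n;

	if (buflen == 0)
		return (-1);
	n = readlink(path, buf, buflen - 1);
	if (n < 0)
		return (-1);
	buf[n] = '\0';
	return (n);
}
```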
	  
       Kernel API
	  
	 In the kernel, VOP_SYMLINK and VOP_READLINK will be used to
	 set/get reparse data.
	  
	 Because these are the existing symlink interfaces,
	 replication, archive and copy operations will preserve
	 reparse points without further changes.
	  
	 fop_symlink() needs to be modified to recognize the reparse
	 @{REPARSE} tag and pass the appropriate attribute (i.e.
	 reparse system attribute) to VOP_SYMLINK to be set on the
	 symlink.
       
       IMPLEMENTATION OBSERVATIONS
	  
	 VFS feature registration can be used to determine whether or
	 not a file system supports reparse points.
	  
	 Two things are needed to obtain the reparse point data in the
	 kernel.  First, the consumer needs to know that a reparse
	 point has been encountered and, second, it needs the vnode
	 pointer to the symlink.  The proposal is to enhance VOP_LOOKUP
	 to return the attributes of the looked up vnode.  This way
	 when the vnode is available the caller can check the
	 attributes to determine if the returned vnode is a reparse
	 point or a regular symlink.  Here are the old and revised
	 signatures of VOP_LOOKUP:

	 int VOP_LOOKUP(vnode_t *dvp, char *nm, vnode_t **vpp,
	      pathname_t *pnp, int flags, vnode_t *rdir, cred_t *cr,
	      caller_context_t *ct, int *deflags, pathname_t *ppnp)

	 int VOP_LOOKUP(vnode_t *dvp, char *nm, vnode_t **vpp,
	      pathname_t *pnp, int flags, vnode_t *rdir, cred_t *cr,
	      caller_context_t *ct, int *deflags, pathname_t *ppnp,
	      vattr_t *vap)
	  
	 A vattr_t pointer argument is added at the end to return the
	 attributes if it is non-NULL.  This is an optimization so that
	 consumers don't have to invoke an extra VOP_GETATTR after
	 lookup for obtaining the attributes.
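
	 The calling convention can be sketched outside the kernel as
	 follows.  Everything here is a stand-in for illustration:
	 mock_vattr_t, mock_vnode_t, mock_lookup() and
	 lookup_is_reparse() are hypothetical and are not the real
	 vnode, vattr or VOP_LOOKUP definitions.  The sketch only shows
	 the optional out-parameter and a consumer that avoids a
	 second attribute fetch.

```c
#include <stddef.h>

/* Stand-in types for illustration only; not the kernel's definitions. */
typedef struct mock_vattr {
	int	va_is_symlink;	/* object is a symbolic link */
	int	va_reparse;	/* reparse attribute is set */
} mock_vattr_t;

typedef struct mock_vnode {
	mock_vattr_t	v_attr;
} mock_vnode_t;

/*
 * Mimics the proposed convention: attributes are returned only when
 * the caller passes a non-NULL vap.
 */
int
mock_lookup(mock_vnode_t *found, mock_vnode_t **vpp, mock_vattr_t *vap)
{
	*vpp = found;
	if (vap != NULL)
		*vap = found->v_attr;
	return (0);
}

/* Consumer side: a single lookup, with no follow-up getattr call. */
int
lookup_is_reparse(mock_vnode_t *found)
{
	mock_vnode_t *vp;
	mock_vattr_t va;

	if (mock_lookup(found, &vp, &va) != 0)
		return (0);
	return (va.va_is_symlink && va.va_reparse);
}
```

	 Callers that do not need the attributes pass NULL for the
	 last argument and see the same behavior as before.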

	 The symlink target size should be increased to 16K to
	 accommodate the maximum size supported for MS-DFS referrals by
	 Windows.  Applications are expected to query the PATH_MAX and
	 SYMLINK_MAX values on the local system using
	 pathconf(2)/fpathconf(2).  The value of SYMLINK_MAX would be
	 changed to 16K on ZFS.  The value of PATH_MAX will not be
	 affected.
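
	 For example, an application could size its readlink(2) buffer
	 along the following lines.  The helper name symlink_max_for()
	 and the fallback parameter are assumptions of this sketch;
	 pathconf(2) and _PC_SYMLINK_MAX are the standard interfaces.

```c
#include <unistd.h>

/*
 * Return the maximum symlink target length for the file system that
 * holds `dir`.  pathconf(2) returns -1 both on error and when the
 * limit is indeterminate; in either case `fallback` is returned.
 */
long
symlink_max_for(const char *dir, long fallback)
{
	long v = pathconf(dir, _PC_SYMLINK_MAX);

	if (v < 0)
		return (fallback);
	return (v);
}
```

	 A caller would then allocate symlink_max_for(dir, fallback) + 1
	 bytes so that the link contents can be NUL-terminated.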
            
	 To provide compatibility with other UNIXes (see "NFS REFERRAL
	 IN OTHER UNIXES" below), sharemgr(1M) would be enhanced to
	 support a "refer"
	 option for NFS exports.  This option would only result in
	 creation of a reparse point at the specified path and does not
	 actually share the path over NFS.
            
	 This case is only about the underlying infrastructure and a
	 future case will be presented to deal with details and
	 specifics of handling referrals for NFSv4 server.

       SECURITY CONSIDERATIONS
            
	 Referrals are similar to regular symbolic links in that they
	 are only pointers to data that could be discovered in some
	 other way.  The presence of such a pointer does not compromise
	 the security of the target object or data; the target service
	 or file system must still enforce security.
            
       OPERATION FLOW
            
	 Once a kernel service encounters a reparse point, it reads the
	 data using VOP_READLINK and passes the data up to a user space
	 daemon (e.g.  reparsed) along with its desired record type.
	 Depending on the requested record type the daemon could simply
	 extract the information from the passed data and return it to
	 the kernel, or do any other processing necessary to obtain the
	 actual referral information, e.g. contacting the NSDB in the
	 case of FedFS.  Going through a common user space daemon to
	 get the referral data makes this process generic and easily
	 expandable for possible future use cases.
            
	 Referral extraction and creation by a userspace daemon can be
	 handled via a library plugin architecture for different
	 service types.
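
	 One possible shape for such a plugin arrangement is a table
	 keyed by service type.  This is only a sketch: the types
	 reparse_handler_t and reparse_plugin_t and the functions
	 reparse_find_plugin() and reparse_passthrough() are
	 hypothetical, and a real daemon would more likely load its
	 plugins dynamically than use a static table.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical plugin entry point: turn stored data into referral text. */
typedef int (*reparse_handler_t)(const char *data, char *out,
    size_t outlen);

typedef struct reparse_plugin {
	const char		*rp_svc_type;	/* e.g. "SMB", "NFS" */
	reparse_handler_t	rp_handler;
} reparse_plugin_t;

/* Find the plugin registered for a service type; NULL if there is none. */
const reparse_plugin_t *
reparse_find_plugin(const reparse_plugin_t *tbl, size_t n, const char *svc)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(tbl[i].rp_svc_type, svc) == 0)
			return (&tbl[i]);
	}
	return (NULL);
}

/*
 * Simplest handler: the stored data already is the answer, so copy it
 * through.  Returns -1 if `out` is too small.
 */
int
reparse_passthrough(const char *data, char *out, size_t outlen)
{
	int n = snprintf(out, outlen, "%s", data);

	return ((n < 0 || (size_t)n >= outlen) ? -1 : 0);
}
```

	 A handler that needs outside help, such as the FedFS case
	 mentioned above, would do its additional lookups inside the
	 handler and return the result through the same interface.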
            
       Operation Flow Example
            
	 Here is a simplified example of operation for a CIFS client
	 that tries to access a file where the path contains a DFS
	 link:
            
	 a) Client tries to access \\srv\root\...\link\...\file.txt
	    where:
	       'root' is a share (namespace root)
	       'link' is a reparse point seen as a folder by client
	  
	 b) CIFS server does a VOP_LOOKUP for 'link', which is
	    recognized as a reparse point by examining the attributes
	    returned by VOP_LOOKUP.  At this point a
	    STATUS_PATH_NOT_COVERED is returned to the client.
	  
	 c) Client sends a "link referral" request to the server.  CIFS
	    server uses VOP_READLINK to get the 'link' data and sends
	    the data to 'reparsed' daemon via a door call and gets back
	    the DFS link targets in a format understandable by the CIFS
	    client.  The targets are sent back to the client in
	    response to its "link referral" request.
	  
	 d) Client picks one of the targets and contacts the target
	    server to access 'file.txt'.
	  
       NFS REFERRAL IN OTHER UNIXES
            
	 NFS referrals have been implemented in other major UNIX
	 distributions such as Linux, AIX and HP-UX but there is no
	 unified approach or implementation.

	 Linux, AIX and HP-UX specify referrals as an NFS export
	 option.  The option format is basically the same in all three
	 operating systems (refer=path@host) but the presentation is
	 somewhat different in each case:

	 - In Linux a referral is presented as a mount point.
	 - In HP-UX a referral is a file system partition or logical volume.
	 - In AIX a special object is used to represent a referral.

	 These are all mechanisms to trigger a change in namespace
	 while resolving a path.
      
	 This proposal is somewhat aligned with the AIX approach but
	 does not require a new object type to be defined, which has
	 the advantage of not impacting existing applications.  As
	 mentioned previously, an NFS "refer" option will be supported
	 to provide option format compatibility.
      
	 Additionally, the Solaris requirements include support for
	 both NFS and SMB referrals whereas these other operating
	 systems only support NFS referrals, and they do not provide
	 native SMB support.  For the Solaris operating system, this
	 proposal provides a generic solution to support multiple,
	 disparate referral mechanisms without placing restrictions on
	 the format required by each mechanism.
    
	 The following links provide more detail about each OS
	 discussed above:
            
         http://www.citi.umich.edu/projects/nfsv4/linux/using-referrals.html
         http://nfsv4.bullopensource.org/doc/migration-and-replication-0.2.pdf
         http://docs.hp.com/en/5900-0306/ch01s11.html?jumpid=reg_R1002_USEN
         http://docs.hp.com/en/13578/nfsv4_whitepaper.pdf 
         http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.commadmn/doc/commadmndita/nfs_referrals.htm 

 INTERFACE TABLE

                          |Proposed       |Specified   |
                          |Stability      |in what     |
  Interface Name          |Classification |Document?   | Comments
  ===========================================================================
   XAT_REPARSE            |Consolidation  |This        |Reparse extensible
                          |Private        |Document    |attribute
                          |               |            |
   VOP_LOOKUP, fop_lookup |Contracted     |This        |Added new argument:
                          |Consolidation  |Document    |vattr_t *vap 
                          |Private*       |            |
                          |               |            |
   Reparse token syntax   |Committed      |This        |
                          |Private        |Document    |
                          |               |            |
   SYMLINK_MAX            |Committed      |This        |Increased to 16K
                          |               |Document    |

 * The project's deliverables will all go into the OS/NET
   Consolidation, so no contracts are required.

6. Resources and Schedule:

   6.4. Product Approval Committee requested information:
   	6.4.1. Consolidation or Component Name:
	       ON

   6.5. ARC review type:
        FastTrack



