Abandon the use of snapshots in mntfs. [PSARC/2009/352 FastTrack timeout 06/19/2009]
Garrett D'Amore
gdamore at sun.com
Fri Jun 12 07:13:50 PDT 2009
+1.
-- Garrett
Brian Utterback wrote:
> I am sponsoring this fasttrack on behalf of Robert Harris. The
> timeout is set to 06/19/2009. Requested binding is patch.
>
> Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
> This information is Copyright 2009 Sun Microsystems
> 1. Introduction
> 1.1. Project/Component Working Name:
> Abandon the use of snapshots in mntfs.
> 1.2. Name of Document Author/Supplier:
> Author: Robert Harris
> 1.3 Date of This Document:
> 12 June, 2009
> 4. Technical Description
> 1. Proposal:
>
> Abandon the use of snapshots in mntfs.
>
>
> 2. The Problem:
>
> The contents of /etc/mnttab are created by mntfs on demand.
> mntfs parses the in-kernel mnttab structures to create a
> snapshot that can be used to satisfy subsequent calls to
> read() or ioctl(). The snapshot is stored by the kernel
> within the address space of the process that made the first
> call to read() or ioctl(). The enclosing mapping is removed
> from the calling process's address space by mntfs on last
> close().
>
> The snapshot-in-userland design has a flaw: the kernel cannot
> determine whether a close() is a given process's last if the
> vnode reference count is greater than 1. This is because
> there is no way to determine whether a count greater than 1
> originated from dup(), from fork() or from both.
>
> This means that mntfs is unable to ensure that every
> insertion of a mapping into a process's address space is
> paired with a corresponding deletion. Two specific
> manifestations are 6394241, in which a newly-execed process
> has an arbitrary range of its address space unmapped by
> mntfs, and 6813502, in which a process address space is
> entirely consumed by orphaned mappings left behind by mntfs.
>
>
> 3. Solutions:
>
> The most obvious solution seemed, at first, to involve
> storing the snapshot data within the corresponding vnode,
> thereby allowing the existing file system infrastructure to
> free the resources when no longer required. This, however,
> was rejected on account of complications inherent in the
> unprivileged user's resulting ability to allocate and retain
> kernel memory.
>
> The only choice left has been to abandon the use of snapshots
> in their current form. This necessitates some minor changes
> to the behaviour of /etc/mnttab and its API, described in
> mnttab(4) and getmntent(3C).
>
> The current snapshot implementation means that, until a call
> to close() or resetmnttab(), clients reading /etc/mnttab will
> see those resources that were mounted at the time the
> snapshot was created, i.e. at the first read() or ioctl().
> Thus resources that have been unmounted in the intervening
> time will still appear to be present.
>
> With the proposed changes, a process will not see any
> resources that have been unmounted since the first call to
> read() or ioctl(), with one exception: if a call to read()
> terminates in the middle of a line, then the next read() will
> be obliged to consume the remainder of that line, even if the
> corresponding resource has been unmounted in the intervening
> time. This prevents the possibility of seemingly-garbled
> text.
>
> Note that where the remainder of a line is stored for
> possible later consumption, it is kept on the corresponding
> vnode's private structure.
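
The split-line rule described above can be illustrated with a small user-space
simulation. This is only a sketch, not the mntfs code: struct entry,
struct reader and sim_read() are hypothetical names invented here, and the
table of lines stands in for the in-kernel mount list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXLINE 128

/* Hypothetical stand-in for one mounted resource and its mnttab line. */
struct entry {
    const char *line;   /* full text of the line, including the newline */
    int mounted;        /* set to 0 once the resource has been unmounted */
};

/* Hypothetical per-open state, analogous to the vnode's private data. */
struct reader {
    size_t next;        /* index of the next entry to consider */
    char rem[MAXLINE];  /* saved remainder of a line split by a read */
    size_t remlen;      /* number of valid bytes in rem */
};

/* Copy up to n bytes into buf and return the number of bytes copied. */
size_t
sim_read(struct reader *r, const struct entry *tab, size_t ntab,
    char *buf, size_t n)
{
    size_t done = 0;

    /* First drain any remainder left behind by the previous call. */
    if (r->remlen > 0) {
        size_t k = r->remlen < n ? r->remlen : n;

        memcpy(buf, r->rem, k);
        memmove(r->rem, r->rem + k, r->remlen - k);
        r->remlen -= k;
        done = k;
    }

    while (done < n && r->next < ntab) {
        const struct entry *e = &tab[r->next++];
        size_t len, k;

        if (!e->mounted)
            continue;   /* unmounted in the meantime: not shown */
        len = strlen(e->line);
        assert(len < MAXLINE);
        k = len < n - done ? len : n - done;
        memcpy(buf + done, e->line, k);
        done += k;
        if (k < len) {  /* line was split: keep a private copy of the tail */
            memcpy(r->rem, e->line + k, len - k);
            r->remlen = len - k;
        }
    }
    return (done);
}
```

Because the tail of a split line is copied into the reader's own state, a later
unmount of that resource cannot garble the output; entries that were never
started are simply skipped once unmounted.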
>
>
> 4. Impact:
>
> 4.1 Overview:
>
> The current API includes an ioctl for obtaining the number of
> mounted resources within the snapshot (MNTIOC_NMNTS) and
> another ioctl for obtaining the major and minor numbers for
> these resources (MNTIOC_GETDEVLIST). The first ioctl is used
> to obtain the size of an array to pass to the second ioctl.
>
> Following the proposed changes, MNTIOC_NMNTS will return the
> number of resources currently mounted by the kernel.
> However, many of the mounted resources are usually hidden;
> they never appear during a read() of /etc/mnttab, and are
> visible to ioctl() only when specifically requested. The
> value returned by MNTIOC_NMNTS will therefore be viewed by
> the majority of consumers as an over-estimate of the number
> of mounted resources. Formally, the value obtained by
> MNTIOC_NMNTS will be defined as the upper limit on the number
> of mounted resources, and should be used only to determine
> the length of the array passed to MNTIOC_GETDEVLIST.
>
> MNTIOC_GETDEVLIST will, following the proposed changes,
> populate the supplied array with the major and minor
> numbers of only those mounted resources that are
> visible to the user. Typically, therefore, this will leave
> many entries in the supplied array undefined. With
> the proposed changes, the MNTIOC_GETDEVLIST ioctl()
> itself will return the number of mounted resources,
> and hence the number of meaningful entries in the
> supplied array. In the current mntfs implementation,
> an ioctl() for MNTIOC_GETDEVLIST does not employ
> its return value for anything other than to indicate
> an error.
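
The intended calling pattern (size the array from the upper limit, then trust
only the count returned by the second call) can be sketched with a
self-contained model. This does not call the real ioctls: struct res,
sim_nmnts() and sim_getdevlist() are hypothetical stand-ins written for this
illustration, and the hidden flag is an assumed simplification of how the
kernel marks resources that are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one mounted resource. */
struct res {
    uint32_t major;
    uint32_t minor;
    int hidden;         /* hidden resources are not reported to the user */
};

/* Models MNTIOC_NMNTS: an upper limit that counts hidden resources too. */
uint32_t
sim_nmnts(const struct res *tab, size_t ntab)
{
    (void) tab;
    return ((uint32_t)ntab);
}

/*
 * Models the proposed MNTIOC_GETDEVLIST: writes one (major, minor) pair per
 * visible resource, never more than limit pairs, and returns the number of
 * pairs written. Elements of devs beyond that are left untouched.
 */
int
sim_getdevlist(const struct res *tab, size_t ntab, uint32_t *devs,
    uint32_t limit)
{
    int n = 0;
    size_t i;

    assert(devs != NULL);
    for (i = 0; i < ntab && (uint32_t)n < limit; i++) {
        if (tab[i].hidden)
            continue;
        devs[2 * n] = tab[i].major;
        devs[2 * n + 1] = tab[i].minor;
        n++;
    }
    return (n);
}
```

A caller would allocate 2 * sim_nmnts() elements, pass the array to
sim_getdevlist(), and then iterate over only the first n pairs, where n is the
return value. Iterating up to the upper limit instead would read undefined
entries.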
>
> In theory, then, this change introduces a backwards
> incompatibility: existing code that uses MNTIOC_NMNTS and
> then MNTIOC_GETDEVLIST to obtain the major and minor numbers
> of mounted resources will find that the last entries are
> meaningless. However, MNTIOC_GETDEVLIST has not worked since
> S10 FCS: it now returns nonsense, as described in 6814666.
>
> Implementing the proposed changes calls for additions to the
> zone_t and vfs_t structs. The zone_t will acquire a pointer
> to an avl_tree_t, and the vfs_t will acquire a pointer to a
> newly-defined structure. The purpose is to allow each vfs_t
> to be stored in an AVL tree, sorted by a unique
> high-resolution time. This is to allow rapid location of the
> next available vfs_t in the mnttab table. If its predecessor
> were unmounted then there would be no vfs_next pointer to
> follow, and a linear search would otherwise be required from
> the start of the circularly-linked list.
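
The lookup that the time-sorted tree makes possible can be sketched as follows.
This is an illustration only: a sorted array with a binary search stands in
for the AVL tree, and struct mnt and next_after() are hypothetical names. The
point is that a reader which remembers the mount time of the last entry it
returned can find its successor in logarithmic time, even if that last entry
has since been unmounted and removed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one mounted file system keyed by its mount time. */
struct mnt {
    int64_t mtime;      /* unique high-resolution mount time */
    const char *mntpt;  /* mount point, for illustration only */
};

/*
 * Return the first element whose mount time is strictly greater than last,
 * or NULL if there is none. tab must be sorted by ascending mtime.
 */
const struct mnt *
next_after(const struct mnt *tab, size_t n, int64_t last)
{
    size_t lo = 0, hi = n;

    assert(tab != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].mtime <= last)
            lo = mid + 1;   /* successor lies to the right */
        else
            hi = mid;       /* tab[mid] is a candidate */
    }
    return (lo < n ? &tab[lo] : NULL);
}
```

Searching by key rather than following a pointer from the previous entry is
what removes the dependency on that entry still being present.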
>
> 4.2 Interface changes:
>
> 1. The MNTIOC_GETDEVLIST command is modified so that the
> calling ioctl() returns the number of mounted resources
> represented in the supplied array, which is the same
> as the number of visible resources mounted on the system.
> This interface will be Uncommitted.
>
> 2. The vfs struct acquires a new member, vfs_mntmeta, which
> is a pointer to a new, private structure with type
> 'struct vfs_mntmeta'. The new member and the private
> structure will constitute a Private interface.
>
> 3. The zone struct acquires a new member, zone_vfstree,
> which is a pointer to an avl_tree_t. The new member
> will constitute a Private interface.
>
>
> 5. Release binding:
>
> Patch.
>
>
> 6. Documentation impact:
>
> Changes to the mnttab(4) and getmntent(3C) man pages:
>
> *** mnttab.old Thu Jun 11 14:40:19 2009
> --- mnttab.new Thu Jun 11 14:38:09 2009
> ***************
> *** 47,66 ****
> IOCTLS
> The following ioctl(2) calls are supported:
>
> ! MNTIOC_NMNTS Returns the count of mounted resources
> ! in the current snapshot in the uint32_t
> ! pointed to by arg.
>
> ! MNTIOC_GETDEVLIST Returns an array of uint32_t's that is
> ! twice as long as the length returned by
> ! MNTIOC_NMNTS. Each pair of numbers is
> ! the major and minor device number for
> ! the file system at the corresponding
> ! line in the current /etc/mnttab
> ! snapshot. arg points to the memory
> ! buffer to receive the device number
> ! information.
>
> MNTIOC_SETTAG Sets a tag word into the options list
> for a mounted file system. A tag is a
> notation that will appear in the
> --- 47,87 ----
> IOCTLS
> The following ioctl(2) calls are supported:
>
> ! MNTIOC_NMNTS Obtains the upper limit on the number
> ! of mounted resources. arg points to a
> ! uint32_t; this will be set to the upper
> ! limit on the number of mounted
> ! resources that will be identified by a
> ! subsequent MNTIOC_GETDEVLIST.
>
> ! MNTIOC_GETDEVLIST Obtains the actual number of mounted
> ! resources, together with their major
> ! and minor numbers. arg points to an
> ! array of uint32_t's that must be at least
> ! twice as long as the length obtained by
> ! MNTIOC_NMNTS. The array will contain a
> ! pair of numbers for each mounted
> ! resource, comprising its major and
> ! minor numbers.
>
> + A resource will not be represented in
> + the array if it was mounted after the
> + preceding MNTIOC_NMNTS command. It is
> + an error to use MNTIOC_GETDEVLIST
> + without having first used MNTIOC_NMNTS.
> +
> + The number of mounted resources actu-
> + ally represented in the array will be
> + returned by the call to ioctl() itself.
> + The values of any remaining elements of
> + the array are undefined.
> +
> + A process that has used either
> + MNTIOC_NMNTS or MNTIOC_GETDEVLIST must
> + call resetmnttab(3C) before
> + getmntent(3C), getextmntent(3C) or
> + getmntany(3C).
> +
> MNTIOC_SETTAG Sets a tag word into the options list
> for a mounted file system. A tag is a
> notation that will appear in the
> ***************
> *** 101,109 ****
> location.
>
> EINVAL The tag specified in a MNTIOC_SETTAG call
> ! already exists as a file system option, or
> ! the tag specified in a MNTIOC_CLRTAG call
> ! does not exist.
>
> ENAMETOOLONG The tag specified in a MNTIOC_SETTAG call is
> too long or the tag would make the total
> --- 122,132 ----
> location.
>
> EINVAL The tag specified in a MNTIOC_SETTAG call
> ! already exists as a file system option, the
> ! tag specified in a MNTIOC_CLRTAG call does
> ! not exist or a request for MNTIOC_GETDEVLIST
> ! was made without a prior request for
> ! MNTIOC_NMNTS.
>
> ENAMETOOLONG The tag specified in a MNTIOC_SETTAG call is
> too long or the tag would make the total
> ***************
> *** 144,156 ****
> ments.
>
> NOTES
> ! The snapshot of the mnttab information is taken any time a
> ! read(2) is performed at offset 0 (the beginning) of the
> ! mnttab file. The file modification time returned by stat(2)
> ! for the mnttab file is the time of the last change to
> ! mounted file system information. A poll(2) system call
> ! requesting a POLLRDBAND event can be used to block and wait
> ! for the system's mounted file system information to be dif-
> ! ferent from the most recent snapshot since the mnttab file
> ! was opened.
>
> --- 167,204 ----
> ments.
>
> NOTES
> ! During a call to read(2) of /etc/mnttab, the corresponding
> ! in-kernel information cannot change. However, it will do so
> ! between successive calls to read(2) if, for example,
> ! resources are unmounted. The underlying file system, mntfs,
> ! implements two features to ensure that /etc/mnttab will con-
> ! tain sensible data even if there are changes to the in-
> ! kernel table of mounted resources.
> !
> ! Firstly, if a call to read(2) terminates only part of the
> ! way through a line, then the next call to read(2) will start
> ! by reading the remainder of the interrupted line, even if
> ! the corresponding resource has been unmounted in the inter-
> ! vening time.
> !
> ! Secondly, successive calls to read(2) will return 0 after
> ! reading the newest resource that was mounted at the time of
> ! the first call to read(2), even if, in the intervening time,
> ! additional resources have been mounted and are still
> ! present.
> !
> ! Following a rewind(3C) of /etc/mnttab, or a call to
> ! resetmnttab(3C), the next call to read(2) will be considered
> ! the first: any saved remainder will be discarded and all
> ! resources mounted at that time are eligible to be read by
> ! subsequent calls to read(2). /etc/mnttab does not support
> ! the use of a file offset for any purpose other than rewind-
> ! ing the file.
> !
> ! The file modification time returned by stat(2) for the
> ! mnttab file is the time of the last change to mounted file
> ! system information. A poll(2) system call requesting a
> ! POLLRDBAND event can be used to block and wait for the
> ! system's mounted file system information to be different
> ! from that at the time of the first read(2) of mnttab.
>
>
> *** getmntent.old Thu Jun 11 14:37:35 2009
> --- getmntent.new Thu Jun 11 14:41:24 2009
> ***************
> *** 40,51 ****
>
> Each getmntent() call causes a new line to be read from the
> mnttab file. Successive calls can be used to search the
> ! entire list. The getmntany() function searches the file
> ! referenced by fp until a match is found between a line in
> ! the file and mpref. A match occurs if all non-null entries
> ! in mpref match the corresponding fields in the file. These
> ! functions do not open, close, or rewind the file.
>
> getextmntent()
> The getextmntent() function is an extended version of the
> getmntent() function that returns, in addition to the infor-
> --- 40,58 ----
>
> Each getmntent() call causes a new line to be read from the
> mnttab file. Successive calls can be used to search the
> ! entire list, although mnttab entries added by the kernel
> ! after the first call to getmntent() will be ignored. Follow-
> ! ing a call to resetmnttab(), the next call to getmntent()
> ! will be considered the first: all resources mounted at that
> ! time will be eligible to be read by subsequent calls to
> ! getmntent().
>
> + The getmntany() function searches the file referenced by fp
> + until a match is found between a line in the file and mpref.
> + A match occurs if all non-null entries in mpref match the
> + corresponding fields in the file. These functions do not
> + open, close, or rewind the file.
> +
> getextmntent()
> The getextmntent() function is an extended version of the
> getmntent() function that returns, in addition to the infor-
> ***************
> *** 53,63 ****
> of the mounted resource to which the line in mnttab
> corresponds. The getextmntent() function also fills in the
> extmntent structure defined in the <sys/mnttab.h> header.
> ! For getextmntent() to function properly, it must be notified
> ! when the mnttab file has been reopened or rewound since a
> ! previous getextmntent() call. This notification is accom-
> ! plished by calling resetmnttab(). Otherwise, it behaves
> ! exactly as getmntent() described above.
>
> The data pointed to by the mnttab structure members are
> stored in a static area and must be copied to be saved
> --- 60,67 ----
> of the mounted resource to which the line in mnttab
> corresponds. The getextmntent() function also fills in the
> extmntent structure defined in the <sys/mnttab.h> header.
> ! Otherwise, it behaves exactly as getmntent() described
> ! above.
>
> The data pointed to by the mnttab structure members are
> stored in a static area and must be copied to be saved
> ***************
> *** 77,89 ****
> sition purposes.
>
> resetmnttab()
> ! The resetmnttab() function notifies getextmntent() to reload
> ! from the kernel the device information that corresponds to
> ! the new snapshot of the mnttab information (see mnttab(4)).
> ! Subsequent getextmntent() calls then return correct
> ! extmnttab information. This function should be called when-
> ! ever the mnttab file is either rewound or closed and reo-
> ! pened before any calls are made to getextmntent().
>
> RETURN VALUES
> getmntent() and getmntany()
> --- 81,91 ----
> sition purposes.
>
> resetmnttab()
> ! The resetmnttab() function causes the next call to
> ! getmntent(), getextmntent() or getmntany() to behave as
> ! though /etc/mnttab had just been opened. In addition, this
> ! function will have a similar effect on read(2); see
> ! mnttab(4) for more details.
>
> RETURN VALUES
> getmntent() and getmntany()
>
>
> 7. References:
>
> 1. CR 6394241 mntfs is not exec safe
>
> 2. CR 6813502 mntfs will leak mappings when called from a forking MT program.
>
> 3. CR 6814666 MNTIOC_GETDEVLIST produces nonsense
>
> 6. Resources and Schedule
> 6.4. Steering Committee requested information
> 6.4.1. Consolidation C-team Name:
> ON
> 6.5. ARC review type: FastTrack
> 6.6. ARC Exposure: open
>
>