zpool failmode property [PSARC/2007/567 FastTrack timeout 10/08/2007]
Gary Winiger
gww at eng.sun.com
Mon Oct 1 22:08:54 PDT 2007
This never made it to the case directory. As I still have it, I'm resending.
Gary..
=======
> From psarc-member-list-request at Sun.COM Mon Oct 1 14:45:21 2007
> Date: Mon, 01 Oct 2007 15:42:48 -0600 (MDT)
> From: Mark Maybee <maybee at moosylvania.central.sun.com>
> Subject: zpool failmode property [PSARC/2007/567 FastTrack timeout 10/08/2007]
> To: PSARC-ext at Sun.COM
> Cc: zfs-eng at Sun.COM
> Content-transfer-encoding: 7BIT
> X-PMX-Version: 5.2.0.264296
>
>
> Template Version: @(#)sac_nextcase 1.64 07/13/07 SMI
> This information is Copyright 2007 Sun Microsystems
> 1. Introduction
> 1.1. Project/Component Working Name:
> zpool failmode property
> 1.2. Name of Document Author/Supplier:
> Author: George Wilson
> 1.3 Date of This Document:
> 01 October, 2007
> 4. Technical Description
>
> The stability of this new property is committed, and the release binding
> is patch/micro.
>
> A. SUMMARY
>
> This case adds a new pool-level property, 'failmode', to the existing zpool
> property infrastructure (PSARC 2007/342).
>
> B. PROBLEM
>
> ZFS was designed to panic a system in the event of a catastrophic write failure
> to a pool. From the initial design, non-replicated writes and critical reads
> that could not be satisfied resulted in a system panic. This behavior was
> commonly seen during error injection testing, as administrators would often
> remove all paths to their storage arrays.
>
> Additionally, customers who configured ZFS on large HW RAID arrays, like
> Hitachi, found themselves in a more susceptible situation since these pools
> were typically created from a single large LUN rather than a series of
> replicated devices. This meant that a single write failure on this type of
> pool would result in a system crash.
>
> C. PROPOSED SOLUTION
>
> This solution introduces a new pool property, "failmode", which allows the
> administrator to determine the recovery behavior in the event of such a
> failure. The property can be set to one of three options: "wait", "continue",
> or "panic".
>
> The default behavior will be to "wait" for manual intervention before
> allowing any further I/O attempts. Any I/O that was already queued would
> remain in memory until the condition is resolved. This error condition can
> be cleared by using the 'zpool clear' subcommand, which will attempt to resume
> any queued I/Os.
>
> The "continue" mode returns EIO to any new write request but attempts to
> satisfy reads. Any write I/Os that were already in-flight at the time
> of the failure will be queued and may be resumed using 'zpool clear'.
>
> Finally, the "panic" mode provides the existing behavior that was explained
> above.
>
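As an illustration only (this is not code from the case), the three modes
described above can be summarized as a small shell function. The function name
failmode_action and the request labels read, new-write, and inflight-write are
hypothetical; the outcomes are taken from the descriptions above.

```shell
# Hypothetical helper, not part of ZFS: maps a failmode setting and a kind of
# I/O request to the outcome described in the proposal.
failmode_action() {
    mode=$1   # wait | continue | panic
    kind=$2   # read | new-write | inflight-write
    case "$mode" in
        wait)
            # All I/O blocks until the errors are cleared with 'zpool clear'.
            echo "block" ;;
        continue)
            case "$kind" in
                read)           echo "attempt" ;;  # reads are still attempted
                new-write)      echo "EIO" ;;      # new writes fail with EIO
                inflight-write) echo "block" ;;    # queued until 'zpool clear'
                *)  echo "unknown request kind: $kind" >&2; return 2 ;;
            esac ;;
        panic)
            # Existing behavior: console message and system crash dump.
            echo "panic" ;;
        *)
            echo "unknown failmode: $mode" >&2; return 2 ;;
    esac
}
```

For example, "failmode_action continue new-write" prints EIO, while
"failmode_action wait read" prints block.
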
> The syntax for setting the pool property utilizes the "set" subcommand defined
> in PSARC 2006/577:
>
> # zpool set failmode=continue pool
> # zpool create -o failmode=continue pool <vdev>
>
> D. MANPAGE DIFFS
>
> The following text will be added under the "Properties" section:
>
> failmode=continue | wait | panic
>
> Controls the system behavior in the event of catastrophic pool
> failure. This condition is typically a result of a loss of connectivity
> to the underlying storage device[s] or a failure of all devices
> within the pool. The behavior in such an event is
> determined as follows:
>
> wait Block all I/O access until the device connectivity
> is recovered and the errors are cleared. This is the
> default behavior.
>
> continue Returns EIO to any new write I/O requests but allows
> reads to any of the remaining healthy devices. Any
> write requests that have yet to be committed to
> disk would block.
>
> panic Prints out a message to the console and generates
> a system crash dump.
>
>
> 6. Resources and Schedule
> 6.4. Steering Committee requested information
> 6.4.1. Consolidation C-team Name:
> ON
> 6.5. ARC review type: FastTrack
> 6.6. ARC Exposure: open
>
>