[celeste-discuss] some questions
Kevin Fox
kev at sun.com
Thu Sep 11 09:51:58 PDT 2008
Thanks for the detailed explanation, Glenn. I'm still trying to get my
head around the architecture/nomenclature, so bear with me. Just to make
sure I understand: in the current algorithm, N is equal to the total
number of replicas (the paranoid use case), and writes require that all
replicas are updated before succeeding, except for mutable objects,
whose writes succeed once a quorum of replicas has been stored.

Your next putback will allow modifying the number of replicas and the
quorum size for mutable objects on a per-file basis, and at some point
lazy replication policies could be implemented ... but are not currently
in the works ...

Is that correct?
Kev
On Sep 11, 2008, at 9:07 AM, Glenn Scott wrote:
> Hi Kevin,
>
> When a file is stored in Celeste, four different kinds of data objects
> are created: blocks (BObject), which contain the actual data of the
> file; the manifest (VObject), which records which blocks comprise the
> file; an "anchor" object (AObject), which is the top-level
> representative of the file for its lifetime; and one mutable object,
> which records the mapping of the AObject to the object-id of the
> current VObject.
>
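A minimal sketch of the four object kinds described above. It assumes
that "self-verifying" means an object's id is a hash of its content
(SHA-256 here); the thread does not say how Celeste derives object-ids,
and ImmutableObject, verify and store_file are hypothetical names used
only for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ImmutableObject:
    kind: str       # "A" (anchor), "V" (manifest) or "B" (block)
    payload: bytes

    @property
    def object_id(self) -> str:
        # Assumption: the id is a hash of the content, so any single copy
        # can be checked against the id that was used to request it.
        return hashlib.sha256(self.kind.encode() + b"\0" + self.payload).hexdigest()


def verify(expected_id: str, obj: ImmutableObject) -> bool:
    """Return True if obj really is the object named by expected_id."""
    return obj.object_id == expected_id


def store_file(data: bytes, name: str, block_size: int = 4):
    """Split data into blocks and build the manifest, anchor and pointer."""
    blocks = [ImmutableObject("B", data[i:i + block_size])
              for i in range(0, len(data), block_size)]
    # The manifest records which blocks make up the file.
    manifest = ImmutableObject("V", "\n".join(b.object_id for b in blocks).encode())
    # The anchor stands for the file for its whole lifetime.
    anchor = ImmutableObject("A", name.encode())
    # The one mutable object: anchor id -> id of the current manifest.
    pointer = {anchor.object_id: manifest.object_id}
    return anchor, manifest, blocks, pointer
```

Under this assumption a reader needs only one copy of an A, V, or B
object, because a corrupted or substituted copy fails verify(). Only the
pointer can change, which is why it alone needs a quorum.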
> The A, V, and B objects are all self-verifying and immutable. None of
> them requires a quorum interaction to determine the authoritative
> value. To get the content of any A, V, or B object it is sufficient to
> fetch just one copy. Getting or setting the value of the mutable
> object requires a quorum interaction, but it is done twice per file
> operation, not N times.
>
> When writing a file, Celeste will make the N copies of the A, V, and B
> objects before returning the result of the write to the client. This
> is how a single write turns into effectively N writes. The client is
> assured that when the operation returns, the file is stored and
> replicated.
>
> Now, there are clear ways to make improvements here.
>
> The most obvious is to control the value of N and optimise it for the
> Celeste system. Some Celeste systems need to be very paranoid, others
> don't. Using a paranoid value for N in a system that doesn't need it
> is wasteful. My next checkin rewrites the way we specify the
> replication parameters, giving each file's creator control over how
> many copies of each object type to store (including the size of the
> quorums for the mutable objects).
>
> Another is lazy replication. For example, you write a file and Celeste
> makes a single copy of each data object, then lazily replicates the
> objects over time. The risk is that a write may succeed from the
> client's perspective, yet a node holding the only copy of a data
> object fails permanently before that object is replicated.
>
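The trade-off between the current eager replication and lazy replication
can be sketched with a toy in-memory model. This is an illustration
under simplifying assumptions, not Celeste code: Cluster and its methods
are hypothetical, nodes are plain dictionaries, and a failure wipes the
node permanently.

```python
from collections import deque


class Cluster:
    def __init__(self, node_count: int):
        self.nodes = [dict() for _ in range(node_count)]
        self.alive = [True] * node_count
        self.pending = deque()  # (object id, target node) copies still owed

    def write_eager(self, oid: str, data: bytes, n: int) -> None:
        # Make all n copies before returning to the client.
        for node in self.nodes[:n]:
            node[oid] = data

    def write_lazy(self, oid: str, data: bytes, n: int) -> None:
        # Store one copy, return at once, and queue the other n - 1 copies.
        self.nodes[0][oid] = data
        self.pending.extend((oid, target) for target in range(1, n))

    def replicate_pending(self) -> None:
        # Background work: copy from any surviving replica to each target.
        while self.pending:
            oid, target = self.pending.popleft()
            data = self.read(oid)
            if data is not None and self.alive[target]:
                self.nodes[target][oid] = data

    def fail(self, index: int) -> None:
        # Permanent failure: the node and everything on it is gone.
        self.alive[index] = False
        self.nodes[index].clear()

    def read(self, oid: str):
        for alive, node in zip(self.alive, self.nodes):
            if alive and oid in node:
                return node[oid]
        return None
```

With write_eager, losing node 0 after the write returns is harmless. With
write_lazy, losing node 0 before replicate_pending runs loses the object
even though the client saw the write succeed.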
> Now, on the matter of how the quorums work: a write to a mutable
> object succeeds as soon as a quorum of the mutable object's replicas
> has been stored. The quorum mechanism initially creates a set of
> mutable object replicas. The number of these replicas is 3f+2b+1,
> where f is the maximum tolerable number of absent/failed replicas, and
> b is the maximum number of malevolent replicas.
>
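As a worked example of the 3f+2b+1 formula above, the following snippet
computes the number of mutable object replicas for a few settings.
replica_count is a hypothetical helper name; only the formula comes from
the message.

```python
def replica_count(f: int, b: int) -> int:
    """Replicas of the mutable object: 3f + 2b + 1.

    f: maximum tolerable number of absent/failed replicas
    b: maximum number of malevolent replicas
    """
    if f < 0 or b < 0:
        raise ValueError("f and b must be non-negative")
    return 3 * f + 2 * b + 1


for f, b in [(0, 0), (1, 0), (1, 1), (2, 1)]:
    print(f"f={f} b={b} -> {replica_count(f, b)} replicas")
# f=0 b=0 -> 1 replicas
# f=1 b=0 -> 4 replicas
# f=1 b=1 -> 6 replicas
# f=2 b=1 -> 9 replicas
```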
> A successful write must update at least a quorum number of replicas,
> lest a subsequent read select a different subset of the replicas and
> either have to work with an ambiguous value or miss the latest write
> entirely.
>
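The overlap argument above can be checked by brute force for small
replica sets. The sketch below assumes reads and writes use the same
quorum size q and ignores malevolent replicas; the thread does not state
the quorum size Celeste uses, so the values of q are illustrative only.

```python
from itertools import combinations


def min_overlap(n: int, q: int) -> int:
    """Fewest replicas that any two quorums of size q out of n must share."""
    return max(0, 2 * q - n)


def read_can_miss_write(n: int, q: int) -> bool:
    """True if some read quorum shares no replica with a write quorum."""
    # By symmetry it is enough to fix the write quorum to the first q replicas.
    write_quorum = set(range(q))
    return any(write_quorum.isdisjoint(read_quorum)
               for read_quorum in combinations(range(n), q))


n = 4  # 3f+2b+1 with f=1, b=0
print(read_can_miss_write(n, 2), min_overlap(n, 2))  # True 0
print(read_can_miss_write(n, 3), min_overlap(n, 3))  # False 2
```

With 4 replicas, a quorum of 2 lets a read pick the two replicas the
write never touched, while a quorum of 3 forces every read to see at
least 2 updated replicas.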
> Glenn
>
>
> On Sep 11, 2008, at 7:54 AM, Kevin Fox wrote:
>
>> Pardon my joining the discussion a bit late as I'm a new subscriber.
>> I had a question regarding the following:
>>
>> On Sep 9, 2008, at 11:47, Glenn Scott wrote:
>>
>>> Writes are the slowest because each write turns into about N writes,
>>> where N is the replication factor. Reads are much faster; we are able
>>> to read files at an aggregate of about 19Mbs. I know this because
>>> we've been playing movies stored in Celeste.
>>
>> Shouldn't a write succeed after a quorum of replicas have returned?
>> That would leave the remaining replicas to synchronize at their
>> leisure (e.g. if they were more remote).
>>
>> Thinking about it more, it should be possible to allow fewer than a
>> quorum of replicas if there were some guarantees about the state of
>> the non-synchronized replicas, but it would seem that a simple quorum
>> would be sufficient. Or are there some assumptions that the
>> Beehive/Celeste quorum algorithms rely on that do not allow that type
>> of replica relaxation?
>>
>> Kev
>>
>>> From: Glenn.Scott at sun.com (Glenn Scott)
>>> Date: Tue, 09 Sep 2008 11:47:20 -0700
>>> Subject: [celeste-discuss] some questions
>>>
>>> Hi Jure,
>>> Celeste is a fully supported project here in Sun Labs. A long-term
>>> roadmap is still in flux because we are learning new things about how
>>> people are using Celeste. I know the project still has a "fresh from
>>> the labs" feel, and the first item on a short-term roadmap is to get
>>> better documentation and interfaces available to make it easier to
>>> use. I have plans for at least two new interfaces: XML-RPC and WebDAV.
>>> They are under development now, but I'm not sure when they will be
>>> completed. I am inviting folks to help...
>>>
>>> Our original goal for the system to work in harsh (very harsh)
>>> environments has come at the cost of performance: Celeste is very
>>> "paranoid" about what it does and what it checks. I realise that we
>>> need a way to adjust this paranoia to let people use the system in
>>> less harsh environments.
>>>
>>> For example, until last weekend, Celeste would sign every write to a
>>> file and check every signature. This is to prevent or detect replays
>>> of old writes to a file by a malicious node that was previously
>>> involved in a successful write. If that risk is not a concern,
>>> there's really no need to spend the time performing these
>>> computations. So now you can create files that do not have signed
>>> writes, which gives some performance improvement.
>>>
>>> There is a LOT of room for performance improvements, but it will
>>> never perform at a comparable level to a local filesystem.
>>>
>>> Writes are the slowest because each write turns into about N writes,
>>> where N is the replication factor. Reads are much faster; we are able
>>> to read files at an aggregate of about 19Mbs. I know this because
>>> we've been playing movies stored in Celeste.
>>>
>>> It is possible to specify that data needs to be present at a minimum
>>> number of locations. We need to make this easier to use (it is the
>>> replication-parameter in the command line interfaces). You will
>>> specify the number of copies of objects and what technique to use. I
>>> will also include the ability to specify things like 1 copy 'nearby'
>>> and 2 'far away', which gives you more control over performance
>>> versus availability.
>>>
>>> Yes, there will definitely be control over which nodes get to join a
>>> Celeste system. We used to have a token device that needed to be
>>> present for a node to join, but that will not scale. The idea now is
>>> to require the SSL certificate of each permitted node to be present
>>> in a list.
>>>
>>> There is a lot more in store. Our main focus right now has been
>>> better documentation, interfaces, and control over the tradeoffs
>>> between performance and availability.
>>>
>>> Community members can influence priorities for Celeste development by
>>> making comments such as yours or -- better yet -- by making
>>> contributions to the system. If someone, for example, wanted to work
>>> on adding mechanisms to control which nodes can join a Celeste
>>> confederation, we would be happy (indeed, overjoyed) to help that
>>> person along.
>>>
>>> Glenn
>>>