[celeste-discuss] some questions

Kevin Fox kev at sun.com
Thu Sep 11 09:51:58 PDT 2008


Thanks for the detailed explanation, Glenn.  I'm still trying to get my
head around the architecture/nomenclature, so bear with me.  Just to make
sure I understand: in the current algorithm, N is equal to the total
number of replicas (the paranoid use case), and a write must update all
replicas before it succeeds, except for mutable objects, whose writes
succeed once a quorum of replicas has been stored.
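
To put my reading of that rule in concrete terms, here is a minimal
sketch.  The names (WriteStatus, write_complete, and the field names) are
mine, purely for illustration; they are not Celeste's actual API, and the
real bookkeeping is surely more involved.

```python
from dataclasses import dataclass


@dataclass
class WriteStatus:
    """Hypothetical bookkeeping for one object being written."""
    replicas_total: int   # N: copies requested for this object
    replicas_stored: int  # copies confirmed stored so far
    mutable: bool         # True only for the AObject -> VObject mapping
    quorum: int = 0       # quorum size; only meaningful when mutable


def write_complete(status: WriteStatus) -> bool:
    """Return True once the write may be reported as successful.

    Immutable objects (A, V, B) wait for all N copies; the mutable
    object only waits for a quorum of its replicas.
    """
    if status.mutable:
        return status.replicas_stored >= status.quorum
    return status.replicas_stored >= status.replicas_total
```

So an immutable object with 2 of 3 copies stored is still pending, while a
mutable object whose quorum size is 2 would already count as written.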

Your next putback will allow modifying the number of replicas and the
quorum size for mutable objects on a per-file basis, and at some point
lazy replication policies could be implemented, but they are not
currently in the works.

Is that correct?
Kev
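
P.S. A quick worked sketch of the replica arithmetic from your message
below, mostly to check my own understanding.  The formula 3f+2b+1 is
yours.  The quorum size q is left as an input because I don't know how
Celeste derives it, and the function names are made up.  The overlap
bound is plain counting: any two subsets of size q drawn from n replicas
share at least 2q - n members.

```python
def replica_count(f: int, b: int) -> int:
    """Number of mutable-object replicas: 3f + 2b + 1.

    f: maximum tolerable number of absent/failed replicas
    b: maximum number of malevolent replicas
    """
    return 3 * f + 2 * b + 1


def min_overlap(n: int, q: int) -> int:
    """Fewest replicas that any two size-q subsets of n replicas share.

    q is an assumed quorum size, not a value taken from Celeste.
    """
    return max(0, 2 * q - n)


n = replica_count(f=1, b=1)           # 3 + 2 + 1 = 6 replicas
print(n, min_overlap(n, 4), min_overlap(n, 3))
```

With f=1 and b=1 there are 6 replicas.  Subsets of 4 always share at least
2 of them, whereas subsets of 3 can be disjoint, which is the case where a
read could miss the latest write entirely.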

On Sep 11, 2008, at 9:07 AM, Glenn Scott wrote:

> Hi Kevin,
>
>  When a file is stored in Celeste, four different kinds of data objects
> are created: blocks (BObject), which contain the actual data of the
> file; the manifest (VObject), which records which blocks comprise the
> file; an "anchor" object (AObject), which is the top-level
> representative of the file for its lifetime; and one mutable object,
> which records the mapping of the AObject to the object-id of the
> current VObject.
>
> The A, V, and B objects are all self-verifying and immutable.  None of
> them requires a quorum interaction to determine the authoritative value.
> To get the content of any of the A, V, or B objects it is sufficient to
> fetch just one copy.  Getting or setting the value of the mutable
> object requires a quorum interaction, but it is done twice per file
> operation, not N times.
>
> When writing a file, Celeste will make the N copies of the A, V, and B
> objects before returning the result of the write to the client.  This is
> how a single write turns into effectively N writes.  The client is
> assured that when the operation returns, the file is stored and
> replicated.
>
> Now, there are clear ways to make improvements here.
>
> The most obvious is to control the value of N and optimise it for the
> Celeste system.  Some Celeste systems need to be very paranoid, others
> don't.  Using a paranoid value for N in a system that doesn't need it is
> wasteful.  My next checkin rewrites the way we specify the replication
> parameters, giving each file's creator control over how many copies of
> each object type to store (including the size of the quorums for the
> mutable objects).
>
> Another is lazy replication.  For example, you write a file and Celeste
> makes a single copy of each data object and lazily replicates the
> objects over time.  The risk is that a write may succeed from a client's
> perspective, yet a node that holds one of the single copies of a data
> object has a permanent failure before that object is replicated.
>
> Now on the matter of how the quorums work, it is the case that a write
> to a mutable object succeeds when only a quorum number of replicas of
> the mutable object are stored.  The quorum mechanism initially creates a
> set of mutable object replicas.  The number of these replicas is
> 3f+2b+1, where f is the maximum tolerable number of absent/failed
> replicas, and b is the maximum number of malevolent replicas.
>
> A successful write must update at least a quorum number of replicas,
> lest a subsequent read select a different subset of the replicas and
> either have to work with an ambiguous value or miss the latest write
> entirely.
>
> Glenn
>
>
> On Sep 11, 2008, at 7:54 AM, Kevin Fox wrote:
>
>> Pardon my joining the discussion a bit late as I'm a new subscriber.
>> I had a question regarding the following:
>>
>> On Sep 9, 2008, at 11:47, Glenn Scott wrote:
>>
>>> Writes are the slowest because each write turns into about N writes,
>>> where N is the replication factor.  Reads are much faster; we are able
>>> to read files at an aggregate of about 19Mbs.  I know this because
>>> we've been playing movies stored in Celeste.
>>
>> Shouldn't a write succeed after a quorum of replications have returned?
>> That would leave the remaining replicas to synchronize at their leisure
>> (i.e. if they were more remote).
>>
>> Thinking about it more, it should be possible to allow fewer than a
>> quorum of replicas if there were some guarantees about the state of the
>> non-synchronized replicas, but it would seem that a simple quorum would
>> be sufficient.  Or are there some assumptions that the Beehive/Celeste
>> quorum algorithms rely on that do not allow that type of replica
>> relaxation?
>>
>> Kev
>>
>>> From: Glenn.Scott at sun.com (Glenn Scott)
>>> Date: Tue, 09 Sep 2008 11:47:20 -0700
>>> Subject: [celeste-discuss]  some questions
>>>
>>> Hi Jure,
>>> Celeste is a fully supported project here in Sun Labs.  A long-term
>>> roadmap is still in flux because we are learning new things about how
>>> people are using Celeste.  I know the project still has a "fresh from
>>> the labs" feel, and the first item on a short-term roadmap is to get
>>> better documentation and interfaces available to make it easier to use.
>>> I have plans for at least two new interfaces: XML-RPC and WebDAV.
>>> They are under development now, but I'm not sure when they will be
>>> completed.  I am inviting folks to help...
>>>
>>> Our original goal for the system to work in harsh (very harsh)
>>> environments has come at the cost of performance.  Celeste is very
>>> "paranoid" about what it does and what it checks, and that comes at
>>> the cost of performance.  I realise that we need a way to adjust this
>>> paranoia to let people use the system in less harsh environments.
>>>
>>> For example, until last weekend, Celeste would sign every write to a
>>> file and check every signature.  This is to prevent or detect replays
>>> of old writes to a file by a malicious node that was previously
>>> involved in a successful write.  If that risk is not a concern,
>>> there's really no need to spend the time performing these
>>> computations.  So now, you can create files that do not have signed
>>> writes.  Some performance improvement there.
>>>
>>> There is a LOT of room for performance improvements, but it will
>>> never perform at a comparable level to a local filesystem.
>>>
>>> Writes are the slowest because each write turns into about N writes,
>>> where N is the replication factor.  Reads are much faster; we are able
>>> to read files at an aggregate of about 19Mbs.  I know this because
>>> we've been playing movies stored in Celeste.
>>>
>>> It is possible to specify that data needs to be present at a minimum
>>> number of locations.  We need to make this easier to use (it is the
>>> replication-parameter in the command line interfaces).  You will
>>> specify the number of copies of objects and what technique to use.  I
>>> will also include the ability to specify things like 1 copy 'nearby'
>>> and 2 'far away', which gives you more control over performance versus
>>> availability.
>>>
>>> Yes, there will definitely be control over which nodes get to join a
>>> Celeste system.  We used to have a token device that needed to be
>>> present for a node to join, but that will not scale.  The idea now is
>>> to require the SSL certificate of each permitted node to be present in
>>> a list.
>>>
>>> There is a lot more in store.  Our main focus right now is better
>>> documentation, interfaces, and control over the tradeoffs between
>>> performance and availability.
>>>
>>> Community members can influence priorities for Celeste development by
>>> making comments such as yours or -- better yet -- by making
>>> contributions to the system.  If someone, for example, wanted to work
>>> on adding mechanisms to control which nodes can join a Celeste
>>> confederation, we would be happy (indeed, overjoyed) to help that
>>> person along.
>>>
>>> Glenn
>>>
>> _______________________________________________
>> celeste-discuss mailing list
>> celeste-discuss at opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/celeste-discuss
>
