[ha-clusters-discuss] Two node cluster limitations (SC 3.2)

Steve McKinty Steve.McKinty at Sun.COM
Wed Nov 14 08:25:56 PST 2007


Hi,

msl wrote:
> Hello all,
>  While running some tests with SC 3.2, I ran into some situations that are, at the least, interesting.
> 
>  1- First I shut down both nodes, and then tried to boot just
> "one" node. To my surprise, the node did not boot. The message said
> that the "other node is unreachable through this path". I waited more
> than an hour, in case it was a timeout or something, and then booted
> the other node too. After that, the cluster came online again.

This depends on the order in which you do the shutdown and reboot, and is
to protect against cluster amnesia.

Consider the following sequence:

- Cluster is running with nodes A and B
- Node A shuts down
- Node B continues to run for a time, and some changes might be made.
- Node B is shut down.
- Node A is restarted alone.

At this point, Node A has no information about what happened in the cluster
while it was down. If Node A were allowed to start alone, the cluster would
come back having apparently 'forgotten' some of what was done on Node B (for
example, changed RG properties). It is not certain that any information
*has* changed, of course, but it is possible, and because the cluster cannot
be sure that nothing changed, it will not allow Node A to be started alone.
Only with both nodes up can the necessary resynchronization be done.
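
As a rough illustration of the sequence above, here is a minimal Python
sketch. It is a toy model only: the Node class, the apply_change helper,
the version counter, and the "rg_failback" key are all hypothetical and do
not correspond to the actual Sun Cluster configuration repository.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    running: bool = True
    # Local copy of the cluster configuration (hypothetical stand-in).
    config: dict = field(default_factory=dict)
    version: int = 0


def apply_change(nodes, key, value):
    """Record a configuration change on every node that is currently up."""
    for node in nodes:
        if node.running:
            node.config[key] = value
            node.version += 1


a = Node("A")
b = Node("B")

apply_change([a, b], "rg_failback", "false")  # both nodes record this change
a.running = False                             # Node A shuts down
apply_change([a, b], "rg_failback", "true")   # only Node B records this one
b.running = False                             # Node B is shut down

# If Node A were restarted alone, it would come up with its older copy.
stale = a.version < b.version
```

Here Node A ends at version 1 and Node B at version 2, so a cluster formed
by Node A alone would have lost the second change.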

Normally the cluster software uses the shared storage (quorum disk) to
keep track of which nodes were in the cluster at shutdown, so that it
can detect this scenario. This is one of the problems we face if
we want to create a "shared-nothing" cluster, i.e. one where the shared
data is based on two sets of data that are replicated between nodes
(e.g. with iSCSI or AVS). With no common disk to hold that record, it
can be difficult to detect the potential problem.
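
The idea of tracking membership on shared storage can be sketched as
follows. This is only an assumption-laden model: the may_boot_alone
function and the set-based record are hypothetical and are not the real
on-disk format or API used by Sun Cluster.

```python
def may_boot_alone(node, last_members):
    """Return True if `node` may form the cluster by itself.

    last_members is the set of node names recorded on shared storage as
    cluster members at the time of the last shutdown (hypothetical format).
    """
    return node in last_members


# Membership record as it evolves on the shared (quorum) disk.
record = {"A", "B"}      # cluster running with both nodes
record = record - {"A"}  # Node A shuts down; B remains the sole member
# Node B shuts down last, so the record still names only B.

a_allowed = may_boot_alone("A", record)
b_allowed = may_boot_alone("B", record)
```

In this model Node B, the last node to leave, may restart alone, while
Node A must wait for Node B. In a shared-nothing setup each node would
hold its own copy of the record, and Node A's copy could itself be out of
date.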


Steve




>  
>  2- In another case, I cut the power to one node and, to my surprise (again), the other node crashed too (rebooted). I then thought, "Now what? The node will not boot because of case (1) above"... but I was wrong; this time the node booted fine.
> 
>  The environment is a two-node Sun Cluster 3.2 with just "one" cluster interconnect interface.
> 
>  Testing "evacuate" or "switch" just works. The problem comes when I try to simulate "real" failures.
> 
>  So, the questions are:
>   a) Is case (1) expected? How can I handle it in a real-world scenario?
>   b) And case (2)?
>   c) In the above configuration, what can and can't I expect in failover/switchback scenarios? I mean, which failures are covered in such a configuration? How many servers can crash, and is there an order to respect (for shutdown)?
>   
>  I know all of this is probably obvious to you, and I expect there is an explanation for it all, but I would like to understand it so that I am aware of the behaviour.
> 
>  Thanks for your time!
> 
>  Leal.
