RHCS: Quorum disk and heuristics

# Tested on RHEL 6

# A quorum disk is usually used as a tie-breaker to determine which node should be fenced
# in case of problems.

# It adds a number of votes to the cluster so that, for instance, a
# "last-man-standing" scenario can be configured.

# The node with the lowest nodeid that is currently alive will become the "master",
# which is responsible for casting the votes assigned to the quorum disk as well as
# for handling evictions of dead nodes.

# Every node of the cluster writes at regular intervals to its own block on the
# quorum disk to show itself as available; if a node fails to update its block, it
# will be considered unavailable and will be evicted. This is useful to determine
# whether a node that does not respond over the network is really down or just
# having network problems. The 'cman' network timeout for evicting nodes should be
# set at least twice as high as the timeout for evicting nodes based on their
# quorum disk updates.
# From RHEL 6.3 on, a node that can communicate over the network but has problems
# writing to the quorum disk will send a message to the other cluster nodes and
# will avoid being evicted from the cluster.

# The 'cman' network timeout, known as the "Totem Timeout", can be set by adding
#   <totem token="timeout_in_ms"/>  to /etc/cluster/cluster.conf

# The quorum disk has to be at least 10 MB in size and it has to be available to
# all nodes.
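
# A quick sanity check, assuming /dev/sdb is the shared device (adjust to your
# environment) -- run this on every node:

blockdev --getsize64 /dev/sdb   # size in bytes; must be >= 10485760 (10 MB)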

# A quorum disk may be especially useful in these configurations:
#
# - Two-node clusters with a separate network for cluster communications and
#   fencing. The "master" node will win any fence race. From RHEL 5.6 and
#   RHEL 6.1 on, delayed fencing should be used instead.
#
# - Last-man-standing clusters (see the vote math sketched below).
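
#   As a sketch of the vote math (values are illustrative): in a 4-node cluster
#   where every node has 1 vote, giving the quorum disk N-1 = 3 votes allows the
#   cluster to stay quorate down to a single surviving node:
#
#      <cman expected_votes="7"/>
#      <quorumd interval="3" label="quorum_disk" tko="8" votes="3"/>
#
#   Quorum is floor(7/2)+1 = 4 votes; one node (1 vote) plus the quorum disk
#   (3 votes) still reaches it.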




# Configuring a quorum disk
# ------------------------------------------------------------------------------------------

# Take a disk or partition that is available to all nodes and run the following
# command, mkqdisk -c <device> -l <label>:

mkqdisk -c /dev/sdb -l quorum_disk
   mkqdisk v3.0.12.1

   Writing new quorum disk label 'quorum_disk' to /dev/sdb.
   WARNING: About to destroy all data on /dev/sdb; proceed [N/y] ? y
   Initializing status block for node 1...
   Initializing status block for node 2...
   [...]
   Initializing status block for node 15...
   Initializing status block for node 16...


# Check (visible from all nodes)

mkqdisk -L
   mkqdisk v3.0.12.1

   /dev/block/8:16:
   /dev/disk/by-id/ata-VBOX_HARDDISK_VB0ea68140-6869d321:
   /dev/disk/by-id/scsi-SATA_VBOX_HARDDISK_VB0ea68140-6869d321:
   /dev/disk/by-path/pci-0000:00:0d.0-scsi-1:0:0:0:
   /dev/sdb:
           Magic:                eb7a62c2
           Label:                quorum_disk
           Created:              Thu Jul 31 16:57:36 2014
           Host:                 nodeB
           Kernel Sector Size:   512
           Recorded Sector Size: 512




# Scoring & Heuristics
# ------------------------------------------------------------------------------------------

# As an option, one or more heuristics can be added to the cluster configuration.
# Heuristics are tests run prior to accessing the quorum disk. These are sanity checks for
# the system. If the heuristic tests fail, then qdisk will, by default, reboot the node in
# an attempt to restart the machine in a better state.

# We can configure up to 10 purely arbitrary heuristics. It is generally a good idea to
# have more than one heuristic. By default, only nodes scoring over 1/2 of the total
# maximum score will claim they are available via the quorum disk, and a node whose score
# drops too low will remove itself (usually, by rebooting).

# The heuristics themselves can be any command executable by "sh -c".

# Typically, the heuristics should be snippets of shell code or commands which help
# determine a node's usefulness to the cluster or clients. Ideally, we want to add
# checks (e.g. pings) for all of our network paths, and methods to detect the
# availability of shared storage.
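
# A heuristic does not have to be a single command; a small script works too. A
# minimal sketch, assuming a hypothetical /usr/local/bin/check_gw.sh installed and
# executable on every node:

cat > /usr/local/bin/check_gw.sh << 'EOF'
#!/bin/sh
# Succeed (exit 0) only if the default gateway answers one ping within 1 second
GW=$(ip route | awk '/^default/ {print $3; exit}')
[ -n "$GW" ] || exit 1
ping -c1 -w1 "$GW" >/dev/null 2>&1
EOF
chmod +x /usr/local/bin/check_gw.sh

# It would then be referenced in cluster.conf as:
#   <heuristic program="/usr/local/bin/check_gw.sh" tko="8"/>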




# Adding the quorum disk to our cluster
# ------------------------------------------------------------------------------------------

# On 'luci' management console (first, check the box under Homebase --> Preferences to have
# access to "Expert" mode), go to cluster administration --> Configure --> QDisk, check
# "Use a Quorum Disk" and "By Device Label", and enter the label given to the quorum disk.
# Define a TKO (Times to Knock Out), the number of votes and the interval for the quorum
# disk to be updated by every node.

# The interval (timeout) of the qdisk is by default 1 second. If the load on the system is
# high, it is very easy for the qdisk cycle to take more than 1 second (I'll set it to 3).

# Totem token is set to 10 seconds by default, which is too short in most cases. A
# simple rule is to configure the totem timeout to "a little bit" more than
# 2 x qdiskd's timeout, where qdiskd's timeout is interval x tko (here
# 3 x 8 = 24 seconds, so anything above 48 seconds works). I'll set it to 70
# seconds (70000 ms).

# After adding the quorum disk on the Luci console, we'll have the following
# entries in our /etc/cluster/cluster.conf file:

#        <quorumd interval="3" label="quorum_disk" tko="8" votes="1"/>
#        <totem token="70000"/>

# On the command line, we can run the following commands to obtain the same result:

ccs -f /etc/cluster/cluster.conf --settotem token=70000
ccs -f /etc/cluster/cluster.conf --setquorumd interval=3 label=quorum_disk tko=8 votes=1

# If we had defined heuristics in the "Heuristics" section, by entering the
# heuristic program (I'll use three pings to different servers as heuristics),
# interval, score and tko, we'd have the following:

#   <quorumd interval="3" label="quorum_disk" tko="8" votes="1">
#      <heuristic program="/sbin/ping nodeC -c1 -w1" tko="8"/>
#      <heuristic program="/sbin/ping nodeD -c1 -w1" tko="8"/>
#      <heuristic program="/sbin/ping nodeE -c1 -w1" tko="8"/>
#   </quorumd>
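
# From the command line, the heuristics should be addable with ccs as well -- a
# sketch based on its --addheuristic option (check ccs(8) for the exact syntax):

ccs -f /etc/cluster/cluster.conf --addheuristic program="/sbin/ping nodeC -c1 -w1" tko=8
ccs -f /etc/cluster/cluster.conf --addheuristic program="/sbin/ping nodeD -c1 -w1" tko=8
ccs -f /etc/cluster/cluster.conf --addheuristic program="/sbin/ping nodeE -c1 -w1" tko=8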


# Once our quorum disk is configured, the option "expected_votes" must be adapted
# and the option "two_node" is no longer necessary, so we have to change the
# following line in the cluster.conf file:
#
      <cman expected_votes="1" two_node="1">
# by
      <cman expected_votes="3">
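
# With ccs, the same change should be achievable with --setcman (note that, as
# with the other --set options, cman attributes not listed are reset to their
# defaults, which conveniently drops "two_node"):

ccs -f /etc/cluster/cluster.conf --setcman expected_votes=3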


# Do not forget to propagate the changes to the rest of the nodes in the cluster

ccs -h nodeA -p myriccipasswd --sync --activate
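
# ccs can also verify that all nodes see the same configuration (assuming ricci is
# running on every node):

ccs -h nodeA --checkconf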




# Quorum disk timings
# ------------------------------------------------------------------------------------------


# Qdiskd should not be used in environments requiring failure detection times of less than
# approximately 10 seconds.
#
# Qdiskd will attempt to automatically configure timings based on the totem timeout
# and the TKO. If configuring manually, Totem's token timeout must be set to a value
# at least 1 interval greater than the following function:
#
# interval * (tko + master_wait + upgrade_wait)
#
# So, if you have an interval of 2, a tko of 7, a master_wait of 2 and an
# upgrade_wait of 2, the token timeout should be at least 24 seconds (24000 msec).
#
# It is recommended to have at least 3 intervals to reduce the risk of quorum loss
# during heavy I/O load. As a rule of thumb, using a totem timeout of more than 2x
# qdiskd's timeout will result in good behavior.
#
# An improper timing configuration will cause CMAN to give up on qdiskd, causing a
# temporary loss of quorum during master transition.
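
# A quick arithmetic check of the example above (master_wait and upgrade_wait are
# taken from the example, not from your running configuration):

interval=2 tko=7 master_wait=2 upgrade_wait=2
echo $(( interval * (tko + master_wait + upgrade_wait) + interval ))   # -> 24 (seconds)

# With our values (interval=3, tko=8) and the same waits, the minimum would be
# 3 * (8 + 2 + 2) + 3 = 39 seconds, so the 70000 ms token configured earlier
# leaves a comfortable margin.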




# Show basic cluster information
# ------------------------------------------------------------------------------------------

# Before adding a quorum disk:

cat /etc/cluster/cluster.conf
   <?xml version="1.0"?>
   <cluster config_version="25" name="mycluster">
      <clusternodes>
         <clusternode name="nodeA" nodeid="1"/>
         <clusternode name="nodeB" nodeid="2"/>
      </clusternodes>
      <cman expected_votes="1" two_node="1">
         <multicast addr="239.192.XXX.XXX"/>
      </cman>
      <rm log_level="7"/>
</cluster>


clustat   # (RGManager cluster)
   Cluster Status for mycluster @ Thu Jul 31 17:13:41 2014
   Member Status: Quorate

    Member Name                                                     ID   Status
    ------ ----                                                     ---- ------
    nodeA                                                              1 Online, Local
    nodeB                                                              2 Online


cman_tool status
   Version: 6.2.0
   Config Version: 25
   Cluster Name: mycluster
   Cluster Id: 4946
   Cluster Member: Yes
   Cluster Generation: 168
   Membership state: Cluster-Member
   Nodes: 2
   Expected votes: 1
   Total votes: 2
   Node votes: 1
   Quorum: 1
   Active subsystems: 8
   Flags: 2node
   Ports Bound: 0 11
   Node name: nodeA
   Node ID: 1
   Multicast addresses: 239.192.XXX.XXX
   Node addresses: XXX.XXX.XXX.XXX



# After adding a quorum disk:

cat /etc/cluster/cluster.conf
   <?xml version="1.0"?>
   <cluster config_version="26" name="mycluster">
      <clusternodes>
         <clusternode name="nodeA" nodeid="1"/>
         <clusternode name="nodeB" nodeid="2"/>
      </clusternodes>
      <cman expected_votes="3">
         <multicast addr="239.192.XXX.XXX"/>
      </cman>
      <rm log_level="7"/>
      <quorumd interval="3" label="quorum_disk" tko="8" votes="1">
         <heuristic program="/sbin/ping nodeC -c1 -w1" tko="8"/>
         <heuristic program="/sbin/ping nodeD -c1 -w1" tko="8"/>
         <heuristic program="/sbin/ping nodeE -c1 -w1" tko="8"/>
      </quorumd>
      <totem token="70000"/>
</cluster>


clustat   # (RGManager cluster)
   Cluster Status for mycluster @ Thu Jul 31 17:20:07 2014
   Member Status: Quorate

    Member Name                                                     ID   Status
    ------ ----                                                     ---- ------
    nodeA                                                              1 Online, Local
    nodeB                                                              2 Online
    /dev/sdb                                                           0 Online, Quorum Disk


cman_tool status
   Version: 6.2.0
   Config Version: 28
   Cluster Name: mycluster
   Cluster Id: 4946
   Cluster Member: Yes
   Cluster Generation: 168
   Membership state: Cluster-Member
   Nodes: 2
   Expected votes: 3
   Quorum device votes: 1
   Total votes: 3
   Node votes: 1
   Quorum: 2
   Active subsystems: 11
   Flags:
   Ports Bound: 0 11 177 178
   Node name: nodeA
   Node ID: 1
   Multicast addresses: 239.192.XXX.XXX
   Node addresses: XXX.XXX.XXX.XXX