RHCS: Install a two-node basic cluster

# Tested on CentOS 7

# Note: In the commands below, [ALL] indicates that the command has to be run on both
#       nodes and [ONE] indicates that it only needs to be run on one of the hosts.

# The cluster installed here uses Pacemaker and Corosync to provide resource management
# and messaging.
# Pacemaker is a resource manager which, among other capabilities, is able to detect and
# recover from the failure of various nodes, resources and services under its control by
# using the messaging and membership capabilities provided by the chosen cluster
# infrastructure (either Corosync or Heartbeat).
# Pacemaker main features:
# - Detection and recovery of node and service-level failures
# - Storage agnostic, no requirement for shared storage
# - Resource agnostic, anything that can be scripted can be clustered
# - Supports fencing for ensuring data integrity
# - Supports large and small clusters
# - Supports both quorate and resource-driven clusters
# - Supports practically any redundancy configuration
# - Automatically replicated configuration that can be updated from any node
# - Ability to specify cluster-wide service ordering, colocation and anti-colocation
# - Support for advanced service types
#       - Clones: for services which need to be active on multiple nodes
#       - Multi-state: for services with multiple modes
#            (e.g. master/slave, primary/secondary) 
# - Unified, scriptable cluster management tools 
# Pacemaker components:
# - Cluster Information Base (CIB)
# - Cluster Resource Management daemon (CRMd)
# - Local Resource Management daemon (LRMd)
# - Policy Engine (PEngine or PE)
# - Fencing daemon (STONITHd - "Shoot-The-Other-Node-In-The-Head")

# ------------------------------------------------------------------------------------------

# If a cluster splits into two (or more) groups of nodes that can no longer communicate
# with each other, quorum is used to prevent resources from starting on more nodes than
# desired, which would risk data corruption.
# A cluster has quorum when more than half of all known nodes are online in the same
# partition (group of nodes).
# For example, if a 5-node cluster split into 3- and 2-node partitions, the 3-node
# partition would have quorum and could continue serving resources. If a 6-node cluster
# split into two 3-node partitions, neither partition would have quorum; pacemaker’s
# default behavior in such cases is to stop all resources, in order to prevent data
# corruption.
# Two-node clusters are a special case. By the above definition, a two-node cluster would
# only have quorum when both nodes are running. This would make the creation of a two-node
# cluster pointless, but corosync has the ability to treat two-node clusters as if only
# one node is required for quorum.
# The pcs cluster setup command will automatically configure two_node: 1 in corosync.conf,
# so a two-node cluster will "just work".
# Depending on the version of corosync, you may have to ignore quorum at
# the pacemaker level, using pcs property set no-quorum-policy=ignore.
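#
# As a sketch of the majority rule described above, the shell function below reports
# whether a partition has quorum. The name has_quorum and its two arguments are
# illustrative only (they are not part of pacemaker or corosync), and the two_node
# special case is deliberately left out.

```shell
# has_quorum TOTAL ONLINE
# Succeeds (exit 0) when ONLINE is more than half of the TOTAL known nodes.
has_quorum() {
    [ $(( $2 * 2 )) -gt "$1" ]
}

# The splits discussed in the text: 5 nodes as 3/2, 6 nodes as 3/3, 2 nodes with 1 left.
for pair in "5 3" "5 2" "6 3" "2 1"; do
    total=${pair% *}
    online=${pair#* }
    if has_quorum "$total" "$online"; then
        echo "$online of $total nodes: quorum"
    else
        echo "$online of $total nodes: no quorum"
    fi
done
```

# The last case (1 of 2 nodes) shows why a two-node cluster needs the two_node setting:
# under the plain majority rule a single surviving node never has quorum.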

# ------------------------------------------------------------------------------------------

# First of all, make sure that the two nodes are reachable on their IP addresses and that
# they are known by their names:

root@nodeA:/root#> cat /etc/hosts | egrep "nodeA|nodeB"
<IP address of nodeA> nodeA
<IP address of nodeB> nodeB

root@nodeA:/root#> ssh nodeB
root@nodeB's password:
Last login: Wed Jan 24 14:13:38 2018 from

root@nodeB:/root#> ssh nodeA
root@nodeA's password:
Last login: Wed Jan 24 14:13:38 2018 from
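
# The name check above can also be scripted. The sketch below assumes the node names
# nodeA and nodeB used throughout this document; the helper name hosts_entry_for is made
# up for illustration. It prints the address that a hosts file maps to a given name.

```shell
# hosts_entry_for NAME [FILE]
# Prints the first address that FILE (default /etc/hosts) maps to NAME.
# Exits non-zero if NAME is not listed.
hosts_entry_for() {
    awk -v name="$1" '
        /^[ \t]*#/ { next }                     # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break            # stop at an inline comment
                if ($i == name) { print $1; found = 1; exit }
            }
        }
        END { exit found ? 0 : 1 }
    ' "${2:-/etc/hosts}"
}
```

# Possible usage on each node:
#   for n in nodeA nodeB; do hosts_entry_for "$n" || echo "$n is missing"; done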

# In order to facilitate communications, disable SELinux and the firewall service

# This may create significant security issues and should not be performed on machines
# that may be exposed to the outside world, but may be appropriate during development and
# testing on a protected host. 

[ALL] sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
[ALL] setenforce 0

[ALL] systemctl stop firewalld
[ALL] systemctl disable firewalld
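
# To confirm that the edit to /etc/selinux/config took effect, the configured mode can be
# read back. This is a minimal sketch; the helper name selinux_config_mode is
# hypothetical. Note that the sed command above only matches SELINUX=enforcing, so a
# host already set to permissive is left unchanged.

```shell
# selinux_config_mode [FILE]
# Prints the value of the SELINUX= setting in FILE (default /etc/selinux/config).
# The SELINUXTYPE= line and commented lines are not matched.
selinux_config_mode() {
    sed -n 's/^SELINUX=//p' "${1:-/etc/selinux/config}"
}
```

# Possible usage:
#   [ALL] selinux_config_mode     # mode that applies after the next reboot
#   [ALL] getenforce              # mode currently in force (Permissive after setenforce 0)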

# Install the needed packages

[ALL] yum install pacemaker pcs resource-agents

# Start (and enable) the pcs daemon on both nodes

[ALL] systemctl start pcsd.service
[ALL] systemctl enable pcsd.service

# Configure pcs authentication

[ALL] echo "mypassword" | passwd --stdin hacluster
Changing password for user hacluster.
passwd: all authentication tokens updated successfully.

[ONE] pcs cluster auth nodeA nodeB -u hacluster -p mypassword --force
nodeA: Authorized
nodeB: Authorized

# Create the cluster and populate it with the nodes

[ONE] pcs cluster setup --force --name lar_cluster nodeA nodeB
Destroying cluster on nodes: nodeA, nodeB...
nodeA: Stopping Cluster (pacemaker)...
nodeB: Stopping Cluster (pacemaker)...
nodeA: Successfully destroyed cluster
nodeB: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'nodeA', 'nodeB'
nodeA: successful distribution of the file 'pacemaker_remote authkey'
nodeB: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
nodeA: Succeeded
nodeB: Succeeded

Synchronizing pcsd certificates on nodes nodeA, nodeB...
nodeA: Success
nodeB: Success
Restarting pcsd on the nodes in order to reload the certificates...
nodeA: Success
nodeB: Success

# Start the cluster

[ONE] pcs cluster start --all
nodeA: Starting Cluster...
nodeB: Starting Cluster...

# Enable necessary daemons so the cluster starts automatically on boot-up

[ALL] systemctl enable corosync.service
[ALL] systemctl enable pacemaker.service

# Verify corosync installation

[ONE] corosync-cfgtool -s
Printing ring status.
Local node ID 1
        id      =
        status  = ring 0 active with no faults

[ONE] corosync-cmapctl | grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
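
# The member list above lends itself to a scripted check. The sketch below counts the
# members whose status is "joined" in corosync-cmapctl output read from standard input.
# The function name count_joined_members is an assumption made for this example.

```shell
# count_joined_members
# Reads corosync-cmapctl output on stdin and prints how many members are joined.
count_joined_members() {
    grep -c 'members\.[0-9][0-9]*\.status (str) = joined'
}
```

# Possible usage, expecting 2 on a healthy two-node cluster:
#   [ONE] corosync-cmapctl | count_joined_members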

# To check the status of the cluster, run either of the two following commands

[ONE] pcs status
Cluster name: lar_cluster
WARNING: no stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: nodeB (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Thu Feb  8 17:39:33 2018
Last change: Thu Feb  8 17:37:46 2018 by hacluster via crmd on nodeB

2 nodes configured
0 resources configured

Online: [ nodeA nodeB ]

No resources

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[ONE] crm_mon -1
Stack: corosync
Current DC: nodeB (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Thu Feb  8 17:39:46 2018
Last change: Thu Feb  8 17:37:46 2018 by hacluster via crmd on nodeB

2 nodes configured
0 resources configured

Online: [ nodeA nodeB ]

No active resources
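
# For monitoring scripts, the two facts of interest in the status output (quorum and the
# list of online nodes) can be extracted from the text. This is only a sketch that assumes
# the output layout shown above; the function names partition_has_quorum and online_nodes
# are hypothetical.

```shell
# partition_has_quorum
# Reads "pcs status" or "crm_mon -1" output on stdin; succeeds if the
# "Current DC" line reports a partition with quorum.
partition_has_quorum() {
    grep -q 'partition with quorum'
}

# online_nodes
# Reads the same output on stdin and prints the names found between the
# brackets of the "Online: [ ... ]" line, one per line.
online_nodes() {
    awk '/^Online: \[/ { for (i = 3; i < NF; i++) print $i }'
}
```

# Possible usage:
#   [ONE] crm_mon -1 | partition_has_quorum && echo "quorate"
#   [ONE] crm_mon -1 | online_nodes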

# Voilà! We have installed our basic cluster

# The raw cluster configuration can be shown by using the following command:

[ONE] pcs cluster cib
<cib crm_feature_set="3.0.12" validate-with="pacemaker-2.8" epoch="5" num_updates="7" admin_epoch="0" cib-last-written="Thu Feb  8 17:37:46 2018" update-origin="nodeB" update-client="crmd" update-user="hacluster" have-quorum="1" dc-uuid="2">
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.16-12.el7-94ff4df"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
        <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="lar_cluster"/>
      <node id="1" uname="nodeA"/>
      <node id="2" uname="nodeB"/>
    <node_state id="2" uname="nodeB" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
      <lrm id="2">
      <transient_attributes id="2">
        <instance_attributes id="status-2">
          <nvpair id="status-2-shutdown" name="shutdown" value="0"/>
    <node_state id="1" uname="nodeA" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
      <lrm id="1">
      <transient_attributes id="1">
        <instance_attributes id="status-1">
          <nvpair id="status-1-shutdown" name="shutdown" value="0"/>

# If we ever change the configuration manually, we can check the correctness
# of the XML by running this command:

[ONE] crm_verify -L -V
   error: unpack_resources:     Resource start-up disabled since no STONITH resources have been defined
   error: unpack_resources:     Either configure some or disable STONITH with the stonith-enabled option
   error: unpack_resources:     NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid

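# These errors are no longer reported once fencing is disabled (for example with
# pcs property set stonith-enabled=false, which is only appropriate on a test cluster).
# Cluster logs can be found in /var/log/pacemaker.log and /var/log/cluster/corosync.log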
root@nodeA:/root#> cat /var/log/pacemaker.log
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Feb 08 17:37:24 [5487] nodeA pacemakerd:     info: crm_log_init:      Changed active directory to /var/lib/pacemaker/cores
Feb 08 17:37:24 [5487] nodeA pacemakerd:     info: get_cluster_type:  Detected an active 'corosync' cluster
Feb 08 17:37:24 [5487] nodeA pacemakerd:     info: mcp_read_config:   Reading configure for stack: corosync
Feb 08 17:37:24 [5487] nodeA pacemakerd:   notice: crm_add_logfile:   Switching to /var/log/cluster/corosync.log
root@nodeA:/root#> tail -20 /var/log/cluster/corosync.log
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_perform_op:    Diff: +++ 0.5.6 (null)
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_perform_op:    +  /cib:  @num_updates=6
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_process_request:       Completed cib_modify operation for section status: OK (rc=0, origin=nodeB/attrd/3, version=0.5.6)
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_process_request:       Forwarding cib_delete operation for section //node_state[@uname='nodeA']/transient_attributes to all (origin=local/crmd/13)
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_process_request:       Completed cib_delete operation for section //node_state[@uname='nodeA']/transient_attributes: OK (rc=0, origin=nodeA/crmd/13, version=0.5.6)
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_perform_op:    Diff: --- 0.5.6 2
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_perform_op:    Diff: +++ 0.5.7 (null)
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_perform_op:    +  /cib:  @num_updates=7
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_perform_op:    ++ /cib/status/node_state[@id='1']:  <transient_attributes id="1"/>
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_perform_op:    ++                                     <instance_attributes id="status-1">
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_perform_op:    ++                                       <nvpair id="status-1-shutdown" name="shutdown" value="0"/>
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_perform_op:    ++                                     </instance_attributes>
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_perform_op:    ++                                   </transient_attributes>
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_process_request:       Completed cib_modify operation for section status: OK (rc=0, origin=nodeB/attrd/4, version=0.5.7)
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_file_backup:   Archived previous version as /var/lib/pacemaker/cib/cib-3.raw
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_file_write_with_digest:        Wrote version 0.5.0 of the CIB to disk (digest: ef4905fd38cc2a926d5b6c686d3ab21e)
Feb 08 17:37:46 [5488] nodeA          cib:     info: cib_file_write_with_digest:        Reading cluster configuration file /var/lib/pacemaker/cib/cib.clWuxF (digest: /var/lib/pacemaker/cib/cib.gJSY1F)
Feb 08 17:37:51 [5488] nodeA          cib:     info: cib_process_ping:  Reporting our current digest to nodeB: 2cdda87849aa421905eb3901c98cb8c1 for 0.5.7 (0x563aecfc0970 0)
Feb 08 17:37:55 [5493] nodeA          crmd:     info: crm_procfs_pid_of: Found cib active as process 5488
Feb 08 17:37:55 [5493] nodeA          crmd:     info: throttle_send_command:     New throttle mode: 0000 (was ffffffff)
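
# Most of these log lines are info-level. The sketch below keeps only the lines whose
# severity field is warning, error or crit. The helper name log_problems is hypothetical,
# and the list of severity names is an assumption based on the log format shown above.

```shell
# log_problems [FILE]
# Prints the lines of FILE (default /var/log/cluster/corosync.log) whose
# severity field, the word after the daemon name, is warning, error or crit.
log_problems() {
    grep -E ': +(warning|error|crit): ' "${1:-/var/log/cluster/corosync.log}"
}
```

# Possible usage:
#   root@nodeA:/root#> log_problems | tail -20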
