Before launching the Pacemaker cluster, we need to set its basic configuration parameters in /etc/corosync/corosync.conf. This file is organized in several sections.
Here is our configuration file:
totem {
    version: 2
    token: 5000
    token_retransmits_before_loss_const: 10
    join: 1000
    consensus: 2500
    vsftype: none
    max_messages: 20
    send_join: 45
    clear_node_high_bit: yes
    secauth: off
    threads: 0
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.20.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    syslog_facility: daemon
}

amf {
    mode: disabled
}

aisexec {
    user: root
    group: root
}

service {
    name: pacemaker
    ver: 0
    use_mgmtd: yes
    use_logd: yes
}
The most important parameters here are mcastaddr, mcastport, bindnetaddr and name. The mcastaddr and mcastport are the multicast address and port the nodes of the cluster will use to communicate with each other. They must not overlap with those of another cluster that may be running on the same network.
bindnetaddr is the address of the network where the nodes are located.
name is the name of the cluster service to use. In our case, only pacemaker can be used.
The last step is to start the cluster with the /etc/init.d/corosync init script and to make sure that this script is launched automatically at each boot:
# chkconfig --add corosync
# chkconfig --level 3 corosync on
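To start the cluster service right away, a minimal example (assuming the same sysvinit-style system used by the chkconfig commands above) is simply:
# /etc/init.d/corosync start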
This configuration file must be manually updated on each node taking part in the cluster. This is the only configuration step that must be done on all nodes. Once this file is created and the cluster service started on each node, the rest of the configuration is done by talking to the cluster daemon on one node, using the crm command. The cluster daemon then automatically distributes the configuration to all nodes of the cluster.
Once the cluster daemons are running, you can check the status of the cluster by launching the command crm_mon on any node.
At this stage, our cluster is “empty”, but you can already use the crm configure show command to display the current – empty – configuration.
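For instance, on any node (the -1 option makes crm_mon print the status once and exit instead of running interactively):
# crm_mon -1
# crm configure show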
To finish the initialization of the cluster, we will set some global cluster properties:
# crm configure property stonith-enabled=false
# crm configure property no-quorum-policy=ignore
# crm configure rsc_defaults resource-stickiness=100
What we set by doing so is, in the same order:
we disable the use of a STONITH device (Shoot The Other Node In The Head: in case of fail-over, a mechanism that sends a command to completely shut down the other node, to be sure that it really is deactivated)
by default, quorum is lost when half or fewer of the nodes remain in the cluster. In our two-node fail-over cluster, if one node disappears, quorum is automatically lost (only one node survives, which is exactly half of the total of two nodes), and a cluster that has lost quorum will not switch resources anymore. So we need to ignore the loss of quorum to be able to switch the resources, hence this setting.
by setting a positive resource stickiness, we ensure that after a fail-over the resources keep running on the node that survived and are not switched back to the failed node when it returns (switching resources stops them and increases the risk of problems).
Lastly, we will have to create at least one DRBD resource (this happens outside Pacemaker), as described in this other howto.
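As an illustration only, a minimal DRBD resource definition for this setup might look like the following. The resource name iscsi and the device /dev/drbd1 match what is used in the Pacemaker configuration later in this howto; the backing disk /dev/sdb1, the port 7789 and the node IP addresses are assumptions for the example and will differ in your environment.
resource iscsi {
    device    /dev/drbd1;
    disk      /dev/sdb1;
    meta-disk internal;
    on storage1 {
        address 192.168.20.120:7789;
    }
    on storage2 {
        address 192.168.20.121:7789;
    }
}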
In Pacemaker, the resources are managed by resource agents (RA). These are scripts that the cluster executes to perform tasks like starting and stopping a service, requesting its status, checking it, and so on.
To see which classes and providers are available for your resource agents, you can do:
# crm ra classes
heartbeat
lsb
ocf / heartbeat linbit pacemaker redhat
stonith
In the output, the first word of each line is the name of a class defined on your cluster.
The names after the / (slash) are the names of the providers defined on your cluster.
To see which resource agents (RA) are available in a given class/provider, you can do:
# crm ra list ocf heartbeat
This will list all the resource agent scripts available in that class and provider.
To get more information about a resource agent (accepted parameters, default values, mandatory parameters, description, …), you can do:
# crm ra meta ocf:heartbeat:IPaddr2
All OCF resource agents are scripts installed under the /usr/lib/ocf/resource.d directory.
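If you are curious, you can also look at these scripts directly; for example, to list the agents of the heartbeat provider:
# ls /usr/lib/ocf/resource.d/heartbeat/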
Now that we know which RAs to use, we will add the resources needed to build our iSCSI cluster. The RAs we will use are:
ocf:heartbeat:IPaddr2 to assign a (virtual) IP address to the node where the iSCSI target is running
ocf:linbit:drbd to manage DRBD (launch the synchronisation, promote to primary or demote to secondary, …)
ocf:heartbeat:iSCSITarget to set up the iSCSI target
ocf:heartbeat:iSCSILogicalUnit to bind our DRBD partition to the iSCSI target (because more than one LUN can be bound to a target, their configuration has been split into two RAs)
The sequence of commands to create the resources with each RA will be:
# crm configure primitive iscsi-drbd ocf:linbit:drbd \
    params drbd_resource="iscsi" \
    op monitor interval="30s" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="100s"
# crm configure primitive iscsi-ip ocf:heartbeat:IPaddr2 \
    params ip="192.168.20.122" cidr_netmask="32" \
    op monitor interval="30s"
# crm configure primitive iscsi-target ocf:heartbeat:iSCSITarget \
    params iqn="iqn.2011-11.begetest.net" tid="1" \
    op monitor interval="30s"
# crm configure primitive iscsi-lun ocf:heartbeat:iSCSILogicalUnit \
    params target_iqn="iqn.2011-11.begetest.net" lun="2" path="/dev/drbd1"
With the keyword params, you specify the parameters and their values, each name/value pair separated from the next one by a space. The parameter names are the ones displayed by the crm ra meta <ra> command.
With the keyword op, you change the default values of the defined operations (like starting, stopping or monitoring the resource). These operations and their default values are also shown by the crm ra meta <ra> command.
Once the resource is defined, you can modify a given parameter of it by using the following commands:
# crm resource param <resource> set <name> <value>
# crm resource param <resource> delete <name>
# crm resource param <resource> show <name>
to respectively set, delete or display the value of the parameter <name> of the resource <resource>.
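For example, with the iscsi-ip resource defined above, you could display or change the virtual IP address like this (the new address 192.168.20.123 is only an illustration):
# crm resource param iscsi-ip show ip
# crm resource param iscsi-ip set ip 192.168.20.123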
Until now, with the “primitive” configuration directive, we have been defining resources that the cluster starts or stops on a single node. Such a resource can only be started on one node and is stopped on all the other nodes.
As you may guess, for DRBD it has to be slightly different. Indeed, a DRBD resource must be running on both nodes of the cluster, one being primary and the other one being secondary. And when a switch or fail-over occurs, it is not a matter of starting or stopping the resource but of promoting or demoting it.
In Pacemaker, this can be achieved by using the so-called “master-slave” (ms) configuration stanza. For our cluster, the command will be:
# crm configure ms iscsi-meta iscsi-drbd \
    meta master-max="1" master-node-max="1" \
    clone-max="2" clone-node-max="1" notify="true"
In the command above, iscsi-meta is the name we give to this resource, iscsi-drbd being the primitive that must be managed as a master-slave resource. The second line specifies how many masters (primaries) can run at the same time. Obviously, in our two-node setup, we only need one primary in the whole cluster.
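To check which node currently holds the primary role, you can look at the cluster status again: in the crm_mon output, the master-slave set iscsi-meta is listed together with the node currently running as Master.
# crm_mon -1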
As you can see, to run an iSCSI target in our cluster, we need to define four primitives and one master-slave resource. But all these resources, when started, must run together on the same node. There is no point in running the iSCSI target on one node and the DRBD primary on another one.
Also, some resources must be started before others: DRBD must be primary before we set up the target and the logical unit on it, and the target must be started before we start the logical unit.
Ordering is configured with the order configuration stanza:
# crm configure order ip-after-iscsimeta inf: iscsi-meta:promote iscsi-ip:start
# crm configure order lun-after-target inf: iscsi-target:start iscsi-lun:start
# crm configure order target-after-meta inf: iscsi-meta:promote iscsi-target:start
The first value is the identifier we give to the ordering constraint. Then we have the score of the ordering; in this case we use inf: for infinity, meaning that this ordering is mandatory.
Then come the resource and action that must happen first, followed by the resource and action that must happen second.
Then we need to tell Pacemaker that all resources must be kept on the same node when started, or in the primary role for the master-slave resources:
# crm configure colocation ip-with-iscsimeta inf: iscsi-ip iscsi-meta:Master
# crm configure colocation lun-with-meta inf: iscsi-lun iscsi-meta:Master
# crm configure colocation target-with-meta inf: iscsi-target iscsi-meta:Master
The first value is the identifier we give to the colocation constraint. Then we have the score, which is here infinity (meaning mandatory). Then come the resources (with their role after the colon) that must be kept together on the same node.
As explained in the help of the crm command, I have also tried to use more than two resources in a single order and colocation stanza, but it did not lead to the same result as using only a combination of pairwise ordering and colocation constraints. I still don't know why.
And that's it: we have now configured everything needed to let the cluster run our iSCSI target in fail-over mode.
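If you want to test the fail-over without unplugging anything, one simple way (only an illustration, using the node names from the resulting configuration below) is to put the active node in standby, watch the resources move to the other node, then bring it back online:
# crm node standby storage1
# crm_mon -1
# crm node online storage1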
Here is the complete resulting configuration, as displayed by crm configure show:
node storage1 \
    attributes standby="off"
node storage2 \
    attributes standby="off"
primitive iscsi-drbd ocf:linbit:drbd \
    params drbd_resource="iscsi" \
    op monitor interval="30s" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="100s"
primitive iscsi-ip ocf:heartbeat:IPaddr2 \
    params ip="192.168.20.122" cidr_netmask="32" \
    op monitor interval="30s"
primitive iscsi-lun ocf:heartbeat:iSCSILogicalUnit \
    params target_iqn="iqn.2011-11.begetest.net" lun="2" path="/dev/drbd1"
primitive iscsi-target ocf:heartbeat:iSCSITarget \
    params iqn="iqn.2011-11.begetest.net" tid="1" \
    op monitor interval="30s" \
    meta target-role="Started"
ms iscsi-meta iscsi-drbd \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
location iscsi-prefer-node-1 iscsi-meta 100: storage1
colocation ip-with-iscsimeta inf: iscsi-ip iscsi-meta:Master
colocation lun-with-meta inf: iscsi-lun iscsi-meta:Master
colocation target-with-meta inf: iscsi-target iscsi-meta:Master
order ip-after-iscsimeta inf: iscsi-meta:promote iscsi-ip:start
order lun-after-target inf: iscsi-target:start iscsi-lun:start
order target-after-meta inf: iscsi-meta:promote iscsi-target:start
property $id="cib-bootstrap-options" \
    dc-version="1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    stonith-enabled="false" \
    no-quorum-policy="ignore"