This is a mini cookbook designed to help users get DRBD up and running in a two-node cluster with GFS on top. Maintained by LonHohberger
So, you want to try playing with GFS, but you do not have a SAN. GFS, as you know, is not a distributed file system. Rather, it is a shared disk cluster file system. This means that in order to use it on two or more computers, you must have a disk shared between them.... or, do you?
DRBD is a RAID-1 style block device which synchronizes over the network between two computers. In essence, it provides a virtual shared storage between two computers. As of 0.8, DRBD can be used with GFS (and other shared disk cluster file systems) due to the addition of concurrent writer support.
Ok, so, you want GFS-on-DRBD. Here's how, sort of...
Before you start
- This was written for RHEL5/CentOS5. Input from other distribution users is appreciated.
- You will need gfs-utils or gfs2-utils. On RHEL5, get both (since gfs-utils requires the latter).
- You may *not* use DRBD as a quorum disk in a 2-node cluster.
- DRBD in active/active mode works only in two-node clusters (at least, the free version...)
- As of this writing, Red Hat does not ship nor support DRBD.
- Performance is unlikely to be very good in this configuration.
Do not try this on shared storage. DRBD is for use on storage which is not shared.
Basic CMAN Configuration
For DRBD to work in its simplest form on Linux-Cluster, you will need a valid, two-node cluster configuration. This includes using expected_votes="1" and two_node="1" in the <cman> tag of cluster.conf and, more importantly, fencing. Even though DRBD may not require fencing in all circumstances, GFS does. Here is an example cluster configuration (/etc/cluster/cluster.conf) for a 2-node virtual cluster using XVM fencing:
<?xml version="1.0"?>
<cluster alias="lolcats" config_version="41" name="lolcats">
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="frederick" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device domain="frederick" name="xvm"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="molly" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device domain="molly" name="xvm"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_xvm" name="xvm"/>
  </fencedevices>
  <rm/>
</cluster>
How to configure linux-cluster / CMAN is beyond the scope of this document.
DRBD Configuration
Fabio C. gave me his configuration file (/etc/drbd.conf), which I tweaked for my Xen cluster. You will need to adapt the configuration below to your environment.
global {
    usage-count yes;
}

common {
    syncer {
        rate 100M;
    }
}

resource the-disk {
    protocol C;

    startup {
        wfc-timeout 20;
        degr-wfc-timeout 10;
        # become-primary-on both;   # Enable this *after* initial testing
    }

    net {
        cram-hmac-alg sha1;
        shared-secret "happy2008everybody";
        allow-two-primaries;
    }

    on molly {
        device /dev/drbd1;
        disk /dev/xvdd;
        address 10.12.32.98:7789;
        meta-disk internal;
    }

    on frederick {
        device /dev/drbd1;
        disk /dev/xvdd;
        address 10.12.32.99:7789;
        meta-disk internal;
    }

    disk {
        fencing resource-and-stonith;
    }

    handlers {
        outdate-peer "/sbin/obliterate";   # We'll get back to this.
    }
}
The obliterate script is available here; it was last updated on 6-Dec-2007. This script calls forth CMAN's fencing to smite the other node in a 2-node cluster. Once the obliterate script terminates (successfully, of course), DRBD will recover. There are some things to be aware of with the initial implementation (a rough sketch of this kind of handler follows the list below):
- obliterate only tries fencing once, instead of retrying until it succeeds the way CMAN's fence daemon does.
- The dead node will get fenced twice. A more correct method of killing the node is noted in the script; mostly, I just wanted to get a PoC out there. Fixing it is not difficult.
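For reference, here is a rough, hypothetical sketch of what such a fence-peer handler might look like; the real obliterate script differs in its details, and the peer-discovery logic below is my own assumption rather than a copy of it. DRBD treats an exit code of 7 from the handler as "the peer has been fenced" (see the fencing callout return codes in the Resources section).

#!/bin/bash
# Hypothetical fence-peer handler sketch -- NOT the actual obliterate
# script.  Fence whichever cluster node is not the local one, then
# report back to DRBD.  Assumes cman_tool and fence_node are in the PATH.

ME=$(cman_tool status | awk '/^Node name:/ {print $3}')
PEER=$(cman_tool nodes | awk -v me="$ME" 'NR > 1 && $NF != me {print $NF; exit}')

[ -n "$PEER" ] || exit 1

# Ask CMAN's fencing machinery to power-cycle the peer.
if fence_node "$PEER"; then
    # Exit code 7 tells DRBD the peer has been stonith'd,
    # so it is safe to resume I/O.
    exit 7
fi
exit 1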
Ok, back to the configuring! Create the metadata on both nodes, then set the generation identifiers so that DRBD considers the device consistent (this skips the initial full sync; see the Resources section):
[root@molly ~]# drbdadm create-md the-disk
v08 Magic number not found
v07 Magic number not found
About to create a new drbd meta data block on /dev/xvdd.

 ==> This might destroy existing data! <==

Do you want to proceed?
[need to type 'yes' to confirm] yes

Creating meta data...
initialising activity log
NOT initialized bitmap (32 KB)
New drbd meta data block sucessfully created.
success
[root@molly ~]# drbdadm -- 6::::1 set-gi the-disk
previously 0000000000000004:0000000000000000:0000000000000000:0000000000000000:0:0:0:0:0:0
set GI to  0000000000000006:0000000000000000:0000000000000000:0000000000000000:1:0:0:0:0:0

Write new GI to disk?
[need to type 'yes' to confirm] yes
Start up DRBD on both nodes (as close to the same time as you can):
[root@molly ~]# service drbd start
Starting DRBD resources:    [ d0 s0 n0 ].
Check the state on both nodes using the following command. It should say "Secondary/Secondary".
[root@molly ~]# drbdadm state all
Secondary/Secondary
Promote both nodes to primary mode.
[root@molly ~]# drbdadm primary all
Check the state again - this time, it should say "Primary/Primary".
[root@molly ~]# drbdadm state all
Primary/Primary
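You can also check /proc/drbd on either node. With both nodes promoted, the resource line should show cs:Connected, st:Primary/Primary and ds:UpToDate/UpToDate. The line below is abbreviated and only illustrative; the exact format varies between DRBD releases:

[root@molly ~]# cat /proc/drbd
 1: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate C r---
 (remaining counters omitted)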
One thing that is important to note: when using GFS with DRBD, if you want GFS file systems to be mounted on system startup, you must make the drbd init script's 'start' operation occur after the cman init script is called and before the gfs init script is called. Edit /etc/init.d/drbd and change the chkconfig line (line 3 in 0.8.2.1) to:
chkconfig: 345 22 75
You may now enable DRBD on startup by running the following command on both nodes:
chkconfig --level 345 drbd on
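To double-check the ordering, you can list the rc symlinks on both nodes; drbd's start priority should fall between cman's and gfs's. The numbers in the comment below are only illustrative and may differ between releases:

ls /etc/rc3.d/ | grep -E 'cman|drbd|gfs'
# Expect drbd's S-number between cman's and gfs's, for example:
#   S21cman  S22drbd  S26gfs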
Important: In order for DRBD to automatically work in active/active mode, you must uncomment the become-primary-on line in /etc/drbd.conf on both nodes. Failure to do this will cause each cluster node to start in the 'Secondary' state - blocking access to GFS volumes.
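With that line uncommented, the startup section of the example /etc/drbd.conf above reads:

startup {
    wfc-timeout 20;
    degr-wfc-timeout 10;
    become-primary-on both;
}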
Making a GFS volume
Now comes the fun part! First of all, we need to make a mount point on both nodes:
mkdir /mnt/drbdtest
Next, we need to create a file system. The gfs_mkfs command takes a few more parameters than mkfs for traditional file systems (like ext3): the locking protocol (usually 'lock_dlm'), a lock table name, and the number of journals. With DRBD 0.8, the number of journals will always be '2', because DRBD only allows two concurrent writers at a time. The only complicated one is the lock table, which takes the form <clustername>:<file_system_name>. My cluster is named 'lolcats' and I decided to call my file system 'drbdtest'. Here is what the output of gfs_mkfs looks like:
[root@molly ~]# gfs_mkfs -p lock_dlm -t lolcats:drbdtest /dev/drbd1 -j 2
This will destroy any data on /dev/drbd1.

Are you sure you want to proceed? [y/n] y

Device:                    /dev/drbd1
Blocksize:                 4096
Filesystem Size:           190420
Journals:                  2
Resource Groups:           8
Locking Protocol:          lock_dlm
Lock Table:                lolcats:drbdtest

Syncing...
All Done
Once this is done, you can mount it on both nodes:
mount -t gfs /dev/drbd1 /mnt/drbdtest
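If you want the gfs init script to mount this file system at boot (see the init script ordering note above), an /etc/fstab entry along the following lines should do it; treat the options as a starting point rather than a recommendation:

/dev/drbd1    /mnt/drbdtest    gfs    defaults    0 0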
You should be able to see it in the output of the mount command at this point. More important, however, is what CMAN has to say about it:
[root@molly ~]# cman_tool services
type             level name       id       state
fence            0     default    00010001 none
[1 2]
dlm              1     drbdtest   00040002 none
[1 2]
gfs              2     drbdtest   00030002 none
[1 2]
If you've gotten this far, you can now do crazy things like create files in /mnt/drbdtest and see their contents from the other node!
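For example, using the hostnames from the sample configuration:

[root@molly ~]# echo "hello from molly" > /mnt/drbdtest/hello.txt
[root@frederick ~]# cat /mnt/drbdtest/hello.txt
hello from molly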
Making a GFS2 volume
You may also run GFS2 on top of DRBD. This is quite similar to creating a GFS volume, as noted above. The primary difference is the creation program:
[root@molly ~]# mkfs.gfs2 -p lock_dlm -t lolcats:drbdtest /dev/drbd1 -j 2
This will destroy any data on /dev/drbd1.

Are you sure you want to proceed? [y/n] y

Device:                    /dev/drbd1
Blocksize:                 4096
Device Size                1.00 GB (262127 blocks)
Filesystem Size:           1.00 GB (262125 blocks)
Journals:                  2
Resource Groups:           4
Locking Protocol:          "lock_dlm"
Lock Table:                "lolcats:drbdtest"
Once this is done, you can mount it on both nodes:
mount -t gfs2 /dev/drbd1 /mnt/drbdtest
That's it! Try the same status commands as you did with GFS.
Resources
http://www.drbd.org/fileadmin/drbd/doc/8.0.2/en/drbd.conf.html - DRBD config file documentation
http://osdir.com/ml/linux.kernel.drbd.devel/2006-11/msg00005.html - Documentation for return codes of the DRBD fencing callout
http://drbd-plus.linbit.com/examples:skip-initial-sync - Where I got the tweak to skip the initial sync.
Thanks to fabioc on freenode for his efforts.