[File][Cluster] DRBD HowTo Documents
DRBD HOWTO
1. Introduction
1.1. What is DRBD?
DRBD is a kernel module and associated scripts that provide a block device designed for building high availability clusters. This is done by mirroring a whole block device over a (dedicated) network. You can think of it as network RAID.
1.2. What is the scope of drbd, and what else do I need to build HA clusters?
Drbd takes the data, writes it to the local disk, and sends it to the other host, which then writes it to its own disk.
The other components needed are a cluster membership service, which is supposed to be heartbeat, and some kind of application that works on top of a block device.
Examples:
• A filesystem & fsck
• A journaling FS.
• A database with recovery capabilities.
1.3. How does it work?
Each device (drbd provides more than one of these devices) has a state, which can be ‘primary’ or ‘secondary’. On the node with the primary device the application is supposed to run and to access the device (/dev/nbX). Every write is sent to the local ‘lower level block device’ and to the node with the device in ‘secondary’ state. The secondary device simply writes the data to its lower level block device. Reads are always carried out locally.
If the primary node fails, heartbeat switches the secondary device into primary state and starts the application there. (If you are using it with a non-journaling FS this involves running fsck.)
If the failed node comes up again, it is a new secondary node and has to synchronise its content with the primary. This, of course, happens in the background without interruption of service.
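For example, you can check the current state of each device at any time by reading /proc/drbd on either node; the st: field shows the states of the local and remote devices (e.g. Primary/Secondary), as illustrated later in the heartbeat section of this HOWTO.
$ cat /proc/drbd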
1.4. How is drbd related to current HA clusters?
To my knowledge, most current HA clusters (HP, Compaq, …) use shared storage devices, i.e. the storage devices are connected to more than one node (this can be done with shared SCSI buses or Fibre Channel).
Drbd gives you about the same semantics as a shared device, but it does not need any uncommon hardware. It runs on top of IP networks, which in my impression are less expensive than special storage networks.
Currently drbd grants read-write access only to one node at a time, which is sufficient for the usual fail-over HA cluster. Although it is currently not on my task list, it would not be a great effort to allow both nodes read-write access. This would be useful with GFS for example.
2. Installation
2.1. Download the software
The latest stable version is 0.5.8, and it is available for download at http://www.complang.tuwien.ac.at/reisner/drbd/download/
2.2. Compile the package
Compilation should be pretty straightforward. Simply do a
$ make
$ make install
2.3. Test loading the drbd module
If everything built and installed correctly, you should be able to test loading the module
$ /sbin/insmod drbd
If everything is in working order, you should see no error messages. Verify with lsmod that the module is loaded
$ /sbin/lsmod
If you see drbd, all is looking good. Go ahead and rmmod and move to the sample configuration section.
$ /sbin/rmmod drbd
2.4. 2.4.x Kernel Considerations
DRBD version 0.5.8 only works with the 2.2.x kernel series. For the 2.4.x kernels, you need to download the latest version from CVS.
2.5. Unresolved symbols
When loading the module, you see about 50 messages like these:
drbd.o: unresolved symbol sock_alloc
drbd.o: unresolved symbol proc_register
drbd.o: unresolved symbol schedule_timeout
…
Your kernel was built with CONFIG_MODVERSIONS and the DRBD module was built without MODVERSIONS, or vice versa.
There are two ways to solve this (a quick sketch follows the list):
• Use a system without MODVERSIONS: Change your kernel configuration and drop the CONFIG_MODVERSIONS option. (See ~linux/.config.) Rebuild the kernel.
• Use a system with MODVERSIONS: Edit ~drbd/Makefile.vars and add -DMODVERSIONS -DCONFIG_MODVERSIONS to KERNFLAGS. Rebuild DRBD.
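As a quick sketch of checking which case applies (the kernel source path is an assumption; yours may differ from /usr/src/linux):
$ grep CONFIG_MODVERSIONS /usr/src/linux/.config
If this prints CONFIG_MODVERSIONS=y, either rebuild your kernel without that option, or edit Makefile.vars in the drbd source tree as described above and then rebuild drbd:
$ make
$ make install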
3. Using drbdsetup
3.1. drbdsetup
drbdsetup is the low level configuration tool of the drbd program suite. You can use it to associate drbd devices with lower level block devices, set up drbd device pairs to mirror their low level block devices, and inspect configurations of running drbd devices.
3.2. Example of using drbdsetup
Let’s assume that your two machines are named node1 (10.0.0.10) and node2 (10.0.0.20), and you want to use /dev/hdc6 as the lower level block device on both machines. On node2, you would issue the following commands:
$ insmod drbd.o
$ drbdsetup /dev/nb0 disk /dev/hdc6
$ drbdsetup /dev/nb0 net 10.0.0.20 10.0.0.10 B
On node1 you would issue the following commands:
$ insmod drbd.o
$ drbdsetup /dev/nb0 disk /dev/hdc6
$ drbdsetup /dev/nb0 net 10.0.0.10 10.0.0.20 B
$ drbdsetup /dev/nb0 primary
At this point, you can use /dev/nb0 just like any other device
$ mkfs -b 4096 /dev/nb0
$ mount /dev/nb0 /mnt/mountpoint
In the example above, the “B” protocol is used. drbd allows you to select the protocol which controls how data is written to the secondary device.
Table 1. DRBD Protocols
Protocol   Description
A          A write operation is complete as soon as the data is written to disk and sent to the network.
B          A write operation is complete as soon as a reception acknowledgement arrives.
C          A write operation is complete as soon as a write acknowledgement arrives.
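For example, to use protocol C instead in the drbdsetup example above, you would simply replace the trailing B in the net command (shown here as issued on node1):
$ drbdsetup /dev/nb0 net 10.0.0.10 10.0.0.20 C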
There are also additional parameters you can pass to the disk and net options. See the drbdsetup man page for additional information.
4. drbd.conf Sample Configuration
4.1. drbd.conf
In the previous section, we went over using drbdsetup. drbd also allows you to set up everything in a drbd.conf file. By correctly configuring this file and using the init.d/drbd script you can easily make drbd come up correctly when a machine boots up.
4.2. Sample drbd.conf setup
In this configuration, we have 2 machines named thost1 and thost2. The IP address for thost1 is 10.1.1.31, and the IP address for thost2 is 10.1.1.32. We want to create a mirror between /dev/hda7 on thost1, and /dev/hda7 on machine thost2. Here is a sample /etc/drbd.conf file to accomplish this:
resource drbd0 {
  protocol=B
  fsck-cmd=fsck.ext2 -p -y
  on thost1 {
    device=/dev/nb0
    disk=/dev/hda7
    address=10.1.1.31
    port=7789
  }
  on thost2 {
    device=/dev/nb0
    disk=/dev/hda7
    address=10.1.1.32
    port=7789
  }
}
After you create the drbd.conf file, go to thost1, and run the following command:
$ /etc/rc.d/init.d/drbd start
Do the same thing on thost2
$ /etc/rc.d/init.d/drbd start
At this point, you should have a mirror between the 2 devices. You can verify this by looking at /proc/drbd
$ cat /proc/drbd
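You should see the device in cs:Connected state. The line below is an illustrative sample (borrowed from the heartbeat example later in this HOWTO); your state and counters will differ:
version : 58
0: cs:Connected st:Primary/Secondary ns:208 nr:36 dw:88 dr:373 gc:5,25,13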
Now you can make a filesystem on the device, and mount it on machine thost1.
$ mkfs /dev/nb0
$ mount /dev/nb0 /mnt/disk
At this point, you have now created a mirror using drbd. Congrats. To move ahead with creating a highly available failover system, look into the scripts subdirectory and integrate with the heartbeat software available at linux-ha.org.
5. Integrating with the heartbeat package
5.1. What is heartbeat?
The Linux-HA heartbeat code is used for building highly available failover clusters. It can do 2 node IP takeover for an unlimited number of IP interfaces. It works by sending a “heartbeat” between 2 machines either over a serial cable, ethernet, or both. If the heartbeat fails, the secondary machine will assume the primary machine has failed, and take over services that were running on a primary machine. For more information on the heartbeat software, you can visit linux-ha.org.
5.2. drbd heartbeat scripts
drbd comes with 2 scripts that make it very easy to integrate with the heartbeat package. The first script is drbd, and it installs to /etc/rc.d/init.d/drbd. It is intended to be used to start up drbd services upon boot. The second script is datadisk, and it installs into /etc/ha.d/resource.d/datadisk. The datadisk script handles switching a drbd device from secondary to primary state, and it is called from the /etc/ha.d/haresources file.
5.3. Sample heartbeat integration – web server
This example will build upon the previous example from the drbd.conf section. In that example, we have 2 machines named thost1 (10.1.1.31) and thost2 (10.1.1.32). In this example, we will set up a web server which has its html pages stored on a shared drbd device. Eventually, you'll want all of these steps to happen automatically, but for this HOWTO I'll just go over the steps to do things manually.
The first step is to correctly configure the heartbeat package. I'm not going to go into setting up the heartbeat package; I assume that if you have gotten this far, you already have heartbeat running. Assuming heartbeat is running and your drbd.conf file is correctly configured, the first thing you need to do is start up drbd on both nodes. On thost1, issue the following commands:
$ insmod drbd
$ /etc/rc.d/init.d/drbd start
Do the same thing on thost2:
$ insmod drbd
$ /etc/rc.d/init.d/drbd start
At this point drbd should be running. Make sure the drbd device is in primary state on thost1, and then go ahead and mount the disk. If you haven't done so already, create a file system on the device. Lastly, configure the web server on both machines so that its document root points to the location where you will mount the drbd device. (A sketch of these steps follows the /proc/drbd output below.)
[root@10-0-1-31 ha.d]# cat /proc/drbd
version : 58
0: cs:Connected st:Primary/Secondary ns:208 nr:36 dw:88 dr:373 gc:5,25,13
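A minimal sketch of those steps on thost1, assuming the device is mounted at /mnt/disk (as elsewhere in this HOWTO), that the html pages live in a subdirectory called html (a name chosen here for illustration), and that you run Apache (the location of httpd.conf varies by distribution):
$ mkfs /dev/nb0
$ mkdir /mnt/disk
$ mount /dev/nb0 /mnt/disk
$ mkdir /mnt/disk/html
Then, in httpd.conf on BOTH machines, point the document root at the shared device:
DocumentRoot /mnt/disk/html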
Assuming the /etc/ha.d/ha.cf file correctly identifies both nodes in your cluster, the next step is to edit the /etc/ha.d/haresources file. Add the following line to that file
10-0-1-31.linux-ha.org 10.0.10.10/16 datadisk::drbd0 httpd
The 10.0.10.10 address is the IP address attached to the web server. See the heartbeat documentation for more info on that. Basically, that line tells the heartbeat software to run the /etc/ha.d/resource.d/datadisk script with a parameter of drbd0. In the event of a failure, the datadisk script will run on the secondary node, and switch the drbd0 drbd device from secondary to primary state.
You also need to edit the /etc/fstab file on both machines. It is important to configure this device to NOT mount upon boot. The noauto flag will take care of that.
/dev/nb0 /mnt/disk ext2 noauto 0 0
At this point, you can start up the heartbeat software.
/etc/rc.d/init.d/heartbeat start
If everything is configured correctly, the heartbeat software will launch the web server and tie it to the virtual interface. At this point, you can power off thost1. If all goes well, in about 30 seconds, thost2 will take over and launch the web server. If that test works, go ahead and set up drbd and heartbeat to start upon boot, and sit back and enjoy the high availability!
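On a Red Hat style system, one way to arrange the boot-time startup (a sketch only; this assumes the init scripts carry chkconfig information, otherwise create the rc?.d symlinks by hand or use your distribution's own mechanism) is to run the following on both machines:
$ chkconfig --add drbd
$ chkconfig --add heartbeat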
5.4. Sample heartbeat integration – NFS server
This short example will demonstrate some of the things you need to take into account when failing over services which require state information to be kept. We will set up an NFS server which exports a filesystem stored on a shared drbd device. Again, we'll just consider the steps to do things manually.
We’re going to assume that you’ve set up the heartbeat package just as above, and just go over the changes required for setting up a failover NFS service instead of the web service. So, just as before, on thost1 issue the following commands:
$ insmod drbd
$ /etc/rc.d/init.d/drbd start
Do the same thing on thost2:
$ insmod drbd
$ /etc/rc.d/init.d/drbd start
At this point drbd should be running. Make sure the drbd device is in primary state on thost1. We'll assume that you will be mounting the drbd device at /mnt/disk. At least on Red Hat and SuSE based systems, the state information for the NFS server is in /var/lib/nfs, and we will want this state information to be found on the shared device. So, on thost1, with the NFS server stopped, do:
$ mkdir /mnt/disk
$ mount /dev/nb0 /mnt/disk
$ mkdir /mnt/disk/var; mkdir /mnt/disk/var/lib
$ mv /var/lib/nfs /mnt/disk/var/lib
$ ln -s /mnt/disk/var/lib/nfs /var/lib/nfs
On thost2, ensure the NFS server is not running, and execute the following:
$ mkdir /mnt/disk
$ rm -r /var/lib/nfs
$ ln -s /mnt/disk/var/lib/nfs /var/lib/nfs
The last thing is to set up the part of the filesystem we will actually export over the NFS mount. This will be the hierarchy below /mnt/disk/export. Assume that the NFS client has IP address 10.0.20.1. So, on thost1:
$ mkdir /mnt/disk/export
$ echo "/mnt/disk/export 10.0.20.1(rw)" >> /etc/exports
On thost2:
$ echo "/mnt/disk/export 10.0.20.1(rw)" >> /etc/exports
Assuming the /etc/ha.d/ha.cf file correctly identifies both nodes in your cluster, the next step is to edit the /etc/ha.d/haresources file. Add the following line to that file on BOTH machines
10-0-1-31.linux-ha.org 10.0.10.10/16 datadisk::drbd0 nfsserver
You need to make sure that the nfsserver script can be found by the heartbeat package; see the heartbeat documentation for more details.
The 10.0.10.10 address is the IP address attached to the NFS server. See the heartbeat documentation for more info on that. Basically, that line tells the heartbeat software to run the /etc/ha.d/resource.d/datadisk script with a parameter of drbd0. In the event of a failure, the datadisk script will run on the secondary node, and switch the drbd0 drbd device from secondary to primary state.
You also need to edit the /etc/fstab file on both machines. It is important to configure this device to NOT mount upon boot. The noauto flag will take care of that.
/dev/nb0 /mnt/disk ext2 noauto 0 0
At this point, you can start up the heartbeat software.
/etc/rc.d/init.d/heartbeat start
If everything is configured correctly, the heartbeat software will launch the NFS server and tie it to the virtual interface. At this point, you can power off thost1. If all goes well, in about 30 seconds, thost2 will take over and launch the NFS server. Test it by mounting the NFS export from 10.0.20.1 and performing some file operation which will take more than 30 seconds; it should pause and then magically start going again, oblivious to the fact that it has just swapped servers. Enjoy your high availability!
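A sketch of such a test, run on the client (10.0.20.1), assuming the service IP 10.0.10.10 from the haresources line above and a scratch mount point chosen here for illustration:
$ mkdir /mnt/nfstest
$ mount -t nfs 10.0.10.10:/mnt/disk/export /mnt/nfstest
$ dd if=/dev/zero of=/mnt/nfstest/bigfile bs=1024k count=1000
Power off thost1 while the dd is running; it should stall during the failover and then carry on.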
6. Timeouts
6.1. How do timeouts work?
The primary node expects a reaction to some packet within a timeframe (this timeframe is adjustable by the --timeout option of drbdsetup). In case the timeout is not met by the other node, the primary cuts the connection and tries to re-establish a connection.
6.2. Why is a small timeout good?
In case the other node dies, your primary node will sit there and block all applications which are writing to the DRBD device. Basically, it takes the time of the timeout until it decides that the other node is dead. Thus your applications may be blocked for this time.
6.3. Why do small timeouts lead to timeout/resync/connect?
This is caused when the IO subsystem of the secondary node is slow.
6.4. What are “postpone packets”?
To improve the situation I had the idea of “postpone that deadline” packets. These are sent by the secondary node as soon as it realizes that it will miss the timeout.
6.5. What should you do if you see this timeout/resync/connect pattern?
Increase the timeout. (Since connect-int and ping-int need to be greater than the timeout, increase them as well.)
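As a rough sketch only (the option names are taken from the text above, but their exact spelling, placement, and units are my assumption; the values are placeholders chosen simply so that the intervals exceed the timeout, so check the drbdsetup man page before using them), the idea is to raise the timeout when setting up the net parameters and keep connect-int and ping-int above it:
$ drbdsetup /dev/nb0 net 10.0.0.10 10.0.0.20 B --timeout 60 --connect-int 70 --ping-int 70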
7. Miscellaneous
7.1. tl-size
Need to write a paragraph on what to do if there is a message “transfer log too small” in the syslog.