Guide to Optimizing GlusterFS

Block Device Tuning

One problem which can manifest itself in older kernels (2.6.9, for example the stock kernel in Scientific Linux 4) is the number of I/O requests that are buffered before they are communicated to the disk. In 2.6.9 the default value for every block device is 8192; in newer kernels it is 128. The change in default value is substantial. The trouble with the big value is that if the requests are quite large, buffering a huge number of them can easily make you run out of memory and cause all kinds of problems.

You can check your current settings easily by just looking into the following file:

/sys/block/<DEV>/queue/nr_requests
so for example:

$ cat /sys/block/sda/queue/nr_requests
To set a new value, you just echo it into the same file. For example, the 3ware-recommended value for nr_requests on their 9550SX series controllers is 512.

Bear in mind that this setting only lasts until reboot, so you may want to add it to /etc/rc.d/rc.local, for example.
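A minimal sketch of both steps (sda and 512 are just the example values from above; adjust them for your own devices):

$ echo 512 > /sys/block/sda/queue/nr_requests
$ grep nr_requests /etc/rc.d/rc.local
  echo 512 > /sys/block/sda/queue/nr_requests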

Readahead

Another quirk of 3ware controllers is that if you run with the default block device settings, their read performance is terrible. The defaults will produce a read speed on the order of 40 MB/s for a single 10 GB dd from disk to /dev/null.

3ware itself recommends setting the readahead value to 16384 (the default is 256), which indeed increases the speed of at least a streamed copy to above 400 MB/s. This probably has some memory cost in the case of more parallel streams and smaller files, but where transfers are mostly sequential per stream with bigger files, setting the readahead to a higher value is indeed useful.

To see the current readahead setting of a block device use the blockdev command:

$ blockdev --getra /dev/sdX
for example:

$ blockdev --getra /dev/sda
  256
Setting the readahead to a new value also happens with the same command:

$ blockdev --setra 16384 /dev/sda

Swappiness Configuration

A common practice is to keep memory usage on GlusterFS servers as low as possible. However, sometimes that is not possible.

Consider the following scenario:

  You run a GlusterFS server on a machine with 32 GB of RAM and you constantly
  use about 28 GB for other processes (which perform almost no I/O at all).
  Whenever I/O is directed to this server, it reads files from the underlying
  file system straight into the OS cache (assuming direct I/O is disabled).
  But since memory usage is already high, the server starts swapping process
  data. This means that every time I/O is issued to this server (reads or
  writes), swapping occurs. This overhead actually slows down our I/O, yet it
  is also caused by the I/O itself.
Solution:

a. Slightly reduce the memory load on the server and verify sane behavior of the other processes on it.

b. Configure the swappiness kernel parameter to be very low (say 5, instead of the default 60). This will prevent swapping due to I/O until memory is almost totally consumed (which should not happen frequently once the memory load has been reduced).

$ sysctl -w vm.swappiness=5
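As with the other sysctl settings in this guide, the value can be made persistent by adding it to /etc/sysctl.conf and reapplying the file (5 is just the example value from above):

$ grep swappiness /etc/sysctl.conf
  vm.swappiness = 5
$ sysctl -p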

Virtual Memory Tuning

In the latest 2.6 kernels a few settings seem to have changed with regard to how virtual memory management is performed. Let's take a quick look at a few of them.

Dirty pages cleanup

There are two important settings which control the kernel behaviour with regard to dirty pages in memory. They are:

vm.dirty_background_ratio
vm.dirty_ratio
The first of the two (vm.dirty_background_ratio) defines the percentage of memory that can become dirty before a background flushing of the pages to disk starts. Until this percentage is reached no pages are flushed to disk. However when the flushing starts, then it’s done in the background without disrupting any of the running processes in the foreground.

The second of the two parameters (vm.dirty_ratio) defines the percentage of memory which can be occupied by dirty pages before a forced flush starts. If the percentage of dirty pages reaches this number, all processes become synchronous: they are not allowed to continue until the I/O operation they have requested is actually performed and the data is on disk. On high-performance I/O machines this causes a problem, as data caching is cut away and all processes doing I/O block waiting for it. This causes a large number of hanging processes, which leads to high load, which leads to an unstable system and poor performance.

In Red Hat kernels (e.g. 2.6.9-smp) the defaults for these settings are a background ratio of 10% and a synchronous ratio of 40%. With 2.6.20+ kernels, however, the defaults are 5% and 10% respectively. It is not hard to reach that 10% level and block your system; this is exactly what you will face when trying to understand why a system is performing poorly and sitting under high load while doing almost nothing. There are a few parameters to watch which show what the system is doing. The two values to monitor are in the /proc/vmstat file:

$ grep -A 1 dirty /proc/vmstat
  nr_dirty 30931
  nr_writeback 0
If you monitor these values in /proc/vmstat you will notice that before the system reaches the vm.dirty_ratio barrier, nr_dirty is a lot higher than nr_writeback; usually nr_writeback stays close to 0 or occasionally flicks higher and then calms down again. Once you do reach the vm.dirty_ratio barrier, nr_writeback starts to climb fast and stays higher than ever before without dropping back, at least not easily if dirty_ratio is set too low.

You can set both variables by appending them to the end of your /etc/sysctl.conf file:

$ grep dirty /etc/sysctl.conf
  vm.dirty_background_ratio = 3
  vm.dirty_ratio = 40
and then executing:

$ sysctl -p
To see your current dirty-ratio settings, do the following:

$ sysctl -a | grep dirty
  vm.dirty_ratio = 40
  vm.dirty_background_ratio = 3

VM overcommit

Another important virtual memory management setting is the behavior of memory overcommitting. In a number of cases, systems under high load can hit a point where they run out of real memory, and since Linux by default quite freely hands out more memory than it really has, once the real limit is reached processes start dying with “Out of memory” errors, which at least once also caused one of our systems to crash.

Basically, when a process asks for memory, Linux readily grants it even if it doesn't have that much memory available. It simply assumes that because processes usually ask for more memory than they really need, it will not actually run out. There is also a setting in the Linux kernel which limits the overcommit to a certain percentage of the total memory (vm.overcommit_ratio), and by default it is set to 50%. So when Linux is handing out memory to processes it assumes it actually has 150% of the memory it really has, and in most cases this is not a problem, as many applications are greedy and ask for more than they really need.

However, for a high-throughput machine there is a real likelihood of running out of memory with real allocations, and hence you might need to stop the Linux kernel from handing out too much memory by setting vm.overcommit_memory to 2 (the default is 0), which disables the overcommit feature entirely.

Again, just add the setting with its value at the end of /etc/sysctl.conf and let sysctl apply it from the config file by running:

$ sysctl -p
The settings in /etc/sysctl.conf are reapplied at every boot, so these settings remain persistent.
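For example, the relevant /etc/sysctl.conf entries could look like the sketch below (vm.overcommit_ratio is only consulted once vm.overcommit_memory is set to 2; 50 is simply the kernel default, shown here for illustration):

$ grep overcommit /etc/sysctl.conf
  vm.overcommit_memory = 2
  vm.overcommit_ratio = 50
$ sysctl -p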

Enabling direct-io mode

One can tune GlusterFS at mount time to work with bigger block sizes when doing operations such as cp -a or tar -x.

Do the following:

# glusterfs --direct-io-mode=write-only -f <spec-file> <mount-point>
NOTE: For NFS re-export to work, one needs to nullify the effect of direct-io. Hence:

# glusterfs --direct-io-mode=none -f <spec-file> <mount-point>

Tuning FUSE kernel module
Listed below are the high-level changes made to the FUSE kernel module by the Gluster team to achieve higher I/O throughput.

Change inode blocksize from PAGE_CACHE_SIZE to 1048576
Change d->blksize from 512 to 1048576
Change fc->bdi.ra_pages from (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE to 256
Change sb->s_blocksize from PAGE_CACHE_SIZE to 1048576
Change sb->s_blocksize_bits from PAGE_CACHE_SHIFT to 20
Download the patched FUSE release with the above tuning applied from GlusterFS Tuned FUSE Source
Download the patch itself against the upstream FUSE release from GlusterFS Tuned FUSE Patch

I/O Scheduler Tuning
The “deadline” I/O scheduler seems to be the best bet under the Linux kernel.

The deadline scheduler implements request merging, a one-way elevator, and imposes
a deadline on all operations to prevent resource starvation. Because writes return
instantly within Linux, with the actual data being held in cache, the deadline
scheduler will also prefer readers, as long as the deadline for a write request
hasn't passed. The kernel docs suggest this is the preferred scheduler for database
systems, especially if you have TCQ-aware disks, or any system with high disk
performance.
To enable this option, add elevator=deadline as a kernel parameter by editing the /boot/grub/menu.lst file, then reboot the machine.
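The scheduler can also be switched per block device at runtime through sysfs (a quick sketch; sda stands in for the device backing your bricks, and the change does not survive a reboot):

$ cat /sys/block/sda/queue/scheduler
  noop anticipatory deadline [cfq]
$ echo deadline > /sys/block/sda/queue/scheduler
$ cat /sys/block/sda/queue/scheduler
  noop anticipatory [deadline] cfq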

For more fine-grained tunables, please read /usr/src/linux-`uname -r`/Documentation/block/deadline-iosched.txt.

File System Tuning
On all Linux filesystems it is a good idea to use the noatime mount option. This makes the filesystems faster, as they will not lose time updating the access time of every file and directory they touch.
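A minimal sketch of enabling noatime both at runtime and persistently in /etc/fstab (the device and the /data/brick1 mount point are placeholders for wherever your GlusterFS bricks live):

$ mount -o remount,noatime /dev/<hda> /data/brick1
$ grep brick1 /etc/fstab
  /dev/<hda>  /data/brick1  ext3  defaults,noatime  0  2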

NOTE: These commands will format your disk; be sure to have a complete backup of your data to prevent data loss.

If you are using XFS, try this command: mkfs.xfs -i attr=2 /dev/<hda>

If you are using ext3: mkfs.ext3 -I 256 /dev/<hda>

This will ensure that extended attributes are put into the inode structure itself. If this is not done, an extra block is allocated for EAs, which takes more time. This dramatically improved file-creation performance for me (4- to 5-fold). ReiserFS was fast by default, so it must be using in-inode EAs by default.

Most file systems come with a tuning utility; for ext3 there is tune2fs. Several parameters can be modified, but perhaps the most useful ones here are how much space should be reserved and who should be able to take advantage of it, which can help you get more usable space out of your drives, possibly at the cost of less room for repairing the system should it crash.
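As a hedged illustration, the reserved-block percentage and the user allowed to consume it can be adjusted with tune2fs (1% and root are just example values; /dev/<hda> is the placeholder used above):

$ tune2fs -m 1 /dev/<hda>
$ tune2fs -u root /dev/<hda>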

EXT3

The ext3 filesystem is one of the most stable filesystems Linux provides. Compared with ext2, where writes are scheduled every 30 seconds, ext3 schedules them every 5 seconds. A standard ext3 installation uses default elevator settings of 4096/8192 for write/read (this is common in most GNU/Linux distributions). Most distributions provide the elvtune utility, with which you can set the latency (elevator settings) to still lower values:

$ /sbin/elvtune -r 1024 -w 2048 /dev/sdd (changes elevator settings on all the devices under sdd)
The recommendation, however, is to test various values against the requirements of the applications you are going to benchmark. When you are done and have found the optimal values, you can add the same command to /etc/rc.d/rc.local (the location may differ on other distributions) so the settings are the same on every boot. For more detailed information please refer to /usr/src/linux-`uname -r`/Documentation/filesystems/ext3.txt.

Ext3 also has directory indexing, which is very useful for directories with large files or many files, as it speeds up file access. This is done by using hashed binary trees to store the directory information. By default, a directory is automatically indexed with a hash tree once it grows beyond a single disk block. To enable directory indexing explicitly, use the following command:

$ tune2fs -O dir_index /dev/hda1
This command only applies to directories created on the named filesystem after tune2fs runs. To apply directory indexing to existing directories, run e2fsck to optimize and reindex them:

$ e2fsck -D -f /dev/hda1
Another Ext3 enhancement is preallocation. This feature is useful when using Ext3 with multiple threads appending to files in the same directory. You can enable preallocation using the reservation option.

$ mount -t ext3 -o reservation /dev/hda1 /ext3
You can further improve Ext3 performance by keeping the file system's journal on a separate device. An external log improves performance because the log updates are saved to a different partition than the data of the corresponding file system. This reduces the number of hard disk seeks.

To create an external Ext3 log, run the mkfs utility on the journal device, making sure that the block size of the external journal is the same as the block size of the Ext3 file system. For example, with the following commands,

$ mkfs.ext3 -b 4096 -O journal_dev /dev/hda1
$ mkfs.ext3 -b 4096 -J device=/dev/hda1 /dev/hdb1
/dev/hda1 is used as the external log for the Ext3 file system on /dev/hdb1.

ReiserFS
The ReiserFS journaling file system supports metadata journaling and has a unique design that differentiates it from other journaling file systems. Specifically, ReiserFS stores all file system objects in a single b*-balanced tree. ReiserFS also supports compact, indexed directories, dynamic inode allocation, resizable items, and 60-bit offsets.

Like Ext3, the ReiserFS file system journal can be maintained separately from the file system itself. To accomplish this, your system needs two unused partitions. Assuming that /dev/hda1 is the external journal and /dev/hdb1 is the file system you want to create, simply run the command:

$ mkreiserfs -j /dev/hda1 /dev/hdb1
That’s all it takes.

In addition to an external journal, there are three mount options that can change the performance of ReiserFS: The hash option allows you to choose which hash algorithm to use to locate and write files within directories. There are three choices. The rupasov hashing algorithm is a fast hashing method that places and preserves locality, mapping lexicographically close file names to close hash values. The tea hashing algorithm is a Davis-Meyer function that creates keys by thoroughly permuting bits in the name. It achieves high randomness and, therefore, low probability of hash collision, but this entails performance costs. Finally, the r5 hashing algorithm is a modified version of the rupasov hash with a reduced probability of collisions. r5 is the default hashing algorithm. You can set the hash scheme using a command such as

$ mount -t reiserfs -o hash=tea /dev/hdb1 /mnt/reiserfs
There is another option called notail which disables the packing of files into the tree. By default, ReiserFS stores small files and “file tails” directly into the tree.

It is possible to combine mount options by separating them with a comma. Here’s an example that uses two mount options (noatime, notail) to increase file system performance:

$ mount -t reiserfs -o noatime,notail /dev/hdb1 /mnt/reiserfs

Performance Translators

Performance has always been a central requirement for clustered storage, so GlusterFS provides many performance-enhancing translators which help boost the performance of the overall file system.

write-behind Translator

volume writebehind
  type performance/write-behind
  option aggregate-size 131072 # in bytes
  subvolumes brick1
end-volume

In general, write operations are slower than reads. The write-behind translator improves write performance significantly by using an “aggregated background write” technique: multiple smaller write operations are aggregated into fewer larger write operations and written in the background (non-blocking). aggregate-size determines the block size up to which write data should be aggregated. You should tune this value depending on your interconnect, RAM size and workload profile. The default of 128KB works well for most users. Increasing or decreasing this value beyond a certain range will bring down your performance, so you should always benchmark with an increasing range of aggregate-size values and analyze the results to choose an optimum.

read-ahead Translator

volume readahead                                                                    
  type performance/read-ahead                                                        
  option page-size 65536 ### in bytes                                                
  option page-count 16 ### memory cache size is page-count x page-size per file      
  subvolumes brick1                                                                
end-volume
The read-ahead translator pre-fetches multiple blocks of data in the background into a local cache. This dramatically improves performance for consecutive read operations. Smaller read operations are also aggregated into a single larger block to reduce network and disk I/O calls. page-size describes the block size and page-count the number of blocks to pre-fetch.
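With the example values above, each open file can therefore hold up to page-count x page-size = 16 x 65536 bytes = 1 MiB of pre-fetched data, so this cache should be sized with the expected number of concurrently open files in mind.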

NIC Connection Speed

When using normal Gigabit Ethernet interfaces for Gluster, make sure that your Gluster I/O traffic goes through a NIC with a full-duplex 1000 Mbps connection. Even though this is standard in today's network topologies, some hosting solutions do not provide it out of the box. Performance degradation (especially for reads) will scale accordingly when slower connection speeds are used.

To view your NIC connection speed, just use:

ethtool ethX # ethX is the NIC used for Gluster I/O traffic
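The lines to check in the ethtool output are Speed and Duplex; a quick sketch (eth0 stands in for the Gluster-facing NIC):

$ ethtool eth0 | grep -E 'Speed|Duplex'
  Speed: 1000Mb/s
  Duplex: Full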

ib-verbs vs ib-sdp vs tcp

The ib-verbs driver provides the low-level InfiniBand Verbs API transport. The ib-sdp driver provides the Sockets Direct Protocol (the socket interface to RDMA) transport. tcp uses the regular TCP/IP session transport.

ib-verbs provides the lowest latency of all the transports, around 1 to 4 microseconds; ib-sdp is around 7 to 9 microseconds, and tcp around 70 to 90 microseconds. If you are looking for really fast storage, you should seriously consider InfiniBand, as it is increasingly becoming a commodity. The regular Gigabit Ethernet transport is good enough for most users. If you are clustering 4 nodes with 2 Gigabit Ethernet NICs each, you already have 8 Gbps of aggregated bandwidth.
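The transport is chosen in the client volume spec. A minimal sketch, assuming the legacy spec-file syntax used for the translators above (the remote host and subvolume name are placeholders):

volume client
  type protocol/client
  option transport-type ib-verbs/client # or ib-sdp/client, tcp/client
  option remote-host 192.168.1.10
  option remote-subvolume brick1
end-volume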
