How to optimize Ganglia for too much Disk IO

My Ganglia server was having performance problems: both disk I/O and load average were very high, so I looked into optimizing it. The whole setup runs on AWS, and even with a 10,000 IOPS EBS volume attached to the server I was still hitting disk performance issues.

My Server Details:
8 CPUs – 2.1 GHz
RAM – 8 GB
No. of hosts being graphed – ~150

Current load on the server (iostat output):
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.88    0.00    9.11   20.02    1.28   59.72

Device:  rrqm/s  wrqm/s    r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm   %util
xvdap1     0.00    8.00   0.00    13.00   0.00   0.33     51.32      0.09   7.02   1.11    1.44
xvdh       0.00  473.00  98.60  9700.00   0.39  40.39      8.52    145.26  14.82   0.10  100.00
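
Numbers like these can also be checked from a script. The following is a minimal sketch and not part of my original setup: busy_devices is a hypothetical helper name, and it assumes the extended iostat layout shown above, where %util is the last column of each device row.

```shell
# busy_devices LIMIT
# Reads `iostat -x` style output on stdin and prints "device %util" for every
# device whose %util (assumed to be the last column) is at or above LIMIT.
busy_devices() {
    awk -v limit="$1" '
        $1 ~ /^Device:?$/ { in_table = 1; next }   # header row opens the device table
        NF == 0           { in_table = 0; next }   # blank line closes it
        in_table && ($NF + 0) >= (limit + 0) { print $1, $NF }
    '
}
```

Assuming the sysstat package is installed, it could be used as: iostat -xm 1 2 | busy_devices 90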

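I used rrdcached to reduce the disk I/O. rrdcached queues RRD updates in memory and writes them to disk in batches, which cuts down the number of small writes. The setup steps are below.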

* Uninstall your existing rrdtool RPMs and install newer ones from RPMforge, or download them from the apt.sw.be site as shown below:

rpm -e rrdtool --nodeps
rpm -e rrdtool-perl --nodeps
wget http://apt.sw.be/redhat/el5/en/x86_64/dag/RPMS/rrdtool-1.4.7-1.el5.rf.x86_64.rpm
wget http://apt.sw.be/redhat/el5/en/x86_64/dag/RPMS/perl-rrdtool-1.4.7-1.el5.rf.x86_64.rpm
wget http://apt.sw.be/redhat/el5/en/x86_64/dag/RPMS/rrdtool-devel-1.4.7-1.el5.rf.x86_64.rpm
yum install -y perl-Time-HiRes
yum install -y libdbi
yum install -y xorg-x11-fonts-Type1
rpm -ivh rrdtool-1.4.7-1.el5.rf.x86_64.rpm perl-rrdtool-1.4.7-1.el5.rf.x86_64.rpm rrdtool-devel-1.4.7-1.el5.rf.x86_64.rpm
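
Before going further, it may be worth confirming that the installed rrdtool is new enough to ship rrdcached, which arrived with the 1.4 series. A minimal sketch follows; version_ge is a hypothetical helper name, and the check only runs on systems where rpm is available and the rrdtool package is installed.

```shell
# version_ge A B: succeed when dotted version A is greater than or equal to B.
# Compares numerically, field by field; missing fields count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Ask rpm for the installed rrdtool version and compare it with 1.4.
if command -v rpm >/dev/null 2>&1 &&
   installed=$(rpm -q --qf '%{VERSION}\n' rrdtool 2>/dev/null); then
    if version_ge "$installed" 1.4; then
        echo "rrdtool $installed should include rrdcached"
    else
        echo "rrdtool $installed is older than 1.4" >&2
    fi
fi
```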

* gmetad runs as the ganglia user, rrdcached needs write access to the RRD files, and Apache needs access to the same directory. So add the ganglia user to the apache group:

usermod -a -G apache ganglia

* Give the apache group access to the RRD directory. In my case the RRDs are stored on the /ganglia partition; by default they are under /var/lib/ganglia.

chown -R ganglia:apache /ganglia/rrds/
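
To double-check the two permission steps above, something like the following sketch can help. in_group is a hypothetical helper name; the ganglia user, the apache group and the /ganglia/rrds path are the ones from my setup, so adjust them to yours.

```shell
# in_group USER GROUP: succeed when USER is a member of GROUP.
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF "$2"
}

RRD_DIR=/ganglia/rrds   # adjust if your RRDs live elsewhere

if in_group ganglia apache; then
    echo "ganglia is in the apache group"
else
    echo "ganglia is NOT in the apache group" >&2
fi

# List anything under the RRD directory that is not group-owned by apache.
# No output here means the chown reached every file.
if [ -d "$RRD_DIR" ]; then
    find "$RRD_DIR" ! -group apache -print
fi
```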

* Now change the rrdcached startup options:

vim /etc/sysconfig/rrdcached
OPTIONS="rrdcached -s apache -m 664 -l unix:/tmp/rrdcached.sock -s apache -m 777 -P FLUSH,STATS,HELP -l unix:/tmp/rrdcached.limited.sock -b /ganglia/rrds -B"
RRDC_USER=ganglia
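
The OPTIONS string is dense, so here is the same set of flags built up piece by piece with comments, based on my reading of the rrdcached man page. Files under /etc/sysconfig are normally sourced as shell by the init script, so this form should behave the same way, but treat it as an explanatory sketch and check it against your own init script first.

```shell
OPTIONS="rrdcached"
# Full-access socket: group apache (-s), mode 664 (-m).
# -s and -m apply to the -l socket that follows them.
OPTIONS="$OPTIONS -s apache -m 664 -l unix:/tmp/rrdcached.sock"
# Limited socket: mode 777, but only FLUSH, STATS and HELP are permitted (-P).
OPTIONS="$OPTIONS -s apache -m 777 -P FLUSH,STATS,HELP -l unix:/tmp/rrdcached.limited.sock"
# -b sets the base directory; -B only permits writes inside it.
OPTIONS="$OPTIONS -b /ganglia/rrds -B"
RRDC_USER=ganglia
```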

* Also update the gmetad startup variables to use the rrdcached socket file:

vim /etc/sysconfig/gmetad
RRDCACHED_ADDRESS="unix:/tmp/rrdcached.sock"

* Now tell Ganglia Web to read from the socket file.

Set the variable below to the socket file location in the Ganglia Web configuration (usually conf.php in the web root). By default it has no value.

$conf['rrdcached_socket'] = "/tmp/rrdcached.sock";

* Always make sure rrdcached is started before the gmetad process:

/etc/init.d/gmetad stop
/etc/init.d/rrdcached start
/etc/init.d/gmetad start
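
If you script the restart, a short wait between the two starts avoids a race where gmetad comes up before the socket exists. This is a sketch under my setup's assumptions (init scripts in /etc/init.d, socket at /tmp/rrdcached.sock); wait_for_socket and restart_ganglia_stack are hypothetical helper names.

```shell
# wait_for_socket PATH [SECONDS]: poll until PATH is a UNIX socket.
# Returns 0 once it exists, non-zero if it has not appeared after
# SECONDS tries (default 10, one try per second).
wait_for_socket() {
    sock=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        [ -S "$sock" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    [ -S "$sock" ]
}

# restart_ganglia_stack: stop gmetad, start rrdcached, then start gmetad
# only after the socket is present. Needs to run as root.
restart_ganglia_stack() {
    /etc/init.d/gmetad stop
    /etc/init.d/rrdcached start
    if wait_for_socket /tmp/rrdcached.sock 10; then
        /etc/init.d/gmetad start
    else
        echo "rrdcached socket did not appear; gmetad not started" >&2
        return 1
    fi
}
```

Nothing runs until restart_ganglia_stack is called explicitly.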

* Now monitor the Ganglia logs:

tail -f /var/log/messages

If you don’t see an error such as “Unable to connect to rrdcache: No such file or directory”, your rrdcached settings are most likely correct.

Next, run “ps aux | grep -i rrdcached”. If you see rrdcached processes running as the ganglia user, you are good.

If both checks pass, you can consider rrdcached to be working.
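
The log check can also be run as a one-off command instead of watching tail. A minimal sketch: rrdcached_errors is a hypothetical helper name, and the pattern is simply the error text quoted above, matched case-insensitively.

```shell
# rrdcached_errors [FILE...]: print log lines that report a failed
# rrdcached connection. Reads stdin when no file is given.
# Exit status 0 means at least one such line was found, 1 means none.
rrdcached_errors() {
    grep -i 'unable to connect to rrdcache' "$@"
}
```

For example: rrdcached_errors /var/log/messages || echo "no rrdcached connection errors logged"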

* Alternatively, you can check whether rrdcached is working with:

while true; do clear; echo STATS | socat - /tmp/rrdcached.sock; sleep 1; done
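
The raw STATS output is easier to judge as a single ratio: how many updates rrdcached received for each one it wrote to disk. The sketch below assumes the "Name: value" lines that STATS returns, using the UpdatesReceived and UpdatesWritten counters described in the rrdcached man page; update_ratio is a hypothetical helper name.

```shell
# update_ratio: read rrdcached STATS output on stdin and print
# UpdatesReceived / UpdatesWritten with two decimals.
# Prints "n/a" when nothing has been written yet.
update_ratio() {
    awk -F': *' '
        $1 == "UpdatesReceived" { received = $2 }
        $1 == "UpdatesWritten"  { written  = $2 }
        END {
            if (written + 0 > 0) printf "%.2f\n", received / written
            else print "n/a"
        }
    '
}
```

Pipe the socat output into it, for example: echo STATS | socat - "$SOCK" | update_ratio, where SOCK is the socket path you configured. A ratio well above 1 suggests that updates are being batched before they reach the disk.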

Now run iostat again and compare the I/O with the earlier numbers. It should have decreased by at least a factor of 10. 🙂

Another way to improve Ganglia performance is to move the RRD directory to tmpfs.
