
The Galera installation guide for dummies.

Marco Tusa
MySQL
24 August 2013


The Galera series

 

Overview and some history

Some years ago (2011), I presented in an article what at that time I called "A dream on MySQL parallel replication" (http://www.tusacentral.net/joomla/index.php/mysql-blogs/96-a-dream-on-mysql-parallel-replication).

At that time I was dreaming about a replication mechanism working on the commit operations, and I was also trying to explain how useless it is to have parallel replication by schema. I have to say that I got a lot of feedback about that article, directly and indirectly, most of it telling me that I was crazy and that what I was looking for could not be done.

The only crazy guy who followed me in my wandering was Henrik Ingo, who mentioned Codership and Galera to me.

A few months later a customer came to me describing a scenario that, in my imagination, would have been perfect for the cloud and for an extension of MySQL called Galera.

At that time I was not really aware of all the different aspects of Galera, but I gave it a try. In November 2011 I did the first POC and started to collect valuable information about how to play with Galera.

It was my intention to write an article presenting it, but I was really too busy and in the end the article remained unfinished (http://www.tusacentral.net/joomla/index.php/mysql-blogs/119-galera-on-red-hat-is-123-part-1).

Instead I submitted to MySQL Connect 2012 the first version of a presentation about this POC, a POC that was only the first of a much longer list (http://www.slideshare.net/marcotusa/scaling-with-sync-replication-2012).

 

From September 2012 to April 2013 we have seen Galera, and the work coming from Codership, become more and more known in the community, thanks also to Percona and people like Jay Janssen, to the support coming from FromDual, and to the tools from Severalnines.

In April 2013 in Santa Clara I attended Jay's tutorial, also to see how others were dealing with what for me had become, in the last two years, a must use.

If you missed it, here is the link; please review the material. Jay is a great professional, and his tutorial was awesome (http://www.percona.com/live/mysql-conference-2013/sessions/percona-xtradb-cluster-galera-practice-part-1).

At the same conference I presented the updates of the POC, with some additional numbers, statistics and different binaries; in fact I had moved from MySQL/Oracle InnoDB to XtraDB.

Keeping in mind that we are still talking about 5.5, because Galera is not yet 5.6 production ready, the difference was significant. A lot of the inconsistent behaviour in thread handling that I had suffered with standard InnoDB was not present in XtraDB.

As of today, more than two years after that initial article, we have many Galera installations around, some of them used in very critical systems.

This is thanks to the continuous work of the Codership people, and to the others who believed in them: people like Henrik Ingo, Oli Sennhauser (FromDual), Johan Andersson (Severalnines), Jay Janssen (Percona) and myself.

Galera is becoming more and more a solid and trustworthy product. It still has some issues here and there, but the Codership team is quite efficient in fixing them once they are tracked down.

ANYHOW, I have to make a note here, and I am talking to all the companies that are taking advantage of Galera. Please remember that open source doesn't mean free; people still have to pay bills, and the Codership developers cannot provide and sustain Galera if you don't support them.

Buying Codership support is a very efficient way to get a quality service from the software producer, and at the same time to guarantee the future development of a product that allows you to do business.

 

Architecture

Now let's start talking about it.

After many installations and different combinations of building blocks, the following is for me the most flexible, scalable and solid solution for implementing a MySQL cluster using Galera.

Normally, if a customer asks me for advice, the following is my answer:

[Galera architecture diagram]

 

"The solution sees the HAProxy instance hosted directly on the application machine, the HAProxy then connect the application to the data layer rotating on them.

Given Codership had provide a check for HAProxy to recognize the status of the node in relation to his role, HAProxy is able to skip the MySQL nodes when those are in recover or Donor mode.

 

Our tests had shown that this approach is the most scalable in relation to Application/data connection, and at the same time is the one that reduce the impact to minimum, given each application tile is independent.

 

About MySQL this solution allow the data layer to scale both reads and writes. The scalability is anyhow limited by the amount of write and related bytes, which the data layer must share between nodes.

This is it; each MySQL node has a full copy of the dataset, as such each write or delete must be applying to all nodes.

 

The Galera replication layer is efficiently managing this effort, but efficiency is still dependant by the Data Storage and Network layer.

It is possible to have better understanding if this solution fulfils the requirements, calculating the real amount of traffic generated, and performing a projection.

Also a proof of concept is always suggested to validate the application functions against the MySQL/Galera cluster.

 

Pro

  • The solution scales writes and reads; it also allows no service interruption when using a proxy solution like HAProxy, which redirects the traffic in less than 5 seconds.
  • MySQL nodes can be accessed at all times for read and write operations.
  • The application can access the database directly if needed, or can be configured as a tile with HAProxy for better scalability.
  • A specific check is provided to identify the internal status of each MySQL/Galera node.
  • The solution uses InnoDB as the storage engine, so it behaves in a well-known way when responding to read operations.
  • This solution can scale out and scale in quite easily, given that to scale out we just need to add an even number of MySQL servers to an odd-sized cluster.
  • To scale in, it is just a matter of removing the nodes from HAProxy and then turning MySQL off.

Cons

  • Data is not partitioned across the MySQL nodes; it is fully replicated on all of them, so a lot of disk space will be used: (Data x Number of nodes) + (Binary log size x Number of nodes).
  • When a node is recovering it requires a donor node, which reduces the capacity of the cluster by the failed node plus the Donor. In the case of a 3-node MySQL cluster only one node would remain active, which is why the recommended minimum number of nodes in a busy environment is five.
  • The solution has been tested on physical machines and on Amazon EC2, across different Zones, but it requires a dedicated cross-zone network to prevent delays.

 

Minimum requirements

The minimum number of MySQL nodes for a cluster is 3, but if the application is critical, a cluster of 5 is strongly suggested in order to reduce possible issues when recovering a node.

Note that for quorum reasons the number of servers must be odd.

The network between the nodes must be reliable and have low latency.

 

Best usage

Applications that require write distribution and some scalability, with a medium load of writes per second and constant growth of the dataset size.

 

Uptime in nines

99.995%, which corresponds to roughly 26 minutes of downtime per year.

 

Solution Availability

MySQL with Galera is a GA solution, so there is no licensing cost in implementing it.

It is good practice to have a support contract with Codership as the software provider, especially to get a better reaction in case of bugs or feature requests."

 

Amen!

 

 

Implementation

Once you have identified your requirements and dimensioned the machines (or cloud instances) that will host the MySQL Galera implementation, be sure to have one network channel to serve the MySQL-application traffic, a different one to serve the Galera replication layer, and a very efficient data storage layer.

To get help and advice on that you can contact me any time.

 

 

Basic requirements

Packages:

  • xinetd
  • rsyslog
  • openssl-devel.x86_64
  • sysstat
  • iotop
  • netcat
  • htop
  • oprofile
  • Perl
  • Perl DBI
  • Perl DBD::mysql
  • Perl Time::HiRes

The nodes must accept network traffic from/to ports 3306, 3307, 3308 and 3311.

 

HAPROXY for:

RH5: http://rpm.pbone.net/index.php3/stat/4/idpl/17082875/dir/redhat_el_5/com/haproxy-1.3.26-1.el5.x86_64.rpm.html

RH6: http://rpm.pbone.net/index.php3?stat=26&dist=74&size=440708&name=haproxy-1.4.8-1.el6.x86_64.rpm

 

MySQL Galera From Percona:


 

MariaDB Galera implementation


 

Codership (Oracle) Galera implementation


 

Configuration

The initial configuration is normally based on 5 or 7 MySQL Galera nodes listening to each other on the internal network.

HAProxy will check MySQL using the script delivered by Codership, which recognizes the Galera state.

This check is installed ONLY on the MySQL/Galera nodes, using port 3311, so no configuration for it is needed on the HAProxy node.

 

The MySQL frontend will be reached by the application through HAProxy; each application node will have its own HAProxy installation.

Each application will then connect locally to HAProxy (127.0.0.1:3307), and HAProxy will redirect the connection to the final target.

The connections are distributed using RR (Round Robin) and are non-persistent.

That is, once the TCP/IP connection is closed, the same application node has no guarantee of reaching the same server again.

PORTS:

  • HAProxy will listen on port 3307
  • HAProxy will expose its status over HTTP on port 3308
  • MySQL will listen on port 3306
  • The MySQL check for HAProxy will operate on port 3311

MySQL:

Adjust the standard parameters to fit the hosting environment.

NOTE!! For innodb_io_capacity, run FIO or IOZONE to discover the real IOPS available and adjust the parameter to that value; just guessing is not what you want in production.
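As an example of such a measurement (a sketch only: the target directory, file size and runtime are values to adapt to your host), a random-read run with fio could look like this; take the reported IOPS, apply a safety margin, and use that as the starting value for innodb_io_capacity.

fio --name=iops-test --directory=/home/mysql/data --rw=randread --bs=16k \
    --direct=1 --ioengine=libaio --size=2G --numjobs=4 --runtime=60 \
    --time_based --group_reporting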

 

Galera:

As a matter of principle the WSREP settings are tuned during the POC, but some fine-tuning may be required when moving to production.

I advise you to spend some time checking and tuning the following.

 

wsrep_max_ws_rows= 131072
wsrep_max_ws_size= 2147483648
wsrep_slave_threads= 48
wsrep_provider_options="gcache.size=10240M;
			evs.send_window=512;
			evs.user_send_window=512"
 

 

Parameters to keep an eye on are the send/receive queues and gcache.size.

About the latter there is something that must be clarified, namely why it is very important to set it large enough.

First of all you should understand that when a node becomes a DONOR the server will not be accessible for write operations, and as such it will be removed from the pool by HAProxy until the node has finished feeding the JOINER.

Galera has two ways of synchronizing a starting or recovering node: IST and SST.

IST

When performing a synchronization with IST, Galera sends the restarting node ONLY the information present in the GCache; this can be seen as an INCREMENTAL update.

For instance, if you have a cluster of 5 nodes and for maintenance reasons you need to take them down in rotation, the node that is down will miss a set of operations during the maintenance window.

When you start it back up, Galera reads the last position the node has registered locally and asks the donor to start from there. If the DONOR still has that position in its GCache, it will send the restarting node the data from that point on.

This operation is normally much faster and also has a very limited impact on the DONOR.

SST

This is a rebuild from scratch; normally it applies when the node is started for the first time, and/or after a crash.

The operation can be very time consuming when talking about datasets of significant size.

There are several methods that can be used for SST, from mysqldump to Xtrabackup. I have almost always chosen Xtrabackup, which is very well integrated in the Percona Galera distribution and guarantees performance and data safety.

But obviously when you need to back up several hundred gigabytes, the process will take some time. Or if you have a very high level of inserts and, say, one or two hundred gigabytes, again the combination of time and data size will be fatal.

The main point is that in these cases the time Galera keeps the DONOR out of rotation, in order to back up and transmit the data to the JOINER, will afterwards be too long for the DONOR to recover from its own GCache once the operation is finished, transforming the DONOR into an additional JOINER.

 

I have observed this cascade effect on several clusters that were not configured correctly in relation to their traffic and data size.

Clusters of 7 or more nodes simply going on hold because the nodes were not able to store enough information in the GCache. It is true that when Galera is left with one node, given that it is in DONOR mode, it stops the writes, allowing the cluster to heal itself. But it is also true that this can be a very long operation, and in production it is quite disruptive.

So what to do? Easy: just calculate beforehand what the worst scenario is for you, then size the GCache to be at least one third bigger than that, if not more. Try to be safe and stay on IST; this is very important if you have a very high level of writes.

What I do is make sure the GCache is large enough to hold the modification statements for double the time needed to take a full backup.


E.g.

With five nodes and a set of binary logs of 20GB per day, if a full backup with XTRABACKUP takes 6 hours the GCache size should be:

GCache = ((BS x (Tb/hrsDay)) x Mn) x 2

GCache = ((20 x (6/24)) x 5) x 2 = 50GB

BS     - size of the binary logs generated per day
Tb     - time needed for the backup, in hours
Mn     - number of MySQL/Galera nodes
hrsDay - hours in a day (24)

 

This should be sufficient to have a decent amount of time and space to be safe.
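As a quick sanity check the same calculation can be scripted; the values below are simply the ones from the example above (BS = daily binlog size in GB, TB = backup duration in hours, MN = number of nodes).

awk -v BS=20 -v TB=6 -v MN=5 'BEGIN { printf "gcache.size should be >= %.0fG\n", BS * (TB/24) * MN * 2 }'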

 

Finally, remember that Galera with Xtrabackup REQUIRES Perl with DBI and DBD::mysql in place, or the synchronization will fail!

 

 

Main steps

1) configure the environment

- install xinetd (if not present)

- create user

- create directory layout

- download software for MySQL/Galera

- Install HAProxy

2) Deploy the first MySQL Galera node

- create the basic mysql database

- create basic grants

- test and clean up the other accounts

3) Deploy all other nodes

4) Deploy HAProxy

- review configuration and change in relation to the network

- start HAProxy

5) Test connection from client to HAProxy

6) Perform test to validate the installation

7) Load data set 

 

Step by step

configure the environment

1) Install the basic tools, if the customer agrees:

yum -y install htop sysstat screen xinetd haproxy iotop nc

rpm -Uvh http://mirrors.kernel.org/fedora-epel/6/i386/epel-release-6-8.noarch.rpm

or

rpm -Uvh http://mirrors.kernel.org/fedora-epel/5/i386/epel-release-5-4.noarch.rpm

 

2) If you like, run inside a screen session; if needed you can then detach without interfering with the installation process.

 

screen -h 50000 -S MySQLINSTALL

 

 

 

3) Check if ANY MySQL daemon is present (from RPM) and running; if so, remove it.

 

ps aux | grep mysql
rpm -qa | grep -i mysql
# remove any package found, e.g.:
#   rpm -e --nodeps <package>    (or: rpm -ev --allmatches <package>)
rpm -qa | grep -i mysql
#Remove OLD binaries/symlinks left over from a previous tarball install
for bin in $(ls /usr/local/mysql/bin/); do rm -f /usr/bin/$bin; done
for lib in $(ls /usr/local/mysql/lib/libmysqlclient*); do rm -f /usr/lib64/mysql/$(basename $lib); done
for bin in $(ls /usr/local/xtrabackup/bin/); do rm -f /usr/bin/$bin; done

 

 

4) Create the user and the directory structure



userdel mysql
rm -fr /home/mysql/
groupadd mysql
useradd -m -g mysql -s /bin/bash -d /home/mysql mysql
passwd mysql
mkdir -p /opt/mysql_templates/
# the /usr/local/mysql symlink pointing into /opt/mysql_templates/ is created in step 6

 

In case of a binary install:

5) Download the MySQL version

 

wget http://www.percona.com/redir/downloads/Percona-XtraDB-Cluster/LATEST/binary/linux/x86_64/Percona-XtraDB-Cluster-5.5.30-23.7.4.405.Linux.x86_64.tar.gz
tar -xzf Percona-XtraDB-Cluster-5.5.30-23.7.4.405.Linux.x86_64.tar.gz
wget http://www.percona.com/redir/downloads/XtraBackup/LATEST/binary/Linux/x86_64/percona-xtrabackup-2.1.3-608.tar.gz
tar -xzf percona-xtrabackup-2.1.3-608.tar.gz
 

 

6) Create symbolic links to /usr/local

 

ln -s /opt/mysql_templates/Percona-XtraDB-Cluster-5.5.30-23.7.4.405.Linux.x86_64 /usr/local/mysql
ln -s /opt/mysql_templates/percona-xtrabackup-2.1.3 /usr/local/xtrabackup
 

 

7) Create symbolic links to /usr/bin

 

#Install new
for bin in $(ls /usr/local/mysql/bin/); do ln -s /usr/local/mysql/bin/$bin /usr/bin/$bin; done
for bin in $(ls /usr/local/xtrabackup/bin/); do ln -s /usr/local/xtrabackup/bin/$bin /usr/bin/$bin; done
 

 

 

#Set security (executable bits)
for bin in $(ls /usr/local/mysql/bin/); do chmod +x /usr/bin/$bin; done
for bin in $(ls /usr/local/xtrabackup/bin/); do chmod +x /usr/local/xtrabackup/bin/$bin /usr/bin/$bin; done
 

 

8) Move the service script from the original directory

 

mv /usr/local/mysql/support-files/mysql.server /etc/init.d/mysqld

 

Edit the file, filling in the basedir and datadir variables; this is not always needed.
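For reference, the two variables sit near the top of the script; the values below are only examples matching the layout used in this guide (the data directory path in particular is an assumption, use the one from your my.cnf).

# in /etc/init.d/mysqld
basedir=/usr/local/mysql
datadir=/home/mysql/data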

9) Edit my.cnf to match the paths and set the initial values
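As a reference, a minimal sketch of the Galera-related part of my.cnf could look like the following. Paths, names and IPs are examples only and must be adapted; the wsrep_sst_auth credentials match the sstuser grants created later in this guide.

[mysqld]
basedir                  = /usr/local/mysql
datadir                  = /home/mysql/data                        # example path
user                     = mysql
server-id                = 1                                       # unique per node
binlog_format            = ROW                                     # mandatory for Galera
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2                                       # required by Galera
wsrep_provider           = /usr/local/mysql/lib/libgalera_smm.so   # adjust to where your distribution puts the library
wsrep_cluster_name       = my_galera_cluster
wsrep_cluster_address    = gcomm://10.5.1.6,10.5.1.8,10.5.1.9,10.5.1.10
wsrep_node_name          = node1                                   # unique per node
wsrep_sst_method         = xtrabackup
wsrep_sst_auth           = sstuser:mysqlsst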

10) Reset security

 

chown -R mysql:mysql /opt/mysql_templates
chmod +x /usr/local/mysql

 

Configure HAProxy check

Require:

xinetd

/usr/bin/clustercheck

The clustercheck file comes directly with the Percona distribution; you just have to be sure that it is in the execution path.

1) Set up the HAProxy check using the prepared file (mysqlchk):

 

vi /etc/xinetd.d/mysqlchk

# default: on
# description: mysqlchk
service mysqlchk
{
# this is a config for xinetd, place it in /etc/xinetd.d/
disable = no
flags = REUSE
socket_type = stream
port = 3311
wait = no
user = nobody
server = /usr/bin/clustercheck
log_on_failure += USERID
only_from = 0.0.0.0
# recommended to put the IPs that need
# to connect exclusively (security purposes)
per_source = UNLIMITED
}
 

 

2) check for free port

 

cat /etc/services |grep 3311

 

Add the mysqlchk service to /etc/services:

 

echo "mysqlchk 3311/tcp # mysqlchk" >> /etc/services

 

3) Add the configuration for the mysqlchk service to /etc/xinetd.d/ (as prepared above), then restart xinetd.
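On RH5/RH6 this is simply:

/etc/init.d/xinetd restart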

 

4) Check it

 

telnet 127.0.0.1 3311
	Trying 127.0.0.1...
	Connected to localhost.localdomain (127.0.0.1).
	Escape character is '^]'.
	HTTP/1.1 200 OK
	Content-Type: text/plain
	Connection: close
	Content-Length: 40
	Percona XtraDB Cluster Node is synced.
	Connection closed by foreign host.

 

 

 

Perl setup

You should do this the way you are most comfortable with; just be careful not to do a double installation between yum/apt-get and cpan. By default these two methods install the libraries in different places, and will give you a nightmare of cleanup and library conflicts.

Be sure to have DBI and DBD::mysql installed, where DBD::mysql should be version perl-DBD-MySQL-4.019 or newer.

 

Deploy the first MySQL Galera node

1) Create initial MySQL database FOR BINARY INSTALL ONLY:

 

su - mysql
cd /usr/local/mysql/
./scripts/mysql_install_db --defaults-file=/etc/my.cnf

 

 

Carefully check the output: you should see OK twice; if not, check the error log.

 

2) Start mysql

 

/etc/init.d/mysqld start --wsrep_cluster_address=gcomm://

 

 

Check the error log for possible errors

 

tail -fn 200 <path-to-the-mysql-error-log>

 

 

 

3) connect for the first time and change security

 

mysql -uroot
set PASSWORD for root@'localhost'=PASSWORD('secret');
 

 

Grant access for xtrabackup

 

CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 'mysqlsst';
GRANT RELOAD, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';
GRANT RELOAD, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sstuser'@'192.168.1.%';

mysql> FLUSH PRIVILEGES;
 

 

Grant access to haproxy checks

 

CREATE USER 'clustercheckuser'@'localhost' IDENTIFIED BY 'clustercheckpassword!';
GRANT PROCESS ON *.* TO 'clustercheckuser'@'127.0.0.1';
GRANT PROCESS ON *.* TO 'clustercheckuser'@'192.168.1.%';
Flush privileges;
 

 

Remove generic users:

 

DROP USER ''@'localhost';
DROP USER ''@'<hostname>';    -- drop the anonymous user for the host as well
GRANT ALL ON *.* TO dbaadmin@'localhost' IDENTIFIED BY 'secret';
 

 

 

4) Collect statistics and information:

 

SHOW GLOBAL STATUS;
SHOW GLOBAL VARIABLES;
STATUS;
 

 

 

5) Stop server

6) restart server

 

/etc/init.d/mysqld start --wsrep_cluster_address=gcomm://

 

Deploy all other nodes

 

On each node:

1) modify the server identification in the my.cnf

 

wsrep_node_name=pchgny1    # <-- unique name for each node
server-id=1                # <-- unique id for each node
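Keep in mind that, unlike the first node (which is bootstrapped with an empty gcomm:// address), the joining nodes must point to the members that are already running. A sketch, using the example IPs from the HAProxy configuration below:

wsrep_cluster_address=gcomm://10.5.1.6,10.5.1.8,10.5.1.9,10.5.1.10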
 

 

 

2) start the node checking the mysql log

/etc/init.d/mysqld start

 

Deploy HAProxy

Connect to the application servers and perform the HAProxy installation.

 

wget the HAProxy package related to the host OS
rpm -iUvh haproxy-1.4.22-4.el6_4.x86_64.rpm

 

 

1) Set the configuration file on each HAProxy node

As a matter of principle HAProxy is quite efficient at monitoring and reporting the status of the nodes on its HTML interface when using the HTTP protocol; this is not true when using TCP.
Given that, I used the trick of running the HTTP protocol on a different port, just for the purpose of reporting.

 

#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    # to have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events.  This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local1 events to go to the /var/log/haproxy.log
    #   file. A line like the following can be added to
    #   /etc/sysconfig/syslog
    #
    #    local1.*                       /var/log/haproxy.log
    #
   log 127.0.0.1   local1 notice
 
    maxconn     40096
    user        haproxy
    group       haproxy
    daemon
 
    # turn on stats unix socket
    #stats socket /var/lib/haproxy/stats
 
 
 
#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  tcplog
    option                  dontlognull
    option                  redispatch
    retries                 3
    maxconn 4096
    contimeout 160000
    clitimeout 240000
    srvtimeout 240000
 
 
 
 
#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
 
listen stats 0.0.0.0:3308
    mode http
    stats enable
#    stats hide-version
    stats scope .
    stats realm Haproxy\ Statistics
    stats uri /haproxy?stats
    stats refresh 5s
#    stats auth xxxxx:xxxxx
    option contstats
    option httpchk
#        option mysql-check user test
 
    server node1 10.5.1.6:3306 check port 3311 inter 2000 rise 3 fall 3
    server node3 10.5.1.8:3306 check port 3311 inter 2000 rise 3 fall 3
    server node4 10.5.1.9:3306 check port 3311 inter 2000 rise 3 fall 3
    server node5 10.5.1.10:3306 check port 3311 inter 2000 rise 3 fall 3

 
 
listen galera_cluster 0.0.0.0:3307
#Logging
	log global
	option dontlognull
#mode
    mode tcp
# balancer
    balance roundrobin
 
#options
#	option abortonclose
#	option forceclose
	option clitcpka 
	option tcpka
    option httpchk
#        option mysql-check user test
 
    server node1 10.5.1.6:3306 check port 3311 inter 2000 rise 3 fall 3
    server node3 10.5.1.8:3306 check port 3311 inter 2000 rise 3 fall 3
    server node4 10.5.1.9:3306 check port 3311 inter 2000 rise 3 fall 3
    server node5 10.5.1.10:3306 check port 3311 inter 2000 rise 3 fall 3
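
Before starting the service it is worth validating the file; HAProxy can check the configuration syntax without starting:

haproxy -c -f /etc/haproxy/haproxy.cfg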

 

 

 

2) add logging

to add logging using rsyslog

vim /etc/rsyslog.conf

 

Enable the following (uncomment it):

 

 

# Provides UDP syslog reception
$ModLoad imudp.so
$UDPServerRun 514
 

 

and add

 

#HAProxy log
local1.* /var/log/haproxy.log
 

 

Finally restart rsyslog

/etc/init.d/rsyslog restart

 

 

3) Start HAProxy

/etc/init.d/haproxy start

 

4) Check the server status using the web interface

Using a web browser check from:

http://<haproxy-host>:3308/haproxy?stats

You will see, or you SHOULD see, the HTML page reporting the status of your nodes.

 

Quick check for the connection

Connect to MySQL using the mysql client and a simple watch to cycle through the servers.

 

watch -n1 -d 'mysql -udbaadmin -p -h <node_ip> -e "show global status" | grep -E "wsrep_ready|wsrep_last_committed|wsrep_replicated|wsrep_received|wsrep_local_commits|wsrep_local_cert_failures|wsrep_local_bf_aborts|wsrep_local_send_queue|wsrep_local_recv_queue|wsrep_local_state_comment"'


 

To see how HAProxy redirects the connections from the APPLICATION NODE:

 

watch -n1 -d 'mysql -h127.0.0.1 -P3307 -u<user> -p -e "show global variables" | grep -i -E "server_id|wsrep_node_name"'

 

 

You will see the values changing at each request.

Try to take down one node and see what happens on the HAProxy web interface and in the running command.

If all is fine, it will be quite fun to see how easily and quickly it handles the node shutting down.

 

POC steps

Finally, this is just an example of what we cover when doing the POC; it obviously varies from customer to customer.

 

POC Tests

Functional tests:

1) light data load/read on all nodes
- perform loading on all nodes

- perform data read on all nodes

 

2) Query/inserting while one node is failing
- perform selects on all nodes
- perform inserts on all nodes
Expectations:
No service interruption

No difference in the result sets between nodes.

 

3) Query/inserting while a node is recovering
- perform selects on all data nodes
- perform inserts on all data nodes
- identify which node becomes the donor
Expectations:
Minimal service degradation, because the Donor node will not be available
No service interruption
No difference in the result sets
Recovery performed by IST if inside the boundary of the cache
Recovery performed by SST if bigger than the cache

 

Performance/capacity tests (including the differences between using ONE single node, three to seven nodes, and full capacity):

1) Execute an increasing write-only load from the application nodes (using simple light inserts like a real-time application; more complex inserts like heavier records, batch inserts, multiple linked tables)
- perform increasing inserts and deletes using from 4 to 96 threads per application block (2 blocks per application)
Expectations:
- validate the maximum load limit
- identify the lag in the Galera replication, if any, when under heavy stress
Measure:
- IO
- memory usage
- internal MySQL/Galera metrics
- thread contention

 

2) Execute an increasing mixed read/write load
- perform increasing selects/inserts and deletes using from 4 to 96 threads per application block
Expectations:
- validate the maximum load limit
- identify the lag in the Galera replication, if any, when under heavy mixed read and write stress
Measure:
- IO
- memory usage
- internal MySQL/Galera metrics
- thread contention

 

3) Perform a node data crash and recovery while a heavy load is running

Same test as 1.2 and 1.3

 

4) Optimizations:
1) Implement partitioning on the tables and execute historical archiving
- Test the impact of the optimization on data access
- Test the impact of managing the partitions (optimizing/defragmenting/dropping)

 

 

Additional articles about Galera

There are a few forthcoming articles I am writing in the same series:

Galera: understanding what to monitor and how

Galera tests and numbers: what I have proven is possible to achieve, in numbers and graphs.


Reference

Jay Janssen (Percona)

http://www.mysqlperformanceblog.com/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/

http://www.mysqlperformanceblog.com/2013/05/02/galera-flow-control-in-percona-xtradb-cluster-for-mysql/

https://www.percona.com/live/mysql-conference-2013/users/jay-janssen

 

Oli Sennhauser (FromDual)

http://www.fromdual.com/mysql-and-galera-load-balancer

http://www.fromdual.com/galera-load-balancer-documentation

http://www.fromdual.com/unbreakable-mysql-cluster-with-galera-and-lvs

http://www.fromdual.com/switching-from-mysql-myisam-to-galera-cluster

http://www.fromdual.com/galera-cluster-nagios-plugin-en

 

Codership 

http://codership.com/

http://www.codership.com/wiki/doku.php?id=mysql_options_0.8

http://www.codership.com/wiki/doku.php?id=galera_parameters

http://www.codership.com/wiki/doku.php?id=galera_status_0.8

http://www.codership.com/wiki/doku.php?id=flow_control

http://www.codership.com/wiki/doku.php?id=galera_arbitrator

http://www.codership.com/wiki/doku.php?id=sst_mysql

http://www.codership.com/wiki/doku.php?id=ist

 

Severalnines

http://www.severalnines.com/clustercontrol-mysql-galera-tutorial

http://www.severalnines.com/blog/migrating-mysql-galera-cluster-new-data-center-without-downtime

http://www.severalnines.com/galera-configurator/

What I think about the Percona Live conference 2013.

Marco Tusa
MySQL
28 April 2013

 


Needless to say that, like many others, I enjoyed this conference a lot.

For me it was also a personal success, because I was finally able to have my company sponsor the event and to bring new members of my team to the conference as speakers. Obviously it is all their merit, but as the person responsible for the MySQL cluster at Pythian I was happy, knowing how much it cost them to be there.

We also had a lot of people at the community dinner; the last head count I did was over 120 people. In this regard I have to clarify some misunderstanding and confusion we had there.

Pedro asked us to have the final head count by Tuesday morning, so I closed the count from the comments on the web site and from the emails around noon. The number I passed to Pedro based on that was 70-80 people, but we had additional people registering after that time and also last-moment shows, so we had to manage an additional 40 people who had to be seated in different areas. Quite obviously I did not give clear enough instructions; I personally apologize for that, and I (we) will do better next year. I hope you had a good time and good food, even if a little bit detached from the other tables.

 

What about the conference?

My feeling is that this event consolidates what is the “core” of the MySQL community.

We have seen many companies providing the same service sitting and talking one next to the other with a positive spirit and attitude. I have personally been chatting all the time with people from SkySQL, Percona and others, all of us with an open and friendly attitude.

I have seen Oracle people participating in the conference (hurray!!!); as an IOUG committee member I know very well the number of times we have asked Oracle to be there, and they were! This was good, period.

 

MySQL where are you going?

In relation to what is happening to MySQL, and where that is leading us, I have confirmed my idea that nowadays we are not talking anymore about LAMP or a full stack when we talk about MySQL.

What customers, companies and users are expecting is a more complex and articulated environment. Interaction between different blocks is now not optional but a fact.

When we approach an existing environment, or when we have to build a new one, we now think in terms of hundreds of application servers, terabytes of data to store or process, many different client platforms to support and an impressive amount of data to analyze for reporting.

 

Feel free to fill the boxes with the name of the product that you like most, but for sure you will not limit yourself to MySQL or LAMP.

Already today I have customers using MySQL, Oracle Database, MongoDB, Hadoop and more, all in one single connected environment.

 

Thinking in terms of MySQL only when we think of product, service, monitoring or design is too limiting.

 

For instance, a tool that monitors MySQL but does not catch its interaction with other elements like Hadoop is going to provide only part of the picture. That partial picture, referring only to MySQL metrics, will be close to useless, because it will not be able to provide all the information required to perform a valid analysis and, eventually, a projection.

In other terms it will cover the basics of the single block, but it will not help us get the big picture. That is, it is still useful to keep the block in a decent state, but it will leave you blind to what is going on in the whole context.

This is in many respects true also for the products: each block, MySQL included, must become more and more flexible in exchanging data with the others. This can be achieved by developing specific interfaces, or by defining a common protocol/format of communication that is shared between the different blocks.

 

In the MySQL universe (or MariaDB), this also means keeping consistency and remaining as open as possible to facilitate the creation of additional plug-ins/engines.

But what really worries me, given that it is also "my" field, is "service". Those environments require support, design and so on. We know very well how complex a MySQL environment can be; what about it when we start to have many other actors involved? What really scares me is the level of knowledge required to cover all of them, or even just a segment.

I am convinced that we will have to work around it, because users/customers/companies will ask us to provide support for all the elements in their architecture (actually it is already happening), and the real risk is to have, or to become, generalists instead of high-profile experts.

 

If you don't do it, if you do not differentiate, the risk is to be isolated by the market: yes, you may be very smart in that specific area, but not able to understand the big picture, ergo useless.

On the other hand, trying to do too much could drive you (as a company) to disperse your resources and end up with an average level, which is good but not Excellent with a capital "E".

A possible solution is to have a huge monster with hundreds and hundreds of people, and divisions per technology... and... well, I have already seen several of them start and die. They start well, with very high service quality, and then become big, heavy and so slow that customers move out.

 

No, that is not the solution; the solution resides in being able to balance correctly what you can do with your resources and a reasonable growth, and what you cannot.

I work in a multi-technology company, and I know very well what I am talking about when I say that balance is the key. The future will need to see two things: one is companies improving their capacity to cover more than just MySQL; the other is opening the space to collaboration. Companies covering different technologies must start to interact more, and offer better service and results to the users/customers in a cooperative way.

That will allow each single company to remain focused on a few things and keep a high level of expertise in the chosen areas. Working in a cooperative way is the key.

All this needs to happen, and it requires coordination.

 

Flexibility and coordination are the keywords for the future. The MySQL community has already shown how much energy it has, how strong it can be in difficult moments and how much we really care about our customers/users.


What I see for the future is all of us working together, gathering all the actors involved, and giving life to a new ecosystem which will help facilitate the evolution of the next generation of data and applications.

 

What about the Speeches

In terms of talks I have to say I was expecting a little bit more, not only from the speakers (me included), but also from the product companies.

As said, I think it is time to move to the next step, and I was expecting more talks about interactions between technologies.

 

I am not saying that we should not cover the basics; we must do it, but having more talks on how MySQL and MongoDB can coexist, or on how we could help terabytes of data to be processed between A and B; well, that would have been nice.

Not only about what we have now, but also about what we are planning for the future, including new features and ideas for development.

 

In this regard the only relevant speeches I have seen were the one done by my colleague Danil Zburivsky on Hadoop/MySQL, and the one about JSON by Anders Karlsson during the MariaDB/SkySQL event.

Thanks guys, you see the future; and shame on Marco, who was thinking about it and could have done it but did not... maybe for the next conference.

 

That said, the level of the speeches was good. I have been talking with the people attending, most of them satisfied, but let us wait and see what the evaluations will reveal.

What I can say is that I really enjoyed the tutorial on XtraDB and Galera done by Jay Janssen, which helped me feel less alone in the Galera implementation adventure; and I regret having missed "InnoDB: A journey to the core" by Davi Arnaut and Jeremy Cole. But I was presenting in the same slot, and it would not have been nice for me to tell the people there, OK, let us all move to the next room.

 

What about the Expo

WoooW, first time as a sponsor and first time with a booth. A lot of talks, a lot of possible new friends and a better understanding of what we need to do to be more effective next time. T-shirts first!!! Lesson learned: we brought too few... we could have covered the whole Bay Area with LOVE YOUR DATA. We missed the target this year; we will not do the same next time.

 

I think that this year we had a well-balanced expo, with less show but more focus. I must also mention the presence, outside the expo area, of Julian Cash (http://jceventphoto.com/index.html), who took a lot of cool shots of most of us.

I know Julian was there during other conferences as well, but I had never met him before. I did this year, and it was a great experience; I hate having my photo taken, but with Julian I was having so much fun that in the end I loved it.

 

What about the Lightning Talks?

Another well-established event at the MySQL conference, and every year we have fun. This year I enjoyed Shlomi's one, and absolutely AWESOME was the performance from the Tokutek team.

 

About the Lightning talks, and just to confirm what Dave Apgar was saying in his really good presentation: shit happens and you never know when. I had set up my video camera and recorded the WHOLE event, and guess what... my new SD card just failed, and nothing, I mean NOTHING, was there afterwards. Next time I will come with TWO video cameras and will set up redundancy!

 

Announcements

Finally during the conference we had two very significant announcements.

The first one was the expected merge of MariaDB and SkySQL. Nothing new, but it is good to see that SkySQL is defining its identity with more determination. And not only that: this merge is very important because all MariaDB users now have a clear reference point, which will help them and the community to better adopt and improve MariaDB. Way to go guys, well done!

 

The second one is about Tokutek (http://www.tokutek.com/), finally open source. I tested it for the first time 3 years ago, and it was a very interesting technology, but hard to get implemented because customers were reluctant to go for non-open-source code.

Just a note: I wrote open source, not free. Open source doesn't mean free, and here the concept was very clear; customers were willing to "eventually" pay, but not for closed code.

Tokutek's move is smart not only because it will allow the company to have substantial help from the community in identifying issues, improving utilization and identifying new trends, but also because it removes the last philosophical barrier to the software's adoption.

 

From the technical point of view, the presentations showed a significant improvement with respect to the previous years, and I was very impressed by the presentation done by Gerry Narvaja during the SkySQL/MariaDB event.

One thing is sure: I have customers that could take huge benefit from Tokutek, and I will give it a try right away, starting next week.

 

Winner and loser

No doubt from my side; I did not even mention it before because for me it is a given.

On the 12th of March 2011 I wrote this article http://www.tusacentral.net/joomla/index.php/mysql-blogs/96-a-dream-on-mysql-parallel-replication, which was my dream about replication. At the time of writing I was not aware of, or testing, ANY software able to do what I was asking for.

On the 23rd of November 2011 I wrote another article http://www.tusacentral.net/joomla/index.php/mysql-blogs/119-galera-on-red-hat-is-123-part-1, and that was my first approach to Galera.

On the 29th of September 2012 I presented the first results of a POC done on a customer environment http://www.slideshare.net/marcotusa/scaling-with-sync-replication-2012.

Next week, I must implement another MySQL cluster, based on Galera replication.

 

The winner for me is the Galera solution (http://www.codership.com/), whatever version you may like; from my side I have found that the Percona version is the most stable, and using the Severalnines tool to manage the cluster (http://www.severalnines.com/clustercontrol) is also helpful.

 

 

Who is the loser then?

All the ones who believed MySQL was over in 2010 (the Oracle takeover).

We have MySQL from Oracle, we have MariaDB from Monty, we have companies developing their own storage engines and tools; we have a more complex ecosystem that is growing day by day.

No, MySQL is not over at all.

One note only: whoever leads the development, on any side, remember that you MUST allow the community to use and develop code on top of yours. Modifying interfaces without documenting them, or not being fully explicit on what to do, how to do it, and in which direction, well, it is not fair.

Percona Live, the MySQL Conference in Santa Clara, was a great conference, done by great people. We can all do better, always, but what makes me feel good is that I know we will do better next year.

 

Last note…

Didn't you miss someone? Did you, as I did, feel that something was not right? Wasn't a face missing?

I mean... yes! Him! Baron Schwartz!! Hey man, we miss you!!! Or at least I miss you and your notepad during the presentations; come back ASAP.

 

Happy MySQL to everyone.

 

References

http://www.codership.com/

http://www.severalnines.com/

http://jceventphoto.com/index.html

https://www.percona.com/live/mysql-conference-2013/

http://www.tokutek.com/

http://www.skysql.com/

http://www.pythian.com/

https://vividcortex.com

And more ...

 


Amazon EC2 - RDS quick comparison

Marco Tusa
MySQL
11 March 2013

What this is about

The following is a review of the real status of Amazon RDS in comparison with EC2.

The purpose is to have a better understanding of the possible limitations in the platform usage, and of what is a possible fit and what is not.

 


 

Why

I did a first review a year ago for internal purposes, but now we are receiving the same questions over and over from different customers.

Given that, and given that a lot of things can happen in one year, I decided to repeat the review and perform the tests once again.

What needs to be underlined is that I am doing this with usage in PRODUCTION in mind, not QA or development.

So my considerations are obviously focused on the more demanding scenarios.

 

 

About EC2 and RDS.

Machine configuration

There are different configurations we can choose to start our EC2 or RDS instances with; both have different costs and "virtual" physical characteristics. The list for both is below:

EC2

T1 Micro (t1.micro)                                 Free tier eligible, up to 2 ECUs, 1 core, 613 MiB
M1 Small (m1.small)                                 1 ECU, 1 core, 1.7 GiB
M1 Medium (m1.medium)                               2 ECUs, 1 core, 3.7 GiB
M1 Large (m1.large)                                 4 ECUs, 2 cores, 7.5 GiB
M1 Extra Large (m1.xlarge)                          8 ECUs, 4 cores, 15 GiB
M3 Extra Large (m3.xlarge)                          13 ECUs, 4 cores, 15 GiB
M3 Double Extra Large (m3.2xlarge)                  26 ECUs, 8 cores, 30 GiB
M2 High-Memory Extra Large (m2.xlarge)              6.5 ECUs, 2 cores, 17.1 GiB
M2 High-Memory Double Extra Large (m2.2xlarge)      13 ECUs, 4 cores, 34.2 GiB
M2 High-Memory Quadruple Extra Large (m2.4xlarge)   26 ECUs, 8 cores, 68.4 GiB
C1 High-CPU Medium (c1.medium)                      5 ECUs, 2 cores, 1.7 GiB
C1 High-CPU Extra Large (c1.xlarge)                 20 ECUs, 8 cores, 7 GiB
High Storage Eight Extra Large (hs1.8xlarge)        35 ECUs, 16 cores, 117 GiB

RDS
Micro DB Instance: 630 MB memory, up to 2 ECUs (for short periodic bursts), 64-bit platform, Low I/O Capacity
Small DB Instance: 1.7 GB memory, 1 ECU (1 virtual core with 1 ECU), 64-bit platform, Moderate I/O Capacity
Medium DB Instance: 3.75 GB memory, 2 ECUs (1 virtual core with 2 ECUs), 64-bit platform, Moderate I/O Capacity
Large DB Instance: 7.5 GB memory, 4 ECUs (2 virtual cores with 2 ECUs each), 64-bit platform, High I/O Capacity
Extra Large DB Instance: 15 GB memory, 8 ECUs (4 virtual cores with 2 ECUs each), 64-bit platform, High I/O Capacity
High-Memory Extra Large DB Instance: 17.1 GB memory, 6.5 ECUs (2 virtual cores with 3.25 ECUs each), 64-bit platform, High I/O Capacity
High-Memory Double Extra Large DB Instance: 34 GB memory, 13 ECUs (4 virtual cores with 3.25 ECUs each), 64-bit platform, High I/O Capacity
High-Memory Quadruple Extra Large DB Instance: 68 GB memory, 26 ECUs (8 virtual cores with 3.25 ECUs each), 64-bit platform, High I/O Capacity

Embedded features

EC2 implies that we do the installation, setup and maintenance of our system and database software by ourselves, while for RDS Amazon provides a few "features" that it is important to keep in mind for later discussion.

 

The most relevant are:

Pre-configured Parameters – Amazon RDS DB Instances are pre-configured with a sensible set of parameters and settings appropriate for the DB Instance class we select.

Monitoring and Metrics – Amazon RDS provides Amazon CloudWatch metrics for your DB Instance deployments at no additional charge. You can use the AWS Management Console to view key operational metrics for your DB Instance deployments, including compute/memory/storage capacity utilization, I/O activity, and DB Instance connections.

Automated Backups – Turned on by default, the automated backup feature of Amazon RDS enables point-in-time recovery for your DB Instance.

DB Snapshots – DB Snapshots are user-initiated backups of your DB Instance.

Multi-Availability Zone (Multi-AZ) Deployments – Amazon RDS Multi-AZ deployments provide enhanced availability and durability for Database (DB) Instances, making them a natural fit for production database workloads.

When we provision a Multi-AZ DB Instance, Amazon RDS automatically creates a primary DB instance and synchronously replicates the data to a standby instance in a different Availability Zone (AZ).

Each AZ runs on its own physically distinct, independent infrastructure, in case of an infrastructure failure (for example, instance crash, storage failure, or network disruption), Amazon RDS performs an automatic failover to the standby so that you can resume database operations as soon as the failover is complete.

Read Replicas – This replication feature makes it easy to elastically scale out beyond the capacity constraints of a single DB Instance for read-heavy database workloads. Amazon RDS uses MySQL’s native replication to propagate changes made to a source DB Instance to any associated Read Replicas.

 

Storage

As shown before, on RDS we cannot do much regarding the storage; we can just choose between different instance classes, and whether we want provisioned IOPS.

On EC2 we obviously have both, and we can also choose how to define and use our storage solution.

 

MySQL configuration

Amazon presents the pre-configured parameters to us as a cool "feature", but this is just one side of the coin. The other side is that we cannot really adjust some of the critical parameters for MySQL, or their values are not as defined in standard MySQL.

The parameters in question are:

binlog_format                  | STATEMENT
expire_logs_days               | 0
innodb_buffer_pool_size        | 3921674240 (calculated)
innodb_doublewrite             | ON
innodb_file_format_max         | Antelope
innodb_locks_unsafe_for_binlog | OFF
innodb_log_file_size           | 134217728
innodb_log_files_in_group      | 2
innodb_log_group_home_dir      | /rdsdbdata/log/innodb
innodb_open_files              | 300 (max 300)
max_binlog_size                | 134217728
max_join_size                  | 4294967295 (max 4294967295)
open_files_limit               | 65535

The most concerning are binlog_format and the InnoDB log related ones.

 

Multi-AZ implementation

From the architectural point of view, I do not have a clear picture of HOW Multi-AZ is implemented, and I am really interested in discovering how Amazon has achieved the declared synchronous replication.

I am just guessing here, but some base replication using DRBD Primary/Secondary seems the most probable. What could be concerning here is the protocol level used for such replication, and the level of block transmission acknowledgement, given that a full protocol C would probably be too expensive, even if it is the ONLY really safe one in the case of DRBD usage. But since it is not clear to me whether the solution really uses it, let me just say it would be good to have better insight.

 

Standard replication

We cannot use standard replication in RDS; we need to rely on Read Replicas, or not use replication at all. The only alternative is to use an external solution like Continuent Tungsten (http://scale-out-blog.blogspot.ca/2013/01/replicating-from-mysql-to-amazon-rds.html).

It is important to note that RDS replication between the master and a Read Replica uses the STATEMENT binlog format and cannot be changed; as a direct consequence we have inefficient replication between master and replicas for all non-deterministic statements, and in the case of transactions mixing storage engines.
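A quick way to verify what an RDS instance actually hands you (the same statements work on any MySQL server):

SHOW GLOBAL VARIABLES LIKE 'binlog_format';
SHOW GLOBAL VARIABLES LIKE 'innodb_log%';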

 

Tests done

The tests performed were not too intensive; I was mainly focused on identifying what the safe limit of usage/load for RDS would be in comparison to a properly set up EC2 instance.

As such I chose to use the Large instance class for both EC2 and RDS, with 2 virtual CPUs, 7.5GB of virtual RAM, and High I/O capacity for RDS.

For EC2 the only difference is that I performed the tests using a single EBS volume for the data directory in one case, and a RAID5 of 4 EBS volumes in the other.

Also, regarding the MySQL configuration, I "standardized" the configuration of the different instances using the same parameters.

The only difference was that I was not using SSL on the MySQL EC2 instance, while it cannot be turned off on RDS because Amazon security relies on it.

The test used a variable number of concurrent threads:

Writing on 5 main tables and on the related child tables.

Reading on the main tables, joining 4 tables to the main one, filtering the results by an IN clause in one test and by a RANGE in another.

 

The structures of the tables are the following:

 

mysql> DESCRIBE tbtest1; 
 
 FIELD         | Type         | NULL | KEY | DEFAULT           | Extra 
 autoInc       | bigint(11)   | NO   | PRI | NULL              | AUTO_INCREMENT 
 a             | int(11)      | NO   | MUL | NULL              | 
 uuid          | char(36)     | NO   | MUL | NULL              |  
 b             | varchar(100) | NO   |     | NULL              | 
 c             | char(200)    | NO   |     | NULL              | 
 counter       | bigint(20)   | YES  |     | NULL              | 
 time          | timestamp    | NO   |     | CURRENT_TIMESTAMP | ON UPDATE 
 partitionid   | int(11)      | NO   |     | 0                 | 
 strrecordtype | char(3)      | YES  |     | NULL              | 
 

 

 

 

 

 
 mysql> DESCRIBE tbtest_child1; 
 
 FIELD        | Type         | NULL | KEY | DEFAULT           | Extra | 
 a            | int(11)      | NO   | PRI | NULL              | 
 bb           | int(11)      | NO   | PRI | NULL              | AUTO_INCREMENT 
 partitionid  | int(11)      | NO   |     | 0                 | 
 stroperation | varchar(254) | YES  |     | NULL              | 
 time         | timestamp    | NO   |     | CURRENT_TIMESTAMP | ON UPDATE

 

 

The filling factor for each table after the initial write was:

Table    tbtest1    total    1046    least    745    bytes per char: 3
Table    tbtest2    total    1046    least    745    bytes per char: 3
Table    tbtest3    total    1046    least    745    bytes per char: 3
Table    tbtest4    total    1046    least    745    bytes per char: 3
Table    tbtest5    total    1046    least    745    bytes per char: 3

Table    tbtest_child1    total    779    least    648    bytes per char: 3
Table    tbtest_child2    total    779    least    648    bytes per char: 3
Table    tbtest_child3    total    779    least    648    bytes per char: 3
Table    tbtest_child4    total    779    least    648    bytes per char: 3

 

Finally, the total size of the data set was 20GB.

 

The inserts used a batch approach of 50 rows per INSERT command on all the platforms.
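To give an idea, a batched statement against tbtest1 (as described above) looks roughly like this; the values are purely illustrative.

INSERT INTO tbtest1 (a, uuid, b, c, counter, partitionid, strrecordtype)
VALUES
  (1, UUID(), 'batch row 1', RPAD('x', 200, 'x'), 0, 1, 'ins'),
  (2, UUID(), 'batch row 2', RPAD('x', 200, 'x'), 0, 1, 'ins'),
  (3, UUID(), 'batch row 3', RPAD('x', 200, 'x'), 0, 1, 'ins');
  -- ...and so on, 50 value lists per INSERT in the actual test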

Below is a summary of the tests run:

oltp 5 + 4 table write 4 -> 32

oltp 5 + 4 table read (IN) 4 -> 32

oltp 5 + 4 table read (RANGE) 4 -> 32

oltp 5 + 4 table write/read(IN) 4 -> 32


Results

Results for writes using 4 to 32 concurrent threads

 

Write Execution time

(High value is bad)

[Graph: write execution time]

As the graph clearly shows, the behaviour of RDS and of EC2 with a single EBS volume is quite similar; while the EC2 instance running a RAID of EBS volumes maintains good response times and scales on writes, the other two have a collapse point at 16 threads, after which performance becomes seriously affected.

 

Rows inserted

(High values is good)

[Graph: rows inserted]

 

Consistently, the number of rows inserted in a defined period of time again shows the EC2 with RAID5 performing optimally in relation to the other two.

During this test the performance loss starts at 8 threads for the EC2 solutions, while for the RDS solution the performance degradation shows immediately as concurrency increases.

 

 

Select Execution time with IN

(High value is bad)

[Graph: SELECT execution time with IN]

 

Using the select with IN, given the high efficiency of the IN approach and the reduced number of reads that need to be executed on disk, all the instances maintain a good level of performance.

 

Rows read with IN

(High value is good)

[Graph: rows read with IN]

 

In this case all the instances perform consistently, but the EC2 with RAID solution can serve a larger amount of requests, almost 1/3 more than RDS.

 

 

Select Execution time with RANGE

(High value is bad)

[Graph: SELECT execution time with RANGE]

 

In the case of range selects and heavy disk access, RDS and EC2 with a single EBS are absolutely not able to perform at the same level as the RAID solution. This is quite obviously related to the amount of data that needs to be read from disk, and to the limitations existing in the RDS and single-EBS solutions.

 

 

Rows read with RANGE

(High value is good)

[Graph: rows read with RANGE]


The volume test confirms and highlights the different behaviour between EC2 RAID and the others; at 32 concurrent threads the RDS solution tends to collapse, while the EC2 RAID serves the traffic successfully, even if with less efficiency.


Execution time with a mix of SELECT and INSERT

(High value is bad)

[Graph: execution time with a mix of SELECT and INSERT]

 

Rows read with a mix of SELECT and INSERT

(High value is good)

[Graph: rows read with a mix of SELECT and INSERT]

 

In a mixed workload I had unexpected results, with EC2 with a single EBS behaving very badly when working with more than 16 threads, given the I/O contention and the possible RDS optimizations implemented by Amazon to prevent single-EBS problems.

Apart from that, RDS and EC2 with RAID behaved as I was expecting, with EC2 able to manage a larger volume of traffic, and with the inserts limiting the number of reads, as expected.

 

 

Conclusions

The comparison between RDS and EC2 covers several areas, from performance to high availability.

 

My conviction is that RDS does not implement a solid and trustworthy HA solution, given the unclear way synchronous replication is implemented.

RDS does not apply replication best practices, given the use of the STATEMENT binlog format and the limitations in replication management.

Finally, RDS is not really efficient at managing large volumes of traffic, or applications with a large number of highly concurrent threads.

 

Nevertheless, it could be a temporary solution for very basic usage in applications that do not have demanding requirements.

RDS can probably be optimized further, but I am sure it will never be enough to consider RDS production ready.

 

EC2 is more flexible and allows better tuning and control of the platform, multiple HA solutions, and full control of replication and of MySQL in general. All of this defines the significant difference from RDS and draws the line for the right use of each tool.

Also, it makes no difference which MySQL distribution we implement, given that the source of the issue is the platform itself.

 

My final advice is to use RDS for development or as a temporary solution in a start-up, but it should not be used for critical systems or consolidated mature applications that require highly available, scalable database support.

 

Reference

http://aws.amazon.com/rds/

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/RDSFAQ.MultiAZ.html

http://scale-out-blog.blogspot.ca/2013/01/replicating-from-mysql-to-amazon-rds.html

http://www.mysqlperformanceblog.com/2011/08/04/mysql-performance-on-ec2ebs-versus-rds/

 

 


MySQL and drop packet

Details
Marco Tusa
MySQL
08 February 2013

 

Overview

Last night a customer called us because they were having issues on the server and a loss of performance in MySQL.
When I joined the bridge I asked the customer for a quick report of their experience and concerns so far.

 

Luckily all the participants were technically skilled, and the team was composed of SAs, DBAs and a team leader, so I was able to get a good overview in a short time.
Mainly there were two fronts: one was a problem at the network layer of the server, the other was MySQL not being able to efficiently manage the number of incoming connection requests.

The machine had a single NIC, storage attached by fibre channel, 8 CPUs with hyper-threading, 64GB of RAM and, finally, heavy usage of NFS.

In the past the same server was also using the NFS volumes for MySQL, but everything had since been moved to the attached storage.

 

As said, the issue was that the NIC was reporting dropped packets and MySQL was struggling to manage the number of threads; the incoming connection requests ranged from 200 to 1000.
As active threads, the server was managing 200-300, which was not enough.

 

I started reviewing the server and NIC issue; talking with the SAs, they reported that the NIC receive buffer was already set to its maximum of 4096k.
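
A quick way to double-check this kind of setting on a Linux box, assuming ethtool is available and that the interface is eth0 (the device name here is an assumption), is to look at the NIC ring buffers:

------------------------------
# Show the current and maximum RX/TX ring sizes for the interface
ethtool -g eth0

# If the hardware allows it, raise the RX ring (4096 is only an example value)
ethtool -G eth0 rx 4096
------------------------------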

 

So, starting the investigation from there, I reviewed back_log, net.ipv4.tcp_max_syn_backlog, and all the other parameters related to the TCP buffers:
CURRENT TCP buffer setting 
------------------------------
net.ipv4.tcp_mtu_probing = 0
net.core.rmem_max = 131071
net.core.wmem_max = 131071
net.ipv4.tcp_rmem = 4096    87380    4194304
net.ipv4.tcp_wmem = 4096    16384    4194304
------------------------------
 

 

The settings were misconfigured, given that the tcp values cannot override the core values.

As such, the max limits set for the TCP auto-tuning were effectively invalid.
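
As a simple sketch of the check, on any recent Linux kernel you can just print both sets of values and compare the max (third) field of the tcp triplets against the core limits:

------------------------------
# Core socket buffer hard limits
sysctl net.core.rmem_max net.core.wmem_max

# TCP auto-tuning triplets: min, default, max
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
------------------------------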

 

Given that those values were not correct for a machine supporting high traffic, I suggested:
Suggested TCP buffer settings
------------------------------
#TCP max buffer size
net.core.rmem_max = 16777216 
net.core.wmem_max = 16777216 
 
#Linux auto-tuning TCP buffer 
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
 
#length of the processor input queue
net.core.netdev_max_backlog = 30000
 
#set the congestion control algorithm to htcp
net.ipv4.tcp_congestion_control=htcp
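
One way to apply and persist the suggested values, as a sketch assuming the classic /etc/sysctl.conf layout (a file under /etc/sysctl.d/ works too), is:

------------------------------
# Append the suggested settings to /etc/sysctl.conf
cat >> /etc/sysctl.conf <<'EOF'
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_congestion_control = htcp
EOF

# Reload the file at runtime
sysctl -p

# Verify that htcp is actually available on this kernel
sysctl net.ipv4.tcp_available_congestion_control
------------------------------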
 

 

About htcp, see the referenced document explaining the algorithm in detail.
From the MySQL point of view, I reviewed a few parameters that have a direct relation with the threads:
MySQL change
-----------------------------------------------
back_log                        = 1024
thread_cache_size               = 512
thread_stack                    = 256K
wait_timeout                    = 600
 

 

I decided to set back_log to the maximum queue length we had seen, and to move thread_cache_size from 8 to 1/4 of the maximum number of connections;
then a few corrections, given that wait_timeout was set to the default and thread_stack was sized as for 32-bit machines.
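
To verify the effect of these changes, a minimal check (credentials are placeholders) is to look at the variables and the thread counters; note that back_log is not a dynamic variable, so its new value is visible only after a restart:

------------------------------
# Current values of the variables we touched
mysql -u root -p -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN
    ('back_log','thread_cache_size','thread_stack','wait_timeout');"

# Thread counters: Threads_cached, Threads_created, Threads_connected, Threads_running
mysql -u root -p -e "SHOW GLOBAL STATUS LIKE 'Threads_%';"
------------------------------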

 

When we applied the values I was expecting the dropped-packet issue to be solved; instead, MySQL was managing the incoming connections much better, but the dropped packets were still there.
The bad network behaviour was also preventing data from flowing as it was supposed to.

 

We then focused on why this was happening, reviewing all the changes applied.

 

After a few investigations the customer realized that the value for the receive window on the NIC had not really been applied:
even if it is declared as a dynamic value by Cisco, de facto it requires a reboot of the machine.

 

We rebooted it, and the NIC was now working properly; data was flowing fine with no dropped packets.
MySQL was managing the incoming threads efficiently, but I noticed, after the server warm-up, that performance was still not optimal.

 

So, doing some further tuning, I set thread_cache_size to 1024, pairing it with the back_log number.
At this point MySQL was managing the incoming new requests very efficiently: Threads_cached jumped to 850, with the obvious fluctuation between 8 threads and the maximum;
Threads_created did the same and then stayed just a bit above the maximum number of created connections; and finally Threads_running jumped from 200-300 to 600-700.

 

Conclusion

Dropped packets are a sign of insufficient buffers, either on the NIC or at the TCP level, so remember to review the related parameters.
"Surprisingly", even if the thread creation process has always been declared "easy" and "light", more aggressive thread settings make MySQL behave more properly.

Reference

http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html
http://www.hamilton.ie/net/htcp3.pdf


Some thought on our work, from a different perspective

Details
Marco Tusa
MySQL
22 January 2013


Introduction

Recently I had some spare time that I used to read, think, process and analyze a few ideas, and to work on my own projects.

That was great for me because I had the chance to develop new tools and to review a few concepts related to work. I was able to focus on the ideas behind the procedures or "how-tos", including reviewing what I do at work from different angles and perspectives.

One of the different, or better to say modified, perspectives was the outcome of a mental process that started with a reading.

A reading that I initially considered a waste of money, time and mental effort.

This is because the topic discussed, and the way it was presented, is something I had the chance to study at school, starting from secondary school. In fact, the topic of "Critical thinking" is, or I should say was, included in the school programs, in the learning path associated with "Logic", "Grammar" and "Philosophy".

So when I read the book I committed the crime of presumption, also feeling bored while reading, until the moment the book covered the topic of "Decision Making". In that chapter the writer underlined how easy it is for us to be trapped by our own knowledge and ideas.

I stopped reading, closed the book, went to do something else, and then tried to empty my mind. Only at that point did I realize I had not really been reading at all; or better, my eyes were, some part of my brain was, but my mind and my attention were not, because I had categorized the book in the wrong way from its initial chapters.

So I took a glass of wine, some time and good music, and opened the book again from the start. This time the book presented me with a different scenario and perspective; it drove my thoughts through several mental paths and brought me to review some assumptions. At this point I was able to draw some parallels with our (and my) day-to-day activity in life and work. It was funny for me to discover how some personal best practices fit perfectly into a well-categorized universal model.

There was no magic, that is true, but what was good and interesting was how "school training" can forge your way of doing things instinctively, and also how that instinctive path of action can be transformed and expressed in a few clear, universal and simple-to-read steps.

In particular I saw a good parallel with two critical areas of our work: the credibility of the source and the decision-making process.

The rest of this writing is a summary, a walk through a few points I have identified as relevant and that I have seen covering some critical grey areas.

I am aware this is just a small part of the picture; as usual I am open to discussion and comments, and I will be more than happy if that happens, since it will actually mean I have reached my target.

 

Credibility of the Source

In our work, as in many others, having good credibility is not a plus but a must. Being credible, as a company or as an individual, does not come for free; it is a process that can take years to build and days to destroy.

Credibility is not only the result of "best practices" or "how-tos", but also of a correct approach and process in what we do, how we do it, and how we decide to do things (see the decision-making section below).

Whenever a customer comes to us for advice or help, he will ask himself some questions, questions that we should answer, or the customer will redirect his attention to someone else.

Those questions are:

  • Do they (we) have the relevant expertise (experience, knowledge and, if needed, formal qualifications)?
  • Do they have the ability to observe accurately (eyesight, hearing, proximity to the event, absence of distraction, appropriate instruments, skill in using the instruments)?
  • Does their reputation suggest that they are reliable?
  • Do they have any interest or possible bias?
  • Are they claiming, and providing evidence of, knowledge about MY context?
  • Are they providing direct expertise?
  • Is their level of expertise based on direct experience?
  • Is what they say supported by evidence and a logical pattern?
  • Are other sources consistent?

Answering all the above, as said, is not something you can achieve with limited or superficial effort; it instead requires an extensive and constant shift in mentality, and some well-defined ideas and behaviours. My interpretation is the following:

  • Always be "super partes"; also avoid as much as possible following ephemeral trends, like the use and abuse of the "magic" term of the year, often used by others to show their capacity to be on the "trend". Unfortunately, being there very often means doing without knowing. Being more conservative and analytical is the right thing to do when you are responsible for other people.
  • Be under constant training and education, perform extensive tests, and provide public evidence of your conclusions and analysis. Publish few but focused blogs.
  • Avoid blogging about everything, and avoid generalization; that will only create more noise and confusion. Yes, you will be there, but as a chatterer, not as an expert.
  • When claiming something, provide evidence and a well-documented reasoning path to support your claim.
  • Always put the claim in a clearly defined context and, if possible and available, include references to others' reasoning and/or similar evidence and sources.
  • Whenever possible try to be, or to use, a direct source: for example provide the tests you have done yourself, or review and repeat the tests done by others to validate them.
  • Never use other sources' material as your own; instead document and contextualize it, providing credit to the source. Again, double-check other sources' conclusions and provide evidence of your process.
  • Whatever evidence or conclusion you provide, it needs to match exactly the topic discussed; avoid generalization. Assumptions can be good only if supported by sound and documented reasoning.
  • Do not rush; this is not a race. Do not send out an answer or a comment without having had the time to think about it. If possible, review it several times, and go over your reasoning with others, to be sure you have covered all the possible areas of uncertainty; and if you still see some, declare them.

I will be more than happy to discuss the above points and, if possible, to extend them with more helpful suggestions.

 

Decision-making process

As mentioned previously, the other point is related to how we take our own decisions and how we evaluate other people's conclusions/reasoning/motivations.

In our work we are constantly called to take decisions. Some of them are very simple, and we can act with very limited thinking, but others can be much more complex and require significant effort on our side, more time and processing, to evaluate efficiently what the right decision will be.

Unfortunately, very often we are affected by at least one of the following bad behaviours:

  • We do not give ourselves enough time to think.
  • We see a possible fit in a thought and we stay there, not giving ourselves the space to evolve.
  • We do not process all the possible alternatives to the problem, and we do not develop more than one solution.
  • We do not evolve our solution/action into a clear path of possible consequences.
  • We often skip gathering more information because of the "time" issue (or others) and miss relevant parts of the puzzle.
  • We sometimes forget what is relevant for us and how much this can impact our judgment.
  • We are emotionally involved, and it affects the process and the decisions.
  • We just do what our boss says to do.
  • Other people's recommendations influence us without us applying analytical thinking.

Let us go through the above points, trying to clarify them and to see what we can do to prevent them.

  • Time: time matters, and often we have to take decisions fast, but thinking requires time; time to gather information, time to analyze it, time for the reasoning. The process should not be compromised by our rush, because the results will be affected and our decisions may be imprecise (if we are lucky) or completely wrong. Not only can we take a wrong decision, but when that happens because of rushing we have no solid reasoning to support and justify the mistake; in short there will be no lesson learned, only the mistake.
  • How often have we fallen in love with our ideas, and not been ready to divorce from them? Too often, we must admit; instead we should be able to go beyond them and process all the possible options. We should keep our mind open and listen to other, external suggestions, but always apply an analytical process.
  • When I was a kid I learned that "each action includes/implies a reaction". Before performing any action, before applying what we think is correct in our decision, we should carefully think: "What will happen next?" We should analyze the actions, have a good understanding of the path of events our actions will generate, and be ready for possible unexpected bad behaviours.
  • In our job, information about what is going on is everything. We should never stop digging for more and getting a better understanding. Never consider the outcome of some tool or script enough for our analysis, taking its results as given without an analytical review. We should stop only when we are really confident that we cannot get more relevant information, and if possible we should ask a trusted source to compare what we got, to see if we have missed anything.
  • Sometimes we forget that we have personal commitments, and that those can affect our judgment. For example, if we are fully focused on open source, it can become almost automatic for us to skip the evaluation of a non-open-source solution. Or if we are Linux fundamentalists, merely having to approach a Windows server could drive us to a non-objective approach to the problem. Again, we must keep our mind open and process the problem in analytical steps, not letting preconceptions into our thinking, but filtering them out to keep the mental process objective.
  • How many times have we found a particular customer annoying? His reiterated questions lacked any sense, and wasn't his behaviour at times close to offensive? On the other hand, this other customer is really nice, he gives you a lot of credit, he understands the effort you are making to keep his environment in good shape. Can you honestly say you have always given the two the same "time and attention"? This is a fact; it is in human nature to be more careful and kind with the ones who are kind to us. But it is not correct: we should always apply the same time/effort/reasoning independently of the customer's behaviour. The reasoning is the point, not our feelings; understanding this, and being able to manage it, is a matter of being more or less professional.
  • Do not follow the boss's or others' advice and directions blindly. We must listen carefully to everyone; we should evaluate what they have to say and objectively extract whatever is good from their suggestions or recommendations. But never accept it without our own thinking/reasoning; it is also appropriate to share our process with them step by step, before getting to the conclusion. This will help us learn from each other's work and will benefit everyone, also reducing the chance of mistakes.

Summarizing, we should ask ourselves the following before, during and after our reasoning for a decision:

  1. What makes this decision necessary? What is the objective?
  2. What am I going to recommend, and on what basis?
  3. What other possible alternatives exist? Which one is the more realistically feasible, and which one the more innovative?
  4. What are the possible consequences of my decision, and how likely are they to happen?
  5. If these consequences happen, how relevant will they be and how can we manage them?
  6. Comparing the different solutions, which one will best mitigate the negative effects?
  7. How can I transform my decision into action while reducing to a minimum the risk of bad behaviour or mistakes?

Conclusion

In the above sections I was just trying to report, in a concise and easy way, part of a more complex topic. I am aware that most of us already do the right thing, but I am also confident that putting these simple points down in black and white can help us avoid mistakes and, if possible, define processes and checklists that people less conscientious than us can follow to make their work behaviour more trustworthy.

 

Reference

Ennis, R.H., Critical Thinking, Prentice Hall, 1996

Fisher, A., The Logic of Real Arguments, Cambridge University Press, 1988

Fisher, A., Critical Thinking: An Introduction, Cambridge University Press, 2001

