Status update on how MySQL handles partition maintenance

Marco Tusa
MySQL
04 June 2011

Overview

 

We know that being able to split a table into smaller chunks helps.
It helps search performance, inserts, index handling and data management as well.
All fine, we are all happy, and very often we push our customers to go for partitioning in order to solve part of their problems.
What happens quite often is that customers (and sometimes MySQL DBAs as well) see partitions as separate tables.
Unfortunately it is not like that: there are some operations that are (still?) producing unexpected results.
One of these is the OPTIMIZE of a partition.


The case

Some time ago, almost 16 months back, I was at a customer site and we were working with the MySQL 5.5 release candidate.
We were really happy with the new version of MySQL and, working for MySQL/Sun, I was quite proud of it as well.
Then we ran ALTER TABLE X OPTIMIZE PARTITION Y;

Suddenly we realized something was not working as expected, because all partitions had temporary tables.
We knew about and were expecting the LOCK while doing the exercise, but this was not expected at all.

I had attended several internal presentations and sat in many internal calls,
all stating that the operation should affect only one partition, not all of them.
It was not nice, and it turned out to be a bug (42822) marked non-critical, referring to another one (46158),
which was closed as a duplicate of the first.

The funny thing, at the end, is ... we still have it in 5.5.12.

There is a workaround as usual, but CUSTOMERS need to be aware of it and the manuals modified accordingly;
unfortunately this is not the case.

So after all this time it could sound silly, but let us talk about how OPTIMIZE PARTITION does not work and what to do instead.

 

What I have done to replicate the error

I set up a simple test instance with two tables, and used my stresstool (Java) to fill the tables and play with the inserting threads.

Here are the tables:

DATABASE changed
root@localhost [test]> SHOW TABLES;
+----------------+
| Tables_in_test |
+----------------+
| tbtest1 |
| tbtest2 |
+----------------+
4 rows IN SET (0.00 sec)
 
root@localhost [test]> DESCRIBE tbtest1;
+---------------+--------------+------+-----+-------------------+-----------------------------+
| FIELD | Type | NULL | KEY | DEFAULT | Extra |
+---------------+--------------+------+-----+-------------------+-----------------------------+
| a | int(11) | NO | MUL | NULL | |
| uuid | char(36) | NO | PRI | NULL | |
| b | varchar(100) | NO | | NULL | |
| c | char(200) | NO | | NULL | |
| counter | bigint(20) | YES | | NULL | |
| time | timestamp | NO | | CURRENT_TIMESTAMP | ON UPDATE CURRENT_TIMESTAMP |
| partitonid | int(11) | NO | | 0 | |
| strrecordtype | char(3) | YES | | NULL | |
+---------------+--------------+------+-----+-------------------+-----------------------------+
 

 

What the data looks like:

root@localhost [test]> SELECT * FROM tbtest1 LIMIT 10;
+------------+---------------+----------------+---------------+-------------+---------------------+------------+---------------+
| a | uuid | b | c | counter | time | partitonid | strrecordtype |
+------------+---------------+----------------+---------------+-------------+---------------------+------------+---------------+
| 793902056 | 00ea7db2babeb | ac hpzenqafpml | ac hpzr
| 366195670 | 07c620fdfab8d | umlfzae zdzbes | umlfzapgnhlvb | 3359819254 | 2011-06-03 12:27:26 | 18 | fai |
| 19501151 | 08271816a1256 | odpp gporogqmg | odpp glpgfgnf | 3010735465 | 2011-06-03 12:27:26 | 5 | odo |
| 1925082207 | 0bf7d042da285 | toammdmqgttlto | toammdeioidzs | 18048741375 | 2011-06-03 12:27:26 | 17 | gvm |
| 631797159 | 0c9d4857f2f27 | lisn vqclglnmm | lisn vggpirza | 40556949738 | 2011-06-03 12:27:26 | 18 | ldc |
+------------+---------------+----------------+---------------+-------------+---------------------+------------+---------------+
10 rows IN SET (0.01 sec
 

 

 

What the table looks like before partitioning:

 
root@localhost [test]>
CREATE TABLE `tbtest1` (
`a` int(11) NOT NULL,
`uuid` char(36) COLLATE utf8_unicode_ci NOT NULL,
`b` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
`c` char(200) COLLATE utf8_unicode_ci NOT NULL,
`counter` bigint(20) DEFAULT NULL,
`time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`partitonid` int(11) NOT NULL DEFAULT '0',
`strrecordtype` char(3) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`uuid`),
KEY `IDX_a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
1 row IN SET (0.02 sec)
 

 

Then add the partition definition:

ALTER TABLE tbtest1 DROP PRIMARY KEY, ADD PRIMARY KEY (`uuid`,`partitonid`)
PARTITION BY RANGE (partitonid) (
PARTITION p0 VALUES LESS THAN (2),
PARTITION p1 VALUES LESS THAN (4),
PARTITION p2 VALUES LESS THAN (6),
PARTITION p3 VALUES LESS THAN (8),
PARTITION p4 VALUES LESS THAN (10),
PARTITION p5 VALUES LESS THAN (12),
PARTITION p6 VALUES LESS THAN (14),
PARTITION p7 VALUES LESS THAN (16),
PARTITION p8 VALUES LESS THAN (18),
PARTITION p9 VALUES LESS THAN (20),
PARTITION p10 VALUES LESS THAN MAXVALUE
);
 
SELECT TABLE_SCHEMA,TABLE_NAME,partition_name,TABLE_ROWS AS 'ROWS',
TRUNCATE(DATA_LENGTH/pow(1024,2),2) AS 'DATA (M)',
TRUNCATE(INDEX_LENGTH/pow(1024,2),2) AS 'INDEX (M)',
TRUNCATE((DATA_LENGTH+INDEX_LENGTH)/pow(1024,2),2) AS 'TOTAL(M)'
FROM information_schema.partitions
WHERE TABLE_SCHEMA <> 'information_schema'
AND TABLE_SCHEMA <> 'mysql'
AND TABLE_NAME = 'tbtest1'
ORDER BY `TOTAL(M)` DESC,`DATA (M)` DESC;
+--------------+------------+----------------+------+----------+-----------+----------+
| TABLE_SCHEMA | TABLE_NAME | partition_name | ROWS | DATA (M) | INDEX (M) | TOTAL(M) |
+--------------+------------+----------------+------+----------+-----------+----------+
| test | tbtest1 | p0 | 244 | 0.12 | 0.01 | 0.14 |
| test | tbtest1 | p8 | 227 | 0.12 | 0.01 | 0.14 |
| test | tbtest1 | p7 | 211 | 0.12 | 0.01 | 0.14 |
| test | tbtest1 | p5 | 187 | 0.12 | 0.01 | 0.14 |
| test | tbtest1 | p4 | 227 | 0.12 | 0.01 | 0.14 |
| test | tbtest1 | p3 | 212 | 0.12 | 0.01 | 0.14 |
| test | tbtest1 | p1 | 204 | 0.12 | 0.01 | 0.14 |
| test | tbtest1 | p6 | 175 | 0.10 | 0.01 | 0.12 |
| test | tbtest1 | p2 | 235 | 0.10 | 0.01 | 0.12 |
| test | tbtest1 | p9 | 229 | 0.10 | 0.01 | 0.12 |
| test | tbtest1 | p10 | 0 | 0.01 | 0.01 | 0.03 |
+--------------+------------+----------------+------+----------+-----------+----------+

 

 

Then fill it with some data with my stresstool, keeping partition p10 empty by loading values only up to partitonid <= 20,

and finally fill p10 with an INSERT:

INSERT INTO tbtest1 (a, uuid,b,c,counter,time,partitonid,strrecordtype) 
SELECT a, uuid,b,c,counter,time,30,strrecordtype FROM tbtest1 WHERE partitonid <=20;
 
+--------------+------------+----------------+--------+----------+-----------+----------+
| TABLE_SCHEMA | TABLE_NAME | partition_name | ROWS | DATA (M) | INDEX (M) | TOTAL(M) |
+--------------+------------+----------------+--------+----------+-----------+----------+
| test | tbtest1 | p10 | 85104 | 106.95 | 10.54 | 117.50 |
| test | tbtest1 | p5 | 95647 | 57.76 | 8.54 | 66.31 |
| test | tbtest1 | p4 | 99459 | 57.76 | 8.54 | 66.31 |
| test | tbtest1 | p3 | 92256 | 57.76 | 8.54 | 66.31 |
| test | tbtest1 | p2 | 77794 | 57.76 | 8.54 | 66.31 |
| test | tbtest1 | p9 | 79871 | 57.71 | 8.54 | 66.26 |
| test | tbtest1 | p7 | 100681 | 57.76 | 7.54 | 65.31 |
| test | tbtest1 | p1 | 102414 | 56.76 | 8.54 | 65.31 |
| test | tbtest1 | p6 | 92961 | 56.76 | 7.54 | 64.31 |
| test | tbtest1 | p8 | 93872 | 56.76 | 7.54 | 64.31 |
| test | tbtest1 | p0 | 80510 | 56.76 | 7.54 | 64.31 |
+--------------+------------+----------------+--------+----------+-----------+----------+
 
 
 
root@localhost [(none)]> SHOW processlist;
+------+------+---------------------------+------+---------+------+--------+-------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+------+------+---------------------------+------+---------+------+--------+-------------------------------------------------------------------+
| 144 | root | localhost | test | Sleep | 285 | | NULL |
| 195 | root | tusacentral03:56633 | test | Sleep | 55 | | NULL |
| 197 | root | tusacentral07.LOCAL:53368 | test | Sleep | 1 | | NULL |
| 199 | root | tusacentral07.LOCAL:53370 | test | Query | 1 | UPDATE | INSERT INTO test.tbtest1 (uuid,a,...becordtype)VALUES("2f...95-bc |
| 200 | root | tusacentral07.local:53371 | test | Query | 1 | update | insert INTO test.tbtest1 (uuid,a,...becordtype)VALUES("0f...d5-a9 |
| 201 | root | tusacentral07.LOCAL:53372 | test | Query | 1 | UPDATE | INSERT INTO test.tbtest1 (uuid,a,...becordtype)VALUES("ce...94-9d |
| 202 | root | tusacentral07.local:53373 | test | Query | 1 | update | insert INTO test.tbtest1 (uuid,a,...becordtype)VALUES("32...ec-98 |
| 203 | root | tusacentral07.LOCAL:53374 | test | Query | 1 | UPDATE | INSERT INTO test.tbtest1 (uuid,a,...becordtype)VALUES("c3...7d-86 |
| 204 | root | tusacentral07.local:53375 | test | Query | 1 | update | insert INTO test.tbtest1 (uuid,a,...becordtype)VALUES("63...02-ba |
| 1031 | root | localhost | NULL | Query | 0 | NULL | SHOW processlist |
+------+------+---------------------------+------+---------+------+--------+-------------------------------------------------------------------+
10 rows IN SET (0.00 sec)
 

 

Time to run the OPTIMIZE, and discover that nothing has changed in all this time.

 

mysql> ALTER TABLE tbtest1 OPTIMIZE  PARTITION p10;
 
root@localhost [(none)]> SHOW processlist;
+------+------+---------------------------+------+---------+------+---------------------------------+-------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+------+------+---------------------------+------+---------+------+---------------------------------+-------------------------------------------------------------+
| 144 | root | localhost | test | Sleep | 844 | | NULL |
| 195 | root | tusacentral03:56633 | test | Query | 92 | copy TO tmp TABLE | ALTER TABLE tbtest1 OPTIMIZE PARTITION p10 |
| 197 | root | tusacentral07.LOCAL:53368 | test | Sleep | 0 | | NULL |
| 199 | root | tusacentral07.LOCAL:53370 | test | Query | 90 | Waiting FOR TABLE metadata LOCK | INSERT INTO test.tbtest1 (uuid,a,b,c,counter,partit-4cdd-a8 |
| 200 | root | tusacentral07.LOCAL:53371 | test | Query | 90 | Waiting FOR TABLE metadata LOCK | INSERT INTO test.tbtest1 (uuid,a,b,c,counter,partit-427f-be |
| 201 | root | tusacentral07.LOCAL:53372 | test | Query | 90 | Waiting FOR TABLE metadata LOCK | INSERT INTO test.tbtest1 (uuid,a,b,c,counter,partit-4694-9d |
| 202 | root | tusacentral07.LOCAL:53373 | test | Query | 90 | Waiting FOR TABLE metadata LOCK | INSERT INTO test.tbtest1 (uuid,a,b,c,counter,partit-4a5a-a0 |
| 203 | root | tusacentral07.LOCAL:53374 | test | Query | 91 | Waiting FOR TABLE metadata LOCK | INSERT INTO test.tbtest1 (uuid,a,b,c,counter,partit-48de-b8 |
| 204 | root | tusacentral07.LOCAL:53375 | test | Query | 90 | Waiting FOR TABLE metadata LOCK | INSERT INTO test.tbtest1 (uuid,a,b,c,counter,partit-4a0f-9c |
| 1031 | root | localhost | NULL | Query | 0 | NULL | SHOW processlist |
+------+------+---------------------------+------+---------+------+---------------------------------+-------------------------------------------------------------+
10 rows IN SET (0.00 sec)

 

 

MySQL, when using InnoDB, still runs the process on all partitions and not only on the selected one.
This is probably not critical for the developer, but a customer running it on a huge table could strongly disagree.

Imagine running it against a 1.2TB table with each partition of 20-25GB.

It is not funny...

 
total 1031896
drwx------ 2 mysql mysql 4096 2011-06-03 13:50 ./
drwxr-xr-x 11 mysql mysql 4096 2011-06-03 13:21 ../
-rw-rw---- 1 mysql mysql 8578 2011-03-31 22:44 bbb.frm
-rw-rw---- 1 mysql mysql 163840 2011-03-31 22:44 bbb.MYD
-rw-rw---- 1 mysql mysql 1024 2011-03-31 22:44 bbb.MYI
-rw-rw---- 1 mysql mysql 8578 2011-03-31 22:34 ccc.frm
-rw-rw---- 1 mysql mysql 425984 2011-03-31 22:39 ccc.ibd
-rw-rw---- 1 mysql mysql 61 2011-03-04 09:47 db.opt
-rw-rw---- 1 mysql mysql 8788 2011-06-03 13:50 #sql-3205_c3.frm
-rw-rw---- 1 mysql mysql 64 2011-06-03 13:50 #sql-3205_c3.par
-rw-rw---- 1 mysql mysql 58720256 2011-06-03 13:51 #sql-3205_c3#P#p0.ibd
-rw-rw---- 1 mysql mysql 114688 2011-06-03 13:50 #sql-3205_c3#P#p10.ibd
-rw-rw---- 1 mysql mysql 54525952 2011-06-03 13:51 #sql-3205_c3#P#p1.ibd
-rw-rw---- 1 mysql mysql 114688 2011-06-03 13:50 #sql-3205_c3#P#p2.ibd
-rw-rw---- 1 mysql mysql 114688 2011-06-03 13:50 #sql-3205_c3#P#p3.ibd
-rw-rw---- 1 mysql mysql 114688 2011-06-03 13:50 #sql-3205_c3#P#p4.ibd
-rw-rw---- 1 mysql mysql 114688 2011-06-03 13:50 #sql-3205_c3#P#p5.ibd
-rw-rw---- 1 mysql mysql 114688 2011-06-03 13:50 #sql-3205_c3#P#p6.ibd
-rw-rw---- 1 mysql mysql 114688 2011-06-03 13:50 #sql-3205_c3#P#p7.ibd
-rw-rw---- 1 mysql mysql 114688 2011-06-03 13:50 #sql-3205_c3#P#p8.ibd
-rw-rw---- 1 mysql mysql 114688 2011-06-03 13:50 #sql-3205_c3#P#p9.ibd
-rw-rw---- 1 mysql mysql 8788 2011-06-03 13:47 tbtest1.frm
-rw-rw---- 1 mysql mysql 64 2011-06-03 13:47 tbtest1.par
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:51 tbtest1#P#p0.ibd
-rw-rw---- 1 mysql mysql 75497472 2011-06-03 13:49 tbtest1#P#p10.ibd
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:51 tbtest1#P#p1.ibd
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:51 tbtest1#P#p2.ibd
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:51 tbtest1#P#p3.ibd
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:51 tbtest1#P#p4.ibd
-rw-rw---- 1 mysql mysql 88080384 2011-06-03 13:51 tbtest1#P#p5.ibd
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:51 tbtest1#P#p6.ibd
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:51 tbtest1#P#p7.ibd
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:51 tbtest1#P#p8.ibd
-rw-rw---- 1 mysql mysql 96468992 2011-06-03 13:51 tbtest1#P#p9.ibd
-rw-rw---- 1 mysql mysql 8630 2011-06-03 12:36 tbtest2.frm
-rw-rw---- 1 mysql mysql 9437184 2011-06-03 13:51 tbtest2.ibd
root@tusacentral03:/home/mysql/instances/master/data/test#
 
 
 

 

So what to do?


Use REBUILD instead.

This drives me crazy: why could the mapping not be done inside the code?
If InnoDB, then OPTIMIZE = REBUILD ... and go ahead.

To me this does not sound user friendly at all.

 

mysql> ALTER TABLE tbtest1 REBUILD PARTITION p10;
Query OK, 126994 rows affected (16.50 sec)
Records: 126994 Duplicates: 0 Warnings: 0
 
 
root@localhost [(none)]> SHOW processlist;
+------+------+---------------------------+------+---------+------+---------------------------------+-----------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+------+------+---------------------------+------+---------+------+---------------------------------+-----------------------------------------------------------------+
| 144 | root | localhost | test | Sleep | 357 | | NULL |
| 195 | root | tusacentral03:56633 | test | Query | 3 | setup | ALTER TABLE tbtest1 REBUILD PARTITION p10 |
| 197 | root | tusacentral07.LOCAL:53368 | test | Sleep | 0 | | NULL |
| 199 | root | tusacentral07.LOCAL:53370 | test | Query | 2 | Waiting FOR TABLE metadata LOCK | INSERT INTO test.tbtest1 (uuid,a,b,c,counter,partit22da-4c7d-94 |
| 200 | root | tusacentral07.LOCAL:53371 | test | Query | 2 | Waiting FOR TABLE metadata LOCK | INSERT INTO test.tbtest1 (uuid,a,b,c,counter,partit8459-477b-94 |
| 201 | root | tusacentral07.LOCAL:53372 | test | Query | 2 | Waiting FOR TABLE metadata LOCK | INSERT INTO test.tbtest1 (uuid,a,b,c,counter,partit93b7-413f-a8 |
| 202 | root | tusacentral07.LOCAL:53373 | test | Query | 2 | Waiting FOR TABLE metadata LOCK | INSERT INTO test.tbtest1 (uuid,a,b,c,counter,partit6cd0-4e4b-85 |
| 203 | root | tusacentral07.LOCAL:53374 | test | Query | 2 | Waiting FOR TABLE metadata LOCK | INSERT INTO test.tbtest1 (uuid,a,b,c,counter,partit79c1-4da5-8c |
| 204 | root | tusacentral07.LOCAL:53375 | test | Query | 2 | Waiting FOR TABLE metadata LOCK | INSERT INTO test.tbtest1 (uuid,a,b,c,counter,partitff17-4be7-b1 |
| 1031 | root | localhost | NULL | Query | 0 | NULL | SHOW processlist |
+------+------+---------------------------+------+---------+------+---------------------------------+-----------------------------------------------------------------+
10 rows IN SET (0.00 sec)

 

Anyhow, checking the file system, this time everything works as expected and ONLY partition p10 was rebuilt.
Please take mental note of the STATE: it says "Waiting for table metadata lock", which sounds interesting.

drwx------  2 mysql mysql     4096 2011-06-03 13:47 ./
drwxr-xr-x 11 mysql mysql 4096 2011-06-03 13:21 ../
-rw-rw---- 1 mysql mysql 8578 2011-03-31 22:44 bbb.frm
-rw-rw---- 1 mysql mysql 163840 2011-03-31 22:44 bbb.MYD
-rw-rw---- 1 mysql mysql 1024 2011-03-31 22:44 bbb.MYI
-rw-rw---- 1 mysql mysql 8578 2011-03-31 22:34 ccc.frm
-rw-rw---- 1 mysql mysql 425984 2011-03-31 22:39 ccc.ibd
-rw-rw---- 1 mysql mysql 61 2011-03-04 09:47 db.opt
-rw-rw---- 1 mysql mysql 8788 2011-06-03 13:47 #sql-tbtest1.frm
-rw-rw---- 1 mysql mysql 64 2011-06-03 13:47 #sql-tbtest1.par
-rw-rw---- 1 mysql mysql 8788 2011-06-03 13:44 tbtest1.frm
-rw-rw---- 1 mysql mysql 64 2011-06-03 13:44 tbtest1.par
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:47 tbtest1#P#p0.ibd
-rw-rw---- 1 mysql mysql 75497472 2011-06-03 13:46 tbtest1#P#p10.ibd
-rw-rw---- 1 mysql mysql 606208 2011-06-03 13:47 tbtest1#P#p10#TMP#.ibd
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:47 tbtest1#P#p1.ibd
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:47 tbtest1#P#p2.ibd
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:47 tbtest1#P#p3.ibd
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:47 tbtest1#P#p4.ibd
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:47 tbtest1#P#p5.ibd
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:47 tbtest1#P#p6.ibd
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:47 tbtest1#P#p7.ibd
-rw-rw---- 1 mysql mysql 83886080 2011-06-03 13:47 tbtest1#P#p8.ibd
-rw-rw---- 1 mysql mysql 92274688 2011-06-03 13:47 tbtest1#P#p9.ibd
-rw-rw---- 1 mysql mysql 8630 2011-06-03 12:36 tbtest2.frm
-rw-rw---- 1 mysql mysql 9437184 2011-06-03 13:46 tbtest2.ibd
root@tusacentral03:/home/mysql/instances/master/data/test#
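
As a side note on the "Waiting for table metadata lock" state mentioned above, here is a minimal sketch of how the blocked sessions could be spotted from SQL instead of eyeballing SHOW PROCESSLIST; it only assumes the information_schema.PROCESSLIST table available in 5.1 and later.

SELECT id, user, db, time, state, info
FROM information_schema.PROCESSLIST
WHERE state LIKE '%metadata lock%';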

 

 

While there, and to avoid any doubt, I also checked what happens when CHECKing a partition.

Who knows, maybe there is some surprise there as well.

 

mysql> ALTER TABLE tbtest1 CHECK  PARTITION p10;
+--------------+-------+----------+----------+
| TABLE | Op | Msg_type | Msg_text |
+--------------+-------+----------+----------+
| test.tbtest1 | CHECK | STATUS | OK |
+--------------+-------+----------+----------+
1 row IN SET (36.64 sec)
 
 
root@localhost [(none)]> SHOW processlist;
+------+------+---------------------------+------+---------+------+--------+----------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+------+------+---------------------------+------+---------+------+--------+----------------------------------------------------------+
| 144 | root | localhost | test | Sleep | 1114 | | NULL |
| 195 | root | tusacentral03:56633 | test | Query | 15 | NULL | ALTER TABLE tbtest1 CHECK PARTITION p10 |
| 197 | root | tusacentral07.LOCAL:53368 | test | Sleep | 0 | | NULL |
| 199 | root | tusacentral07.LOCAL:53370 | test | Query | 0 | UPDATE | INSERT INTO test.tbtest1 (uuid,a,b,c,counte-39e0-4296-97 |
| 200 | root | tusacentral07.LOCAL:53371 | test | Query | 1 | UPDATE | INSERT INTO test.tbtest1 (uuid,a,b,c,counte-f508-4ec8-87 |
| 201 | root | tusacentral07.LOCAL:53372 | test | Query | 1 | UPDATE | INSERT INTO test.tbtest1 (uuid,a,b,c,counte-3a5f-4444-81 |
| 202 | root | tusacentral07.LOCAL:53373 | test | Query | 1 | UPDATE | INSERT INTO test.tbtest1 (uuid,a,b,c,counte-970e-48bb-a2 |
| 203 | root | tusacentral07.LOCAL:53374 | test | Query | 0 | UPDATE | INSERT INTO test.tbtest1 (uuid,a,b,c,counte-62ea-4c1f-81 |
| 204 | root | tusacentral07.LOCAL:53375 | test | Query | 0 | UPDATE | INSERT INTO test.tbtest1 (uuid,a,b,c,counte-6277-4dd5-89 |
| 1215 | root | localhost | NULL | Query | 0 | NULL | SHOW processlist |
+------+------+---------------------------+------+---------+------+--------+----------------------------------------------------------+
10 rows IN SET (0.00 sec)
 
 

 

But luckily no, it works fine, and there is no lock on the table.

 

Conclusions

1) We still have to deal with this non critical bug, and we must remember
 DO NOT USE: 

 ALTER TABLE <table_name> OPTIMIZE PARTITION A,B,C 

 BUT

 ALTER TABLE <table_name> REBUILD PARTITION A,B,C

 

It is not yet clear to me why a simple mapping between REBUILD and OPTIMIZE is not possible.

 
2) The level of lock is METADATA-related because of the ALTER command. Looking around for a better explanation, I have seen that there is not much material about it, but the presentation from Konstantin Osipov for sure helps to understand it better.
http://en.oreilly.com/mysql2011/public/schedule/detail/17340

 

 

 

My dream

The OPTIMIZE of a partition is a special case, where data is not modified but possibly physically re-organized.
Having the status of the lock declared as a METADATA lock brings me to think we could safely insert/modify all data not related to that partition.
What I am feeling from all this is the possibility of having this lock released in the near future; I see issues with the index merge and others, so some digging into the code will be required.

 

It would be nice ...to have it.

 


Reference

17.3.3. Maintenance of Partitions - http://dev.mysql.com/doc/refman/5.1/en/partitioning-maintenance.html
Metadata locking and deadlock detection in MySQL 5.5  - http://en.oreilly.com/mysql2011/public/schedule/detail/17340
optimize partition recreates full table instead just partition - http://bugs.mysql.com/bug.php?id=42822
Make partition maintenance commands more efficient - http://bugs.mysql.com/bug.php?id=46158

My afterthoughts on teams composed of multi-cultural, multi-language elements, in consulting and support.

Marco Tusa
MySQL
23 April 2011


This is not a technical article; it starts from my own thinking about the work done over many years in international environments and with people coming from all around the world. Many thoughts: about me, about the interaction with colleagues, about interactions with customers. There may be few dolphins here in terms of bits, but not in terms of spirit, human respect and the pursuit of being good.



My thinking and I

I am a lucky guy; I had the chance to travel a lot from when I was really young, so young that I cannot remember my first trip. I have made my own the natural fact that we are different and we have different approaches to human relations.

Once I became an adult I was even luckier, because I started to work in international environments, so I had the chance to learn many different "cultural" approaches from the source, on top of my travels.

What has always amazed me are the many subtle differences existing between people even when they live close to each other, and the very tight similarities even between people on opposite sides of the world.

There are so many that, anthropologically, it is very difficult to classify them all. But I am not an anthropologist, so I can make mistakes there, and in my mind I have built up a differentiation that, refined to the extreme, reduces the whole to two categories; not the good and the bad, or us and the others, given that we sometimes shift from one category to the other during our day.

But let us say that it is a common behaviour during our day to define whether "that" day we were here or there.

What I am talking about are the two different approaches that we can call the "one man hero" and the "family man".

 

The One man/woman Hero

He is who he is, no way to be better. He saves the world alone, and he is firmly convinced he is (almost) always right. Whatever comes out of his mouth is holy truth; what is in his mind no human can even imagine. If he stops, the entire world stops as well.

When talking with others there is no need for any debate, just listen to him. Teams should follow him and try to remain in his shadow.

He makes no mistakes, but if he does, he falls.

 

The Family Man/Woman

We must do it together. He has his own ideas and he is sure he is right, but you can be right as well. He proposes ideas, he does not impose them; he wants to hear from you, your concerns, your thinking, especially if they are not in line with his. He learns from you, you learn from him. He tries to keep his space/territory in order, respecting his own attitude; if you do the same, the world will be OK.

He can fall, but the "family" will stay up and help him to rise again.

 

What and how does this fit in consulting/support?

Well, in many ways, starting from the interactions in the workplace. It is funny to note that any, or almost any, company would like to have "the family man" on board, given the attitude of convincing people rather than imposing on them, the capacity for sharing and so on. Also, when he is sick or if he goes away, the family survives.

On the other hand, customers often want to have the hero with the magic wand. He comes, saves the world, takes the money and goes. What a big relief for them.

If the above were the whole story, in the end the situation would not be so dramatic. We could act as the family man inside our company and as the one man hero when dealing with the outside.

Given that life is fun, the reality is much more complicated, and the infinite combinations start when you put different types together in the same room. So you can have a lot of fun if you behave as the OMH (one man hero) and in the same room, on the other side, there is another one: who will win, who will be the one with the right answer? On the other hand, a room filled with FMs (family men) will lead to very interesting, long discussions, but it is possible that by the time a solution is found the human race will be extinct.

 

So, as usual, the right measure and the right combination of the two elements is the right path to follow.

There are different points that, from my point of view, must always be taken into consideration:

1. No one is better than someone else

2. Stating that you are good is not enough, you must be good

3. Shit happens

4. Hiding shit brings more shit

5. Learning from mistakes makes you stronger

6. If you do not make mistakes you are lying

7. Explaining your ideas allows you to have better ideas

8. Propose, do not impose

9. If you know you are right, be ready to accept that someone else can be right as well

10. If you know and can prove he is wrong, explain, do not denigrate

The funny thing is considering the same points from the customer's point of view:

1. Better for you to be better than me

2. You must be good, period

3. Shit doesn't happen; if it does, it is your fault

4. If shit happens and it is my fault, then it is your fault because you did not tell me the right thing

5. Your mistakes should never happen in my environment

6. Better that you know what you are doing, so train yourself outside of my "house"

7. You must always have good ideas

8. You must be sure of what you say

9. If I know I am right, I am right

10. If you can prove I am wrong, say it in such a way that no one gets offended

 

In any case, the golden rules for me are:

1. Always tell the truth to the customer or colleague

2. Base your suggestions and recommendations on observations and real facts

About 2: you can have an inspiration, and that is good, but then you must prove it.

 

The list can/should be longer, but what I mean here is that there are different approaches when we are acting as Consultant/Support or as the customer, and we shift from one side to the other without even thinking about it.

If you say no, think of when YOU are calling for assistance because your cable TV is not working and you are going to miss your favourite match, or because the Internet connection drops.

 

Back to the point please

But let us get back to the point: dealing with international consulting/support.

As said, we must be good, and I am firmly convinced that if we "always" act as a family man with our colleagues, being good will be a shared effort and achieving good results will be a lot easier.

 

The International teams first

Family’s teams

Let us start from here; there are cultures that see the family approach as a must: you cannot see a good result as your own result, since a result is a team result, and so is a failure. We have this in Japan, and in Asia in general, but it is becoming more and more frequent in Europe as well. We have some "strange" behaviour in countries like Italy and Spain, but normally it works.

When I say strange I mean the "passing the buck" approach that unfortunately is very common in Latin countries.

In the same context, the physical behaviour and the way the voice is used are also different, avoiding being rude or too direct, and instead being more descriptive and verbose.

 

One Man Hero’s team

A different approach is used in some companies, where the leadership still considers the OMH behaviour a benefit. It is easier to find such an approach in North American and European companies where the leadership is still bound to old top-down models. There the people, or resources (I hate that term, but it is a good indicator of the company approach), composing the teams often refer to someone above them in the pyramid, without interacting directly. The results are shared bottom-up, then top-down, not following an internal exchange. This is very important, because the awareness of the internal problems is not equally shared. Being part of such a group exposes you to a lot of possible mistakes because of bad communication.

 

My preference and thoughts

It is quite obvious that, for the "internal" operation, I put my vote on the FM and not on the other, period.

The difficulty lies in keeping all the people aligned and trusting each other. There will always be, especially when setting up a new team, someone who feels he is the ONE. Not because he is bad, but because of the culture, the background and the lack of trust in the others.

This is probably the most important and crucial point: trusting each other in a team is not easy, and it requires more courage than acting alone.

Building up relations between different people is easier when you can have them in the same space at the same time, but it becomes a heavy task with people distributed around the globe. The differences start with the working hours, the holidays they have, the need to talk they have, and so on.

Again I was lucky: I had my first international team when I was able to handle it directly, so at least the time zone issue was reduced to a minimum. But the respect you must have for the differences is the same: different holidays, different working times, different cultures that need to be taken into account. The first time I had to deal with a remote team, wow, that was fun: just organizing the kick-off meeting took me at least 48 hours. No, it was not easy to deal with it in "family mode", and this is also why very often some companies go for the OMH, choosing to impose instead of propose.

Here I would like to remind everyone of a fantastic statement: "People do not leave their jobs, they run away from their boss". Keep an eye on places where people are leaving; there you will probably find a bad manager, and I have seen that happening recently.

 

Now about Customer

It is now time to see how we can combine an international team, with all its internal interactions, with customers, sometimes international, sometimes not.

Here we must distinguish: Consulting and Support are two different categories, or, to be more precise, onsite consulting and remote work are two different ways to interact with the customer.

 

Dealing with customer onsite

When you go onsite at a customer, even if you are a family man, be careful to interact in the correct way and do not be too flexible, given that it could be misunderstood and read as weakness or worse. You are alone, with no family with you, and you must trust yourself fully. Better to switch to One Man Hero mode: the lions will come and they want your body. This is true for the first 5 minutes, where the first 10 seconds give you the chance to evaluate how flexible you can be, or whether you cannot be flexible at all.

What happens if you have in front of you a team headed by an OMH? I have seen a constant behaviour here: the OMH meets you alone, before introducing you to the rest of the team. The first aspect you should consider is whether he is interacting with you at your working level or whether he is there to coordinate. If he is a manager, be careful: it doesn't matter what nationality he is, he will not understand you if you talk about technical aspects (a few do, very few), and he will not understand the explanations; skip them and go straight to the executive summary. If he is a working-level guy, then culture, nationality and background can make the difference. In some cases you must be dramatically direct, close to arrogant, and say what you have to say, as I have seen work in many cases with US people. But if you have in front of you a Latin guy or someone from an Arabic country, then do not do it. You will have an enemy before you even open your mouth. Let him open all the conversations and discussions, and try to keep the conversation following indirect patterns; only when you are sure he is starting to trust you, begin to gently suggest that there could be a "different" way of doing things. Then remember: if he reduces the talks with you, it means you have a problem; try to fix it talking 1:1, never with others.

In those contexts, if the customer has a mixed team, following the internal dynamics will help a lot.

In short, dealing with a customer face to face is primarily a matter of human experience, and then comes how good you are; but you know that you must be good, period.

 

Dealing with the customer remotely

There are two levels of contact here: using some system, or by voice.

The golden rule here is … “Scripta manent verba volant” (Google … if you don’t know Latin).

 

System or email

Systems, including email, exist to reduce the overhead created by human interaction and to optimize the exchange of relevant information. When you have to exchange information with the customer by email or through a ticketing system, nationality, culture and background are not so relevant. What is relevant is to be clear and descriptive, period.

Good customers/managers know that, and tend to push you to use that channel when communicating with them.

Nevertheless, the use of systems can hide some pitfalls, and you must be careful and try to at least be aware of the possible issues. There is no real way to identify what kind of team the customer has (based on the FM or OMH model), but a few elements can help: when you see only one contact point, or many different ones for different topics, or when you don't see a common account being used, then be careful; behind the scenes lurks a team managed by an OMH and built on a lack of communication. On the other hand, a good sign is when they give you one contact (email, ticket system, tracing tool) that is used in parallel by multiple team members; very likely they communicate with each other, and so you can as well.

 

Voice

Houston ... we have a problem. The first time, you can only guess who you are dealing with. This is why you should never, ever talk directly with the customer without preparation. The sales guy, or whoever talks with the customer as the main contact, must brief you first. He should be able to tell you (if he does not, suggest firing him) how to interact, and you should have a place to take notes on customer information and share them with the other members of the team.

Voice and conference calls are the preferred tools of customers that have a very basic understanding of distributed work and international teams. Normally they have no idea how confusing a conference call can be when coordinating real work in case of a problem.

When dealing with customers around the world we have to keep in mind that not all of them speak good English, or English at all, and the same goes for team members on both sides. This is one of the main causes of misunderstanding between us and the customer during conference calls, right after line quality.

For geographic and language reasons, North America is the one using them the most. The problems start when they forget that there are another 201 countries in the world and 6000 spoken languages, where the most used (excluding Chinese) is Spanish with 329 million speakers vs English with 328 million, and, last but not least, there are another 20 hours of time zones where people work too, and sometimes sleep.

In short, keep an eye on the world clock, and whenever there is the need for a call show some respect and at least talk slowly and clearly, avoiding slang or too many jokes.

That's it: multi-cultural, multi-language teams should not have conference calls with the customer too often, or at all, but the coordination level should. Those calls should focus on planning and customer review, nothing more; all other communication should be written.

Doing that will help prevent friction between the different ways of behaving.

 

Lessons learned

There is always something to learn. In this case I think it is that the power, flexibility, creativity and much more obtained by having a team composed of multi-cultural elements is worth so much more than the difficulty of managing it, that even companies not dealing with international customers should go for it.

Language and cultural differences can be an issue, but it is possible to handle them; the only required ingredient is showing a little bit of respect, and asking for it as well.

The problem is keeping the beast under control, speaking of the OMH, but someone more expert than me has already spoken about it in the MUST READ book "Good Boss, Bad Boss" by Robert I. Sutton.


Conclusion

What conclusion? There is no conclusion; I have just put in writing a weekend/Easter thought.

The only thing I would like to stress a bit is that whenever there are teams based on multi-cultural members, on the service provider side or on the customer side, the standard ways of acting do not work anymore.

The key there is respect, once more respect: no one is better, we are different and we must respect that. There is no need to do huge things, small simple ones are good enough. Talk slowly when talking, respect other people's traditions: don't ask for consultancy on the 26th of December, don't plan to send your guru to Cairo on a Friday, and so on.

Our world is a common net from many points of view, but it still lacks understanding of how different it is to deal with companies, teams and customers limited to a specific territory and/or country, compared with companies, teams and customers distributed around the globe.

Excellent managers that work great and achieve great results in the UK (or Italy, the US, France, Japan and so on), for a company working within that country with people coming only from that country, fail miserably if required to fill the same role in a distributed, international environment.

Dealing with international teams and customers requires a specific background. Managers, team leaders and coordinators that are not used to working in an international environment should not be placed in a leading position, given their total lack of experience in correctly handling PEOPLE; not resources, but PEOPLE.

One clarification: a company that has 100 offices in 100 countries and uses a vertical structure is not an international distributed company; it only has many boxes, period.

 


MySQL Search facility is gone

Marco Tusa
MySQL
31 March 2011

 

MySQL Search facility is gone …

 

Well, it happens: after a long period of time without one second of rest, The MySQL Dolphin Search, nicknamed Fetch for friends, decided to go on vacation.

 


He was already not working as expected, and I remember times when, together with other colleagues, we were pushing him to be more precise and less restrictive, more "concept" oriented than "word" oriented. Not really accepting comments, he went his own way, and in any case we loved him.

What happened recently I don't know; since I left I haven't had the chance to talk with him much. But just recently I needed his support and he was not there... instead there was a message, saying that a search for "InnoDB" returned:

“Skip navigation links

Oracle Secure Enterprise Search

 

* Help

* Search

 

Oracle Secure Enterprise Search Error

A problem has been encountered while processing your request.

Based on the information we have, we suspect the following:

 

The exception did not contain any error message. Report this problem to your administrator, or contact your Oracle representative.“

 

Of course I can use his friend G, but you know, I miss him, and even if I was not too nice with him sometimes, I have just realized the great work he was doing, and how much added value he was providing to the documentation I use every day.

 

So, please, whoever knows where he went for vacation, tell him to come back.

 


A dream on MySQL parallel replication

Marco Tusa
MySQL
12 March 2011

 

The Binary Log

 

The binary log in MySQL has two main declared purposes, replication and PITR (point in time recovery), as stated in the MySQL manual. The MySQL binary log stores all the statements that cause a change in the database: in short, DDL statements such as ALTER, CREATE, and so on, and DML like INSERT, UPDATE and DELETE.

As said, the binary log transforms multi-threaded concurrent data/structure modifications into serialized steps of activity. To handle this, a lock for the binary log (the LOCK_log mutex) is acquired just before the event is written to the binary log and released just after the event has been written. Because all session threads for the server log statements to the binary log, it is quite common for several session threads to block on this lock.

The other side effect of this is on the SLAVE side. There the binary log becomes the relay log, and it will be processed SINGLE-threaded.

In the MASTER it looks like:

[Figure: the binary log on the MASTER today]

While on SLAVE:

[Figure: the relay log on the SLAVE today]

 

 

It is quite clear that the SLAVE cannot work efficiently when there is a high level of data modification on the MASTER.

 

And now the story

The other day I was working at a client site, performing some simple benchmarking to identify the correct thresholds for monitoring tools and so on.

While doing that I started to stress their server a little: 24 CPUs, 48GB of RAM, SAN-attached storage, running Linux kernel 2.6.

The client had decided, before I (or we) could say anything, to use MMM for HA, given that having up-to-date replication is a crucial factor for their HA architecture.

The test I was running executes inserts and selects, 650 bytes per insert, using a scalable number of threads and modifiable complexity in the SELECTs; it can do single or batch inserts and finally it can also do BLOB/TEXT inserts (in that case the size per insert jumps up).

I was executing my standard low-traffic exercises when I realized that, with only 6 threads doing 318 inserts per second (204KB/s), the slaves were lagging behind the master by an indecent number of seconds.

I was expecting some lag, but not so much. I stopped any activity on the master and waited for the slaves to recover: it took the slaves 25 minutes to recover/apply 20 seconds of inserts on the master.

I was shocked. That was too much, considering I normally reach 36 threads running at the same time on 5.1 and 144 on 5.5 before starting to see scaling issues and performance degradation (it is not in the scope of this article to publish those results).

Anyhow, the issue was in replication, not on the master. I needed to find a solution for the client, or at least to try, before suggesting an architecture review.

Given that re-inventing the wheel is not my "motto", I started to look around for possible solutions, and parallelization seemed the most interesting one.

 

My first thought went to Tungsten from Continuent, also thanks to the very nice blogs coming from Giuseppe Maxia:

http://datacharmer.blogspot.com/2011/02/advanced-replication-for-masses-part-i.html

http://datacharmer.blogspot.com/2011/02/advanced-replication-for-masses-part-ii.html

http://datacharmer.blogspot.com/2011/03/advanced-replication-for-masses-part.html

 

Giuseppe provides a superb introduction to one of the most interesting (actually the only really usable) solutions currently available, the "Tungsten Replicator".

 

I was reading the articles and getting more and more excited.
That was exactly what I was looking for, until I read "A Shard, in this context, is the criteria used to split the transactions across channels. By default, it happens by database".

 

Ooops, it will not work for my customer and me, given that their instance is mainly single-schema, or at least the major amount of traffic is directed to only one schema.

I pinged Giuseppe and we had a very interesting talk; he confirmed the approach by object/schema, suggesting alternatives for my issue. Unfortunately, given the special case, no sharding or table distribution could be done, so for the moment, and for this specific case, that solution will not fit.

Looking around, I have seen that almost all solutions use the same approach, which is parallel replication by OBJECT, where the object is, again and unfortunately, mainly the SCHEMA.

Sharding replication by schema is an approach that can work in some cases, but not all.

If you have many schemas, with a limited number of well-distributed Questions per second, I agree this solution is a very good one.

If you have only one schema with a very high level of Questions per second, this will not work anymore; for sure we can distribute the tables across different schemas, but what about transactions doing inserts/modifications on multiple tables?

We should be sure that each action runs in an isolated state, with no relation to the others. In short, it will not work.
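
Just to make the problem concrete, here is a hypothetical transaction of the kind described above; the schema and table names are invented purely for illustration. Sharding by schema would route the two inserts to different channels, while they must stay in the same transaction:

-- One transaction touching tables in two different schemas (hypothetical names).
START TRANSACTION;
INSERT INTO shop.orders (id, total) VALUES (1, 100);
INSERT INTO billing.invoices (order_id, amount) VALUES (1, 100);
COMMIT;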


Having the possibility to dream, I would like to have parallel replication by MySQL user/connection.

I know it could look like a crazy idea, but replicating the behaviour of the MASTER on the SLAVES starts from the users' behaviour, not from the objects inside the server instance.

Trying to move objects from A to B (i.e. from MASTER to SLAVE) is really complicated because it needs to take into account how people and applications behave on a specific OBJECT or set of them. The status of the OBJECTS is subject to constant changes, most of them at the same time, so concurrency and versioning need to be preserved.

In short, what I was thinking of was a conceptual shift from a REPLICATION approach to a MIMIC approach.


I started my dream by looking at what we already have and how it could be used.

We have a flexible binary log format (version 4); we have a known set of SCHEMAS, USERS, and connections going from users to schemas. We have (at least for InnoDB) good concurrency and versioning mechanisms. MAX_USER_CONNECTIONS can be used to tune the number of threads a specific user can use, and the permissions can be assigned as well.

Setting MAX_USER_CONNECTIONS=30 for USER1@ will help us correctly dimension the effort that replication by user will have to spend checking for consistency.
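
For reference, this is how the per-user connection limit mentioned above can be set in MySQL 5.1/5.5; 'user1'@'%' is a hypothetical account used only for illustration:

-- Limit the hypothetical account to 30 simultaneous connections.
GRANT USAGE ON *.* TO 'user1'@'%' WITH MAX_USER_CONNECTIONS 30;
SHOW GRANTS FOR 'user1'@'%';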

I am aware of the issues given the interactions and relations between users/transactions, and I see the need for a coordinator that will eventually manage the correct flushing of the transactions by epoch.

What I was thinking is that the coordinator on the slave could open as many connections as MAX_USER_CONNECTIONS on the master, and use them as flushing threads.

As said before, I was thinking to trust the internal versioning, so engines that do not support transactions are excluded.

The epoch and/or transaction ID would then be used to flush them in the correct order, where the flush is done to the pages in memory and then uses the internal mechanisms for flushing to disk(s), talking of InnoDB.

It will not be real parallelization, but not serialization either, given that we will trust the internal versioning; we will respect the EPOCH for the launch order, not waiting for completion.

On the MASTER the replicator agent will create X replication channels for each authorized/configured user per SCHEMA; users and permissions must be the same on the SLAVE as well.

The binary log writes could look like:

[Figure: the binary log split per user]

On the SLAVE, the logs will be copied and then processed opening 1 to MAX_USER_CONNECTIONS thread(s).
Coordination between inserts will be handled by a replication MUTEX per user, based on the EPOCH.

[Figure: the relay log processed per user/connection]

I know it is quite difficult to realize, given the fact that we will have several users writing on different schemas. To have it working correctly, a MUTEX should check the correct process execution at the relay log coordinator, also ensuring a correct execution order based on the EPOCH.

It sounds like another serialization, even if we will not wait for the transactions to complete, but only to start.

It seems that a mixed approach would be a better solution, such that replication by SHARD is used for the database and then by USER inside each SHARD. This means archiving actions per schema and per user transaction, and replicating the work on the SLAVE while ensuring the same start order of the transactions for the modifications. This is not as crazy as most may think, and not so far from sharding by schema, except for the multi-threading during the writes.

The main point will be to archive the information correctly in the binary log:

 

[Figure: the binary log organized per schema and per user]

And to execute them in the right order as well, also taking into account the user/transaction order:

[Figure: the relay log per schema and per user]

Relay logs will contain information divided by SCHEMA, processed using multiple threads per user/connection and organized with reference to the epoch of the starting execution order on the MASTER. Different grades of parallel or serialized execution could be configured in how the single schema relay coordinator works.

I know it will not be full parallelization, but it is not serialization either, given the possible parallel execution of N transactions per user.

 

I think there could be some risk here, so in that case we should be able to:

- Configure the number of executing threads per user

- Execute only one thread per user

- Choose between EPOCH-based execution and execution on lock release

 

Fast/short comment on the new binary log.

From binary log V4 we have the possibility to extend what the binary log stores in a flexible way.

The extra headers (from byte 19 to X) are there but not used, while to me that sounds like the most interesting place to store additional information about what happened on the MASTER and how.

Not only that: a HASH value representing the data on the master after the operation would be an incredible help in checking whether the replicated data on the SLAVE is consistent with the MASTER or not, avoiding having to do that with the silly methods we are forced to use today.

Below is the binary log V4 format description:

v4 format description event (size >= 91 bytes; the size is 76 + the number of event types):

+=====================================+
| event  | timestamp         0 : 4    |
| header +----------------------------+
|        | type_code         4 : 1    | = FORMAT_DESCRIPTION_EVENT = 15
|        +----------------------------+
|        | server_id         5 : 4    |
|        +----------------------------+
|        | event_length      9 : 4    | >= 91
|        +----------------------------+
|        | next_position    13 : 4    |
|        +----------------------------+
|        | flags            17 : 2    |
+=====================================+
| event  | binlog_version   19 : 2    | = 4
| data   +----------------------------+
|        | server_version   21 : 50   |
|        +----------------------------+
|        | create_timestamp 71 : 4    |
|        +----------------------------+
|        | header_length    75 : 1    |
|        +----------------------------+
|        | post-header      76 : n    | = array of n bytes, one byte per event
|        | lengths for all            |   type that the server knows about
|        | event types                |
+=====================================+
v4 event header:
+============================+
| timestamp         0 : 4    |
+----------------------------+
| type_code         4 : 1    |
+----------------------------+
| server_id         5 : 4    |
+----------------------------+
| event_length      9 : 4    |
+----------------------------+
| next_position    13 : 4    |
+----------------------------+
| flags            17 : 2    |
+----------------------------+
| extra_headers    19 : x-19 |
+============================+
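
A quick way to look at the events carrying this header, without parsing the file by hand, is SHOW BINLOG EVENTS; the log file name below is just a hypothetical example:

SHOW BINLOG EVENTS IN 'mysql-bin.000001' FROM 4 LIMIT 5;
-- Event_type, Server_id and End_log_pos in the output correspond to the
-- type_code, server_id and next_position fields of the header shown above.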


Conclusion

My favorite mottos are:
 “only the ones that do nothing do not make mistakes”

and

“don’t shoot on the pianist”.

As I said from the beginning, mine is a dream, and I am just proposing an idea, especially because I see that almost everyone is following the same mental path focused on OBJECTS, probably for a good reason; nevertheless, I am following my dream.

It could happen that someone would like to try to work on this, or will use part of the idea, revisiting and modifying it and making it possible to implement.

For the moment it is too complex and takes too much time for me to do it alone, but it could be a good idea for a plug-in in my next life.

 

References:

In addition to Giuseppe's blog, which for now seems to me the only one proposing something that could be implemented, I did some research, but in the end the actors are always the same:

Percona

http://www.percona.com/docs/wiki/percona-server:blueprints:parallel-replication

http://www.mysqlperformanceblog.com/2007/01/30/making-mysql-replication-parallel/

MySQL forge

http://forge.mysql.com/worklog/task.php?id=5569

Other

http://feedblog.org/2007/01/30/design-proposal-for-multithreaded-replication-in-mysql/

http://scale-out-blog.blogspot.com/2010/10/parallel-replication-on-mysql-report.html

 

Note

I have already received a comment from Giuseppe about how to handle the locks and prevent deadlocks... working on that (I have some ideas and will provide details soon).

 


Binary log and Transaction cache in MySQL 5.1 & 5.5

Marco Tusa
MySQL
23 February 2011

 

 

The Binary Log

The binary log in MySQL has two main declared purpose, replication and PTR (point in time recovery), as declared in the MySQL manual. In the MySQL binary log are stored all that statements that cause a change in the database. In short statements like DDL such as ALTER, CREATE, and so on, and DML like INSERT, UPDATE and DELETE.

As said, the binary log transforms multi-threaded, concurrent data/structure modifications into a serialized stream of events. To handle this, a lock for the binary log (the LOCK_log mutex) is acquired just before the event is written to the binary log and released just after the event has been written. Because all session threads of the server log statements to the binary log, it is quite common for several session threads to block on this lock.

It is not in the scope of this discussion to provide a detailed description of the binary log, which can be found at http://dev.mysql.com/doc/refman/5.1/en/binary-log.html.

What is relevant is that MySQL writes "a" log and reads "a" log, serializing what happens on the MASTER in a multi-threaded scenario. Also relevant is that MySQL has to deal with different behaviors for TRANSACTIONAL and NON-TRANSACTIONAL storage engines, while trying to behave consistently.

In order to do that, MySQL has changed its behavior while evolving; in particular, it has changed the way it was and is using the TRANSACTION CACHE, creating no little confusion.
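As a minimal check (the log file names depend on your log-bin setting), you can verify that the binary log is active and being written to with:

SHOW VARIABLES LIKE 'log_bin';
SHOW MASTER STATUS;
SHOW BINARY LOGS;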

 

Transaction Cache

 

What is the Transaction Cache, or TC? It is a per-thread cache used by MySQL to store the statements that will need to be written to the binary log.

This cache is active ONLY when the binary log is active, otherwise it is not used. As its name declares, it is a "Transaction" cache, so it is used to store the statements executed within a transaction, which will then be consistently flushed to the binary log, serializing the multi-threaded execution.

But what if we also have a non-transactional storage engine involved in our operations? Well, here is where the fun begins, and also where the differences in how MySQL acts make our life quite complicated.

 

 

Actors involved (who control what)

 

The TC (Transaction Cache) is currently controlled by a few parameters, and it is important to keep them in mind (a quick way to check them follows the list):

      log-bin
o   this option enables the binary log; if it is not present, none of the structures related to it are active;
      binlog_cache_size
o   the size of the cache that holds changes to the binary log during a transaction;
      binlog_stmt_cache_size
o   the size of the cache that holds non-transactional statements issued during a transaction;
      binlog_direct_non_transactional_updates
o   when enabled, updates to non-transactional tables are written directly to the binary log rather than to the transaction cache;
      Binlog_cache_use vs Binlog_cache_disk_use
o   status variables to check the efficiency of the transaction cache;
      Binlog_stmt_cache_use vs Binlog_stmt_cache_disk_use
o   status variables to check the efficiency of the statement cache.
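A quick way to inspect them (just a sketch; on 5.1 the binlog_stmt_cache* variable and counters are simply not there):

SHOW GLOBAL VARIABLES LIKE 'binlog%cache_size';
SHOW GLOBAL STATUS LIKE 'Binlog_cache%';
SHOW GLOBAL STATUS LIKE 'Binlog_stmt_cache%';
-- If Binlog_cache_disk_use (or Binlog_stmt_cache_disk_use) keeps growing compared to
-- the corresponding _use counter, the cache is too small and is spilling to temporary files on disk.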

 

 

MySQL Transaction cache and MySQL versions

 

OK, let us start by defining some milestones: what is what, in which version, and what works and what doesn't.

binlog_cache_size is present from the start but behaves differently:

5.1: controls the whole transaction cache;

5.5.3 up to 5.5.8: controls the size of both caches, the transactional one and the statement (non-transactional) one;

5.5.9 and newer: controls only the cache dedicated to transactional statements.

 

binlog_stmt_cache_size

5.1: not present;

5.5.9: controls the size of the cache dedicated to storing the NON-transactional statements present in a transaction.

 

binlog_direct_non_transactional_updates

5.1.44: introduced. It is DISABLED by default. If enabled, it allows all non-transactional statements executed, even AFTER a transaction (we will see later what this means), to be flushed directly to the binary log without going through the TC. It has effect only if binlog_format is set to STATEMENT or MIXED.

5.5.2: introduced.

5.5.5: modified to be active ONLY if binlog_format is STATEMENT; in any other case it is ignored.

This feature, regardless of the MySQL version, creates more problems than solutions, and honestly I think it should be removed; please see the bug (http://bugs.mysql.com/bug.php?id=51291) to understand what I mean. In any case it now has a very limited scope of action, and it is better to consider it as not present.
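For completeness, toggling it is trivial (a sketch, not a recommendation; remember it is ignored unless binlog_format is STATEMENT, or MIXED before 5.5.5):

SET GLOBAL binlog_format = 'STATEMENT';
SET SESSION binlog_direct_non_transactional_updates = ON;  -- OFF by default
SHOW VARIABLES LIKE 'binlog_direct_non_transactional_updates';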

 

 

How the whole works

 

The first point to keep in mind is that the TC is used only during a transaction; likewise, the new statement cache (TCStatement) is supposed to store only statements issued during a transaction, and it is flushed at the moment of the COMMIT, or more precisely, it then goes to the queue for acquiring the LOCK_log mutex.

 

All other statements go directly to the LOCK_log mutex queue.

That said, let us see how the TC works, what it does and how.

 

In general terms, when you mix transactional and NON-transactional statements, these rules apply, unless the infamous binlog_direct_non_transactional_updates is enabled (5.1 only):

1. If the statement is marked as transactional, it is written to the transaction cache.

2. If the statement is not marked as transactional and there are no statements in the transaction cache, the statement is written directly to the binary log.

3. If the statement is not marked as transactional, but there are statements in the transaction cache, the statement is written to the transaction cache.

 

The interesting point here is what is defined as NON-transactional and what as TRANSACTIONAL; from reading the manual and the many references, we have that:

 

After MySQL 5.1.31

a statement is considered non-transactional if it changes only non-transactional tables;

a statement is transactional if it changes only transactional tables.

Finally a statement that changes both non-transactional and transactional tables is considered “mixed”. Mixed statements, like transactional statements, are cached and logged when the transaction commits.

 

Before MySQL 5.1.31

a statement was considered non-transactional if the first changes it made were to non-transactional tables;

a statement was considered transactional if the first changes it made were to transactional tables.

 

In terms of code:

Before MySQL 5.1.31

Transactional:

BEGIN;

INSERT INTO innodb_tbl VALUES (a),(b);

INSERT INTO myisam_tbl SELECT * FROM innodb_tbl;

COMMIT;


NON-Transactional (the first statement is treated as such even if it is inside a declared transaction):

BEGIN;

INSERT INTO myisam_tbl VALUES (a),(b);

INSERT INTO innodb_tbl SELECT * FROM innodb_tbl;

COMMIT;

After MySQL 5.1.31

Transactional:

BEGIN;

INSERT INTO innodb_tbl VALUES (a),(b);

INSERT INTO innodb_tbl SELECT * FROM innodb_tbl;

COMMIT;


NON-Transactional:

INSERT INTO myisam_tbl VALUES (a),(b);

INSERT INTO myisam_tbl SELECT * FROM innodb_tbl;

Mixed:

BEGIN;

INSERT INTO innodb_tbl VALUES (a),(b);

INSERT INTO myisam_tbl SELECT * FROM innodb_tbl;

COMMIT;

 

A graph can also help to understand what happens between the threads and the flush of the TC to the binary log.

In the graphs we have three threads running in parallel; remember that inside MySQL the main entity is the THD: "The THD class defines a thread descriptor. It contains the information pertinent to the thread that is handling the given request. Each client connection is handled by a thread. Each thread has a descriptor object."

 

[figure: slide1]

 

The first thread will do all its operations using a TRANSACTIONAL storage engine (InnoDB);

 

the second thread will use a NON-TRANSACTIONAL engine (MyISAM) for the first insert, then a TRANSACTIONAL one (InnoDB);

 

the third thread will first do a TRANSACTIONAL insert and then use it to populate a NON-TRANSACTIONAL table.

 

In the first graph I have assumed the 5.1.31-and-over behavior.
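To fix ideas, here is a sketch of what the three threads in the graph could be running (innodb_tbl and myisam_tbl are hypothetical two-column tables on InnoDB and MyISAM respectively):

-- THD1, fully transactional:
BEGIN;
INSERT INTO innodb_tbl VALUES (1,'a');
INSERT INTO innodb_tbl VALUES (2,'b');
COMMIT;

-- THD2, non-transactional first, then transactional:
BEGIN;
INSERT INTO myisam_tbl VALUES (1,'a');
INSERT INTO innodb_tbl SELECT * FROM myisam_tbl;
COMMIT;

-- THD3, transactional insert first, then used to populate the non-transactional table:
BEGIN;
INSERT INTO innodb_tbl VALUES (3,'c');
INSERT INTO myisam_tbl SELECT * FROM innodb_tbl;
COMMIT;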

--------------------------------------------------------------

 

The three threads start almost at the same time, and none of them has data in any cache or buffer. That said, it is probably true that statement 1 in THD2 will be the first to complete, given that it carries less overhead; and since it is the first statement in the THD2 transaction, it will be treated as NON-TRANSACTIONAL and sent directly to the binary log's queue.

 

THD1 is fully TRANSACTIONAL: it will do its internal actions, then flush to the binary log.

 

Then THD2 completes the second, TRANSACTIONAL part of its operation and flushes it as well.

 

Last comes THD3, which first inserted a record into the TRANSACTIONAL table and then used it to populate the NON-TRANSACTIONAL one.

 

Note

 

It looks more or less fine, doesn't it? Well, now assume that all three THDs share the same InnoDB table and that, as per default, the ISOLATION LEVEL is REPEATABLE READ... Do you get it?

 

The statements on the MASTER will read and set values that are potentially different from the ones on the SLAVE, given the different order of execution.

 

Imagine having UPDATEs instead of INSERTs, and/or WHERE conditions based on values. Results could differ depending on the position in the log, and remember, we are still talking about STATEMENT replication, given that it is still the default ALSO in 5.5.

 

Finally, remember that the issue is NOT that THD1/2/3 could potentially change each other's values, but the fact that they do something on the MASTER which could turn out differently on the SLAVE.

 

That's it; the changes introduced in MySQL were done to reduce this wrong behavior, or at least to try to keep consistency.

 

Let us see what and how in the other graphs.

 

 

What about using binlog_direct_non_transactional_updates, will it help?

 

I admit that I have exaggerated a little bit in this example, but my intention was to show how dangerous this option can be.

[figure: slide2]

 

THD2 statement 1, as usual, will be the first one; then THD1 consistently applies the whole set; but then it can happen that, with binlog_direct_non_transactional_updates set, THD3 is faster in flushing the NON-TRANSACTIONAL statement, writing to the binary log before THD2, adding fragmentation and breaking the order in the binary log.

 

As said, I have exaggerated a little, but this can easily happen in a standard context.

So, in addition to the already reported bug(s) (see the reference section), this option for me is not really an additional help, but an issue generator, as the example below shows.
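A minimal sketch of the kind of dependency that breaks (innodb_tbl and myisam_tbl are hypothetical; assume innodb_tbl holds the row (a=1, val=10) and the option is ON with STATEMENT format):

BEGIN;
UPDATE innodb_tbl SET val = val * 2 WHERE a = 1;                  -- val becomes 20, cached in the TC
INSERT INTO myisam_tbl SELECT a, val FROM innodb_tbl WHERE a = 1; -- reads 20 on the master, but is written
                                                                  -- to the binary log NOW, before the
                                                                  -- transaction that produced that value
COMMIT;
-- The slave replays the INSERT first, reads val = 10, and stores a different row than the master.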

 

 

Transaction Cache in MySQL 5.5

 

From MySQL 5.5.4 the TC is divided in two parts: one controls/stores the TRANSACTIONAL statements, the other stores ONLY the NON-TRANSACTIONAL statements.

 

What does it mean? In theory, it means that any statement which is supposed to be part of a TRANSACTION but relates to a NON-TRANSACTIONAL table (like a statement at the start of a transaction), instead of going directly to the binary log, should be moved to the statement cache (TCStatement), and then flushed when the TC is flushed.

 

The involvement of binlog_direct_non_transactional_updates complicates the whole scenario a little and, in addition to that, it also creates some major issues in the correct flush, so I am ignoring it here.

 

Let us see what happens using the new TC mechanism:

[figure: slide3]

THD1 is, as usual, consistent in the way it flushes to the binary log. Arbitrarily, I have decided that this time THD2 takes a little bit more time, say a millisecond, and that is enough for THD1 to acquire the LOCK_log before it.

 

The purpose, as you have probably guessed, was to make the graph more ORDERED, given that this time having one or the other flush before or after will not really make any difference.

 

This is because the TCStatement will be flushed (at least this is what I have seen and understood) when the TC is flushed. The result is that the flush will take place in the same order on the master and on the slave.

 

Finally, THD3, even though it would have the possibility to use the TCStatement, will not, given that the statement using a NON-TRANSACTIONAL storage engine is NOT the first statement of the transaction; so the mixed transaction will be placed in the TC as expected.

 

 

Conclusion

 

There was not much to say from the beginning, to be honest: the TC and TCStatement are useful only for the cases where we mix engines in transactions. The only significant changes are between before and after 5.1.31, and the introduction of the TCStatement. All the other attempts to add flexibility to this area have, in my view, only increased the bad behaviors.

 

It is easy for master and slave to store or modify data in an unexpected way, an issue that any DBA probably faces too often in MySQL.

I am not saying that all those cases should be traced back to this specific issue, but for sure it is one of the least known and most difficult to track, because many things influence how MySQL flushes to the binary log.

 

The golden rule is: do not mix NON-TRANSACTIONAL and TRANSACTIONAL engines; but if you really have to, be sure to do it using 5.5.9 and above, which at least gives you more assurance of flushing the commands to the binary log consistently.
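A quick way to spot the tables that could end up mixed into transactions (just a sketch; adjust the engine list and schema filter to your environment):

SELECT TABLE_SCHEMA, TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE ENGINE IN ('MyISAM','MEMORY','ARCHIVE')
  AND TABLE_SCHEMA NOT IN ('mysql','information_schema');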

 

Last notes

 

Remember that any NON-TRANSACTIONAL statement in the TC cache will be flushed on COMMIT or ROLLBACK, and that in the latter case a WARNING should be raised (see the sketch below).
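A minimal sketch of that behavior (tables as in the earlier examples):

BEGIN;
INSERT INTO innodb_tbl VALUES (4,'d');
INSERT INTO myisam_tbl VALUES (4,'d');   -- cached together with the transaction (rule 3)
ROLLBACK;
SHOW WARNINGS;
-- Returns something like "Some non-transactional changed tables couldn't be rolled back":
-- the MyISAM row stays, and its statement is still flushed to the binary log.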

 

Note that I will post later (when I have the time) some code and related raw data for documentation purposes.

 

References:

--------------

http://dev.mysql.com/doc/refman/5.1/en/replication-features-transactions.html

http://dev.mysql.com/doc/refman/5.1/en/replication-options-binary-log.html#sysvar_binlog_cache_size

http://dev.mysql.com/doc/refman/5.1/en/replication-options-binary-log.html#sysvar_binlog_direct_non_transactional_updates

http://dev.mysql.com/doc/refman/5.5/en/replication-options-binary-log.html#sysvar_binlog_stmt_cache_size

http://bugs.mysql.com/bug.php?id=51291

http://www.chriscalender.com/?tag=large-transaction

http://mysqlmusings.blogspot.com/2009/02/mixing-engines-in-transactions.html
