Trying out Group Replication (3): Behavior on Failure

The group_replication_recovery channel is used only to synchronize the delta between servers after a failure.

While the member's Group Replication state is ONLINE, the recovery channel sits idle: Master_Host is empty (Master_Port is 0) and Slave_IO_Running / Slave_SQL_Running are both No.

mysql>  select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 0cdd0b6a-cd84-11e6-b223-fa163e0b2b7c | gr02        |        3306 | ONLINE       |
| group_replication_applier | 1edf2e1d-cd83-11e6-961b-fa163e83e8e7 | gr01        |        3306 | ONLINE       |
| group_replication_applier | a1c37edb-cd89-11e6-a463-fa163e49d992 | gr03        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)


mysql> show slave status for channel 'group_replication_recovery' \G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host:
Master_User: rpl_user
Master_Port: 0
Connect_Retry: 60
Master_Log_File:
Read_Master_Log_Pos: 4
Relay_Log_File: gr02-group_replication_recovery.000001
Relay_Log_Pos: 4
Relay_Master_Log_File:
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
## snip ##
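
As a supplementary check (not in the original post), the same idle state should also be visible in performance_schema.replication_connection_status: while the member is ONLINE, I would expect SERVICE_STATE to be ON for the group_replication_applier channel and OFF for the group_replication_recovery channel.

mysql> SELECT CHANNEL_NAME, SERVICE_STATE
    ->   FROM performance_schema.replication_connection_status;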

When a failure leaves a member behind the other servers and it enters the RECOVERING state, the Group Replication plugin automatically picks a donor (master) and synchronizes the missing transactions.

mysql>  select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 0cdd0b6a-cd84-11e6-b223-fa163e0b2b7c | gr02        |        3306 | RECOVERING   |
| group_replication_applier | 1edf2e1d-cd83-11e6-961b-fa163e83e8e7 | gr01        |        3306 | ONLINE       |
| group_replication_applier | a1c37edb-cd89-11e6-a463-fa163e49d992 | gr03        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

mysql> show slave status for channel 'group_replication_recovery' \G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: gr01
Master_User: rpl_user
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: binlog.000001
Read_Master_Log_Pos: 65574674
Relay_Log_File: gr02-group_replication_recovery.000003
Relay_Log_Pos: 31852573
Relay_Master_Log_File: binlog.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes

## snip ##
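
A minimal way to reproduce this transition (my own sketch, assuming the gr01/gr02/gr03 setup from the earlier posts and the test.t table used later in this post): take gr02 out of the group, write on gr01, then rejoin gr02 and watch it pass through RECOVERING.

-- on gr02: leave the group (simulated failure)
mysql> STOP GROUP_REPLICATION;

-- on gr01: create a delta while gr02 is away
mysql> INSERT INTO test.t (col1) VALUES (NOW());

-- on gr02: rejoin; the member should show as RECOVERING until it has caught up
mysql> START GROUP_REPLICATION;
mysql> SELECT MEMBER_HOST, MEMBER_STATE FROM performance_schema.replication_group_members;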

When the required binary logs have already been lost (purged)

This is the situation where Galera would give up on IST (incremental state transfer) and fall back to SST (taking a full backup and syncing it over). Group Replication instead keeps looking for a donor that still has the required binary logs, and once the retry count reaches its limit the member gives up: it cannot join and mysqld shuts down.

2016-12-30T01:54:26.017573Z 14 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 5/10'
2016-12-30T01:55:26.030228Z 14 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='gr02', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2016-12-30T01:55:26.045140Z 14 [Note] Plugin group_replication reported: 'Establishing connection to a group replication recovery donor 0cdd0b6a-cd84-11e6-b223-fa163e0b2b7c at gr02 port: 3306.'
2016-12-30T01:55:26.046151Z 23 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2016-12-30T01:55:26.059113Z 23 [Note] Slave I/O thread for channel 'group_replication_recovery': connected to master 'rpl_user@gr02:3306',replication started in log 'FIRST' at position 4
2016-12-30T01:55:26.061461Z 24 [Note] Slave SQL thread for channel 'group_replication_recovery' initialized, starting replication in log 'FIRST' at position 0, relay log './gr03-group_replication_recovery.000001' position: 4
2016-12-30T01:55:26.066858Z 23 [ERROR] Error reading packet from server for channel 'group_replication_recovery': The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires. (server_errno=1236)
2016-12-30T01:55:26.066904Z 23 [ERROR] Slave I/O for channel 'group_replication_recovery': Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Error_code: 1236
2016-12-30T01:55:26.066917Z 23 [Note] Slave I/O thread exiting for channel 'group_replication_recovery', read up to log 'FIRST', position 4
2016-12-30T01:55:26.066983Z 14 [Note] Plugin group_replication reported: 'Terminating existing group replication donor connection and purging the corresponding logs.'
2016-12-30T01:55:26.067037Z 24 [Note] Error reading relay log event for channel 'group_replication_recovery': slave SQL thread was killed
2016-12-30T01:55:26.079298Z 14 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='gr02', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2016-12-30T01:55:26.091463Z 14 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 6/10'

## snip ##

2016-12-30T02:00:22.345848Z 0 [Note] InnoDB: Starting shutdown...
2016-12-30T02:00:22.446097Z 0 [Note] InnoDB: Dumping buffer pool(s) to /var/lib/mysql/ib_buffer_pool
2016-12-30T02:00:22.447044Z 0 [Note] InnoDB: Buffer pool(s) dump completed at 161230 11:00:22
2016-12-30T02:00:23.959156Z 0 [Note] InnoDB: Shutdown completed; log sequence number 280039404
2016-12-30T02:00:23.962273Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2016-12-30T02:00:23.962291Z 0 [Note] Shutting down plugin 'MyISAM'
2016-12-30T02:00:23.962302Z 0 [Note] Shutting down plugin 'CSV'
2016-12-30T02:00:23.962308Z 0 [Note] Shutting down plugin 'PERFORMANCE_SCHEMA'
2016-12-30T02:00:23.962346Z 0 [Note] Shutting down plugin 'sha256_password'
2016-12-30T02:00:23.962352Z 0 [Note] Shutting down plugin 'mysql_native_password'
2016-12-30T02:00:23.963043Z 0 [Note] Shutting down plugin 'binlog'
2016-12-30T02:00:23.963818Z 0 [Note] /usr/sbin/mysqld: Shutdown complete
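
The "Attempt 5/10" counter and the roughly 60-second gap between attempts in the log above correspond to two system variables; the values in the comments are what I believe are the 5.7 defaults, so check them on your own build. They can be raised if a donor needs more time to come back:

mysql> SHOW VARIABLES LIKE 'group_replication_recovery_re%';
-- group_replication_recovery_retry_count         10  (number of donor connection attempts)
-- group_replication_recovery_reconnect_interval  60  (seconds between attempts)

mysql> SET GLOBAL group_replication_recovery_retry_count = 20;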

Behavior during split-brain

With a two-member group, I killed one member to reproduce a split-brain (loss of quorum).

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 0cdd0b6a-cd84-11e6-b223-fa163e0b2b7c | gr02        |        3306 | UNREACHABLE  |
| group_replication_applier | 1edf2e1d-cd83-11e6-961b-fa163e83e8e7 | gr01        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
2 rows in set (0.00 sec)


mysql> select * from test.t limit 2;
+----+---------------------+
| pk | col1                |
+----+---------------------+
|  1 | 2016-12-29 15:23:07 |
|  8 | 2016-12-29 15:23:07 |
+----+---------------------+
2 rows in set (0.00 sec)


mysql> INSERT INTO test.t (col1) VALUES(NOW());
(the statement does not return)
mysql> SHOW PROCESSLIST;
+------+-------------+-----------+------+---------+-------+--------------------------------------------------------+-----------------------------------------+
| Id   | User        | Host      | db   | Command | Time  | State                                                  | Info                                    |
+------+-------------+-----------+------+---------+-------+--------------------------------------------------------+-----------------------------------------+
|    7 | system user |           | NULL | Connect | 75875 | executing                                              | NULL                                    |
|   10 | system user |           | NULL | Connect | 75875 | Slave has read all relay log; waiting for more updates | NULL                                    |
| 5482 | root        | localhost | NULL | Query   |    53 | query end                                              | INSERT INTO test.t (col1) VALUES(NOW()) |
| 5786 | root        | localhost | NULL | Query   |     0 | starting                                               | SHOW PROCESSLIST                        |
+------+-------------+-----------+------+---------+-------+--------------------------------------------------------+-----------------------------------------+
4 rows in set (0.00 sec)

SELECTs still work, but writes are blocked and the session appears to hang, so consistency is preserved even during the split-brain. Once the split-brain is resolved, the pending INSERT completes successfully. The Lab release did not handle the split-brain state well, but the GA release now handles it properly.
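
If the failed member cannot be brought back, the surviving member can also be forced to recompute the membership from the reachable members only, using group_replication_force_members (a sketch of the documented escape hatch; 'gr01:33061' is an assumption and must match the survivor's group_replication_local_address, and the setting bypasses the majority check, so use it with care):

mysql> SET GLOBAL group_replication_force_members = 'gr01:33061';
-- after the group has reconfigured and writes flow again, clear it
mysql> SET GLOBAL group_replication_force_members = '';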

Because a two-member group loses its majority as soon as a single member dies (1 out of 2 is not a majority), three or more members are recommended when setting up Group Replication.
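
As a quick way to see how close the group is to losing its majority (a supplementary query, not from the original post), the member counts can be compared directly; writes block as soon as the reachable members are no longer more than half of the group.

mysql> SELECT COUNT(*) AS total_members,
    ->        SUM(MEMBER_STATE = 'ONLINE') AS online_members
    ->   FROM performance_schema.replication_group_members;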