本文主要是介绍多线程回放+flush tables with read lock 死锁,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!
文章目录
- 多线程回放+flush tables with read lock 死锁
- 一、场景描述
- 二、死锁排查
- 三、解决办法
- 四、如何复现的?
多线程回放+flush tables with read lock 死锁
一、场景描述
MySQL-5.7.18 slave实例上夜间进行备份操作时,
processlist 结果,只列出关键部分
mysql> show processlist;l
+----+-------------+-----------+--------------------+---------+------+--------------------------------------------------------+--------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-------------+-----------+--------------------+---------+------+--------------------------------------------------------+--------------------------------------------+
| 3 | root | localhost | test_dead_lock | Query | 2264 | Waiting for commit lock | flush tables with read lock |
| 4 | system user | | NULL | Connect | 2708 | Waiting for master to send event | NULL |
| 5 | system user | | NULL | Connect | 2289 | Slave has read all relay log; waiting for more updates | NULL |
| 6 | system user | | NULL | Connect | 2319 | Waiting for global read lock | insert into ashe(id,name) values(40,'aaa') |
| 7 | system user | | NULL | Connect | 2308 | Waiting for preceding transaction to commit | NULL |
| 8 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL |
| 9 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL |
| 10 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL |
| 11 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL |
| 12 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL |
| 13 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL |
| 14 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL |
| 15 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL |
| 16 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL |
slave相关参数设置
mysql> show global variables like '%slave%'-> ;
+-------------------------------------------+-----------------------+
| Variable_name | Value |
+-------------------------------------------+-----------------------+
| init_slave | |
| log_slave_updates | ON |
| log_slow_slave_statements | ON |
| rpl_semi_sync_master_wait_for_slave_count | 1 |
| rpl_semi_sync_master_wait_no_slave | ON |
| rpl_semi_sync_slave_enabled | ON |
| rpl_semi_sync_slave_trace_level | 32 |
| rpl_stop_slave_timeout | 31536000 |
| slave_allow_batching | OFF |
| slave_checkpoint_group | 512 |
| slave_checkpoint_period | 300 |
| slave_compressed_protocol | OFF |
| slave_exec_mode | STRICT |
| slave_load_tmpdir | /data/mysql/mysql_tmp |
| slave_max_allowed_packet | 1073741824 |
| slave_net_timeout | 30 |
| slave_parallel_type | LOGICAL_CLOCK |
| slave_parallel_workers | 16 |
| slave_pending_jobs_size_max | 16777216 |
| slave_preserve_commit_order | ON |
| slave_rows_search_algorithms | INDEX_SCAN,HASH_SCAN |
| slave_skip_errors | OFF |
| slave_sql_verify_checksum | ON |
| slave_transaction_retries | 128 |
| slave_type_conversions | |
| sql_slave_skip_counter | 0 |
+-------------------------------------------+-----------------------+
此slave实例没有为业务提供查询服务,凌晨会有xtrabackup备份操作。
二、死锁排查
从processlist中可以看到如下信息
- 线程3是xtrabackup 下发的
flush tables with read lock
被其他线程阻塞,一般出现这个问题,证明是有慢语句存在。 - 线程6是sql回放线程,被线程3阻塞
- 线程7的状态为
Waiting for preceding transaction to commit
排查思路
Waiting for preceding transaction to commit
的意思
这条信息对应如下的PSI_stage_info
PSI_stage_info stage_worker_waiting_for_its_turn_to_commit= { 0, "Waiting for preceding transaction to commit", 0};
查看合适进入stage_worker_waiting_for_its_turn_to_commit的状态,是在Commit_order_manager::wait_for_its_turn中
bool Commit_order_manager::wait_for_its_turn(Slave_worker *worker,bool all)
{DBUG_ENTER("Commit_order_manager::wait_for_its_turn");/*When prior transaction fail, current trx should stop and wait for signalto rollback itself*/if ((all || ending_single_stmt_trans(worker->info_thd, all) || m_rollback_trx) &&m_workers[worker->id].status == OCS_WAIT){PSI_stage_info old_stage;mysql_cond_t *cond= &m_workers[worker->id].cond;THD *thd= worker->info_thd;DBUG_PRINT("info", ("Worker %lu is waiting for commit signal", worker->id));mysql_mutex_lock(&m_mutex);thd->ENTER_COND(cond, &m_mutex,&stage_worker_waiting_for_its_turn_to_commit,&old_stage);
其实Commit_order_manager这个类是为了解决slave并行回放时,slave binlog乱序的问题。就是slave的binlog提交顺序严格按照主库上的提交顺序来,那么如果两个事务可以并行回放,并且编号更大的事务执行的比更小的事务要快,则此事务进入order commit阶段时必须等待事务编号较小的事务。
- 根据如上所述可以分析出必然有事务未进入到提交阶段,可以查看其他的sql回放线程
如下,但是发现此事务已经被flush tables阻塞了,
| 6 | system user | | NULL | Connect | 2319 | Waiting for global read lock | insert into ashe(id,name) values(40,'aaa') |
其实出问题的时候,可以通过查询全局的gtid_owned来验证猜测,如下:
gtid_owned | cdfe45e6-c227-11e8-abf5-001c42bf9720:19#6:20#7
可以看到线程6当前执行的事务的gtid的sid为19,线程7的为20,线程7确实是在等待线程6.
3.查看线程6的状态
Waiting for global read lock
这证明是在sql执行时,打开表被阻塞。
4.基本上可以定位到问题了,
-
flush tables with read lock 被线程7阻塞
-
线程7要等待线程6,保证顺序提交
-
线程6等待 flush tables with read lock。
三、解决办法
猜想一下,能用什么办法解决呢?
- 假设一:stop slave
- 假设二:kill flush tables
- 假设三:kill sql回放线程
四、如何复现的?
需要对多线程复制,事务提交等有比较清晰的认识才可以。
这篇关于多线程回放+flush tables with read lock 死锁的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!