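A MySQL High-Availability Solution: MHA

A Brief Introduction to MHA

  • MHA stands for Master High Availability, that is, high availability for the master node. It is a fairly mature MySQL high-availability solution. A manager node monitors the state of the master and the slaves, and when the master fails it automatically promotes the slave whose data is closest to the master's, which gives you automatic failover.
  • MHA is written in Perl. Building it requires a few extra packages, but the build itself is very simple.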

Lab Topology

[Figure: MHA lab topology (MHA.gif)]

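Key Points and Background

  • MHA has two main components. The Manager component acts as a supervisor.
    The Node component is installed on each database node, one of which is the Master.
  • When the master fails, MHA has to switch masters automatically, which requires administrator privileges on the nodes. All nodes therefore need to reach one another over SSH with key-based authentication.
  • Most of the MHA configuration is done on the manager.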
Hostname   IP address       Role
node1      192.168.2.201    Master node, node component installed
node2      192.168.2.202    Slave node, node component installed
node3      192.168.2.203    Slave node, node component installed
node4      192.168.2.204    Manager component installed

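This article uses CentOS 7.1 with MariaDB 5.5.50.
For the details of configuring semi-synchronous replication, see my previous article. To save space, this article focuses on installing, configuring, and using the MHA components.
Because the database is MariaDB 5.5.50, I chose to build mha4mysql-0.56 from Google Code.
Note: SELinux and iptables are disabled in this setup.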
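
The SSH key authentication mentioned above is not shown in this article, so here is a minimal sketch of how it could be prepared. The node list comes from the topology table; the key path and the use of ssh-keygen and ssh-copy-id are my assumptions about a typical setup, not steps taken from the original lab. The function only prints the commands so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch: print the commands that give root on one node passwordless SSH
# access to every node. NODES and KEY are assumptions for this lab.
NODES="node1 node2 node3 node4"
KEY="${HOME:-/root}/.ssh/id_rsa"

print_ssh_setup() {
    # Create a key pair without a passphrase only if none exists yet.
    echo "[ -f $KEY ] || ssh-keygen -t rsa -N '' -f $KEY"
    for host in $NODES; do
        # ssh-copy-id appends the public key to the remote authorized_keys.
        echo "ssh-copy-id -i $KEY.pub root@$host"
    done
}

print_ssh_setup
```

Run the printed commands on each of the four nodes in turn; every ssh-copy-id call asks for the remote root password once.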

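Building and Installing from the Perl Source

Download locations for the latest MHA:
mha4mysql-manager
mha4mysql-node
A side note
The code was originally hosted on Google Code, and even the introductory pages lived there.
With the rise of GitHub, much software has moved over, and many of the rpm packages on Google Code are no longer available.
I did not want to install rpm packages of unknown origin in a real environment, so I chose to build from the Perl source.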

(1) Install the build environment on every node

yum  -y install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Config-IniFiles  ncftp perl-Params-Validate  perl-CPAN perl-Test-Mock-LWP.noarch perl-LWP-Authen-Negotiate.noarch perl-devel perl-ExtUtils-CBuilder perl-ExtUtils-MakeMaker 

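(2) Install the manager component on node4
a. Run perl Makefile.PL to check the build environment; it plays a role similar to ./configure.
node1 to node3, the three database nodes configured for semi-synchronous replication, get the node component instead, but the same two steps apply there.
They rarely fail, and the node component needs no extra configuration, so I will not repeat the demonstration.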

[root@node4 mha4mysql-manager-0.56]# perl Makefile.PL 
*** Module::AutoInstall version 1.03
*** Checking for Perl dependencies...
[Core Features]
- DBI                   ...loaded. (1.627)
- DBD::mysql            ...loaded. (4.023)
- Time::HiRes           ...loaded. (1.9725)
- Config::Tiny          ...loaded. (2.14)
- Log::Dispatch         ...loaded. (2.41)
- Parallel::ForkManager ...loaded. (1.05)
- MHA::NodeConst        ...loaded. (0.56)
*** Module::AutoInstall configuration finished.
Writing Makefile for mha4mysql::manager

b. Run make && make install to install

[root@node4 mha4mysql-manager-0.56]# make&&make install
Skip blib/lib/MHA/ManagerUtil.pm (unchanged)
Skip blib/lib/MHA/Config.pm (unchanged)
Skip blib/lib/MHA/HealthCheck.pm (unchanged)
Skip blib/lib/MHA/ManagerConst.pm (unchanged)
Skip blib/lib/MHA/ServerManager.pm (unchanged)
Skip blib/lib/MHA/ManagerAdmin.pm (unchanged)
Skip blib/lib/MHA/FileStatus.pm (unchanged)
Skip blib/lib/MHA/ManagerAdminWrapper.pm (unchanged)
Skip blib/lib/MHA/MasterFailover.pm (unchanged)
Skip blib/lib/MHA/MasterRotate.pm (unchanged)
Skip blib/lib/MHA/MasterMonitor.pm (unchanged)
Skip blib/lib/MHA/SSHCheck.pm (unchanged)
Skip blib/lib/MHA/Server.pm (unchanged)
Skip blib/lib/MHA/DBHelper.pm (unchanged)
cp bin/masterha_stop blib/script/masterha_stop
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_stop
cp bin/masterha_conf_host blib/script/masterha_conf_host
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_conf_host
cp bin/masterha_check_repl blib/script/masterha_check_repl
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_check_repl
cp bin/masterha_check_status blib/script/masterha_check_status
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_check_status
cp bin/masterha_master_monitor blib/script/masterha_master_monitor
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_master_monitor
cp bin/masterha_check_ssh blib/script/masterha_check_ssh
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_check_ssh
cp bin/masterha_master_switch blib/script/masterha_master_switch
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_master_switch
cp bin/masterha_secondary_check blib/script/masterha_secondary_check
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_secondary_check
cp bin/masterha_manager blib/script/masterha_manager
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_manager
Manifying blib/man1/masterha_stop.1
Manifying blib/man1/masterha_conf_host.1
Manifying blib/man1/masterha_check_repl.1
Manifying blib/man1/masterha_check_status.1
Manifying blib/man1/masterha_master_monitor.1
Manifying blib/man1/masterha_check_ssh.1
Manifying blib/man1/masterha_master_switch.1
Manifying blib/man1/masterha_secondary_check.1
Manifying blib/man1/masterha_manager.1
Appending installation info to /usr/lib64/perl5/perllocal.pod
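
For completeness, the same two steps on a database node could be wrapped roughly like this. It is only a sketch: SRC_DIR is an assumed location for the unpacked mha4mysql-node-0.56 source, and the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: build and install the node component on node1, node2, or node3.
# SRC_DIR is an assumption; adjust it to where the source was unpacked.
SRC_DIR="${SRC_DIR:-/root/mha4mysql-node-0.56}"

install_mha_node() {
    cd "$1" || return 1
    perl Makefile.PL || return 1   # check Perl dependencies, write the Makefile
    make && make install           # build, then install the scripts and modules
}

# Only build when the source directory is actually present.
if [ -d "$SRC_DIR" ]; then
    install_mha_node "$SRC_DIR"
else
    echo "source directory $SRC_DIR not found; nothing to do"
fi
```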

Database Node Configuration

MariaDB configuration file for Node1, the semi-synchronous replication Master

[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
# Settings user and group are ignored when systemd is used.
# If you need to run mysqld under a different user or group,
# customize your systemd unit file for mariadb according to the
# instructions in http://fedoraproject.org/wiki/Systemd
innodb_file_per_table = 1
skip_name_resolve = 1
log_bin = Master-log
log_bin_index = 1
server_id = 1
relay_log=relay-log
relay_log_purge=0
#skip-grant-tables
#skip-networking

[mysqld_safe]
log-error=/var/log/mariadb/mariadb.log
pid-file=/var/run/mariadb/mariadb.pid

#
# include all files from the config directory
#
!includedir /etc/my.cnf.d

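Points to note here:
With semi-synchronous replication, both the master and the slaves must enable the binary log (log_bin = Master-log on the master) and the relay log (relay_log=relay-log).
Automatic relay-log purging is also switched off (relay_log_purge=0), because relay-log cleanup is left to MHA.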
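
With automatic purging off, relay logs accumulate until something removes them. The node package ships a purge_relay_logs script for this, and a cron entry is one way to run it. The sketch below only builds such a crontab line; the schedule, the install path /usr/local/bin, and the log file are my assumptions, and keeping a password on the command line is only acceptable in a lab.

```shell
#!/usr/bin/env bash
# Sketch: build a crontab line that runs MHA's purge_relay_logs once a day.
# PURGE_BIN and LOG_FILE are assumed paths, not taken from the original lab.
PURGE_BIN="/usr/local/bin/purge_relay_logs"
LOG_FILE="/var/log/purge_relay_logs.log"

# $1 = database user, $2 = database password
purge_cron_line() {
    printf '0 5 * * * %s --user=%s --password=%s --disable_relay_log_purge >> %s 2>&1\n' \
        "$PURGE_BIN" "$1" "$2" "$LOG_FILE"
}

purge_cron_line root 123456789
```

The printed line can be added with crontab -e on each slave, ideally at a different minute on each node so the slaves do not all purge at once.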

MariaDB configuration file for Node2 and Node3, the semi-synchronous replication Slaves

[mysqld]
datadir=/var/lib/mysql/
socket=/var/lib/mysql/mysql.sock
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
# Settings user and group are ignored when systemd is used.
# If you need to run mysqld under a different user or group,
# customize your systemd unit file for mariadb according to the
# instructions in http://fedoraproject.org/wiki/Systemd
skip_name_resolve=true
innodb_file_per_table=true
# server_id must be unique on every node: 2 on node2, 3 on node3
server_id = 2
log_bin=bin_log
relay_log=relay-log
read_only = 1
relay_log_purge=0

[mysqld_safe]
log-error=/var/log/mariadb/mariadb.log
pid-file=/var/run/mariadb/mariadb.pid

#
# include all files from the config directory
#
!includedir /etc/my.cnf.d

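Compared with the Master, the slaves have one extra setting: read_only=1.
If a Slave is promoted to Master, MHA automatically turns read_only off on it
and repoints the other Slaves at the new master, which you can confirm with show slave status\G.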

Manager Node Configuration

(1) Copy the default file as a template, then empty the default configuration

cp /etc/masterha/masterha_default.cnf  /etc/masterha/app1.cnf
> /etc/masterha/masterha_default.cnf

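(2) Edit /etc/masterha/app1.cnf, the file passed to the manager process at startup.
A single MHA manager node can monitor several MHA clusters by running several processes, hence the app1, app2 naming.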

[server default]
#manager_workdir=/var/log/masterha/app1
#manager_log=/var/log/masterha/app1/manager.log
user=root
password=123456789
manager_workdir=/data/masterha/app1
manager_log=/data/masterha/app1/manager.log
remote_workdir=/data/masterha/app1
ssh_user=root
repl_user=repuser
repl_password=repuser
ping_interval=1

[server1]
hostname=node1
candidate_master=1

[server2]
hostname=node2
candidate_master=1

[server3]
hostname=node3

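Here user and password are the database administrator account and its password.
repl_user and repl_password are the account with replication privileges and its password.
ssh_user=root is the SSH account; since key authentication is used, no password is needed.
hostname=node1 works because the manager can reach that host under the name node1; an IP address can be used here instead.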
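
The two database accounts have to exist before MHA can use them. Here is a minimal sketch of the grants, using the account names and passwords from app1.cnf. The '192.168.2.%' host pattern is my assumption based on the lab subnet; because skip_name_resolve is enabled, the grants need an IP pattern rather than host names.

```shell
#!/usr/bin/env bash
# Sketch: SQL for the accounts referenced in app1.cnf.
# The host pattern '192.168.2.%' is an assumption for this lab network.
mha_grants_sql() {
    cat <<'SQL'
GRANT ALL PRIVILEGES ON *.* TO 'root'@'192.168.2.%' IDENTIFIED BY '123456789';
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'repuser'@'192.168.2.%' IDENTIFIED BY 'repuser';
FLUSH PRIVILEGES;
SQL
}

mha_grants_sql
```

Running the output on the master, for example with mha_grants_sql | mysql -uroot -p, should be enough, since the statements replicate to the slaves.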

(3) Create the working directory named by manager_workdir in the configuration file

mkdir -p /data/masterha/app1/

Checking the Environment with MHA's Tools

(1) Check that SSH connectivity works

[root@node4 mha4mysql-manager-0.56]# masterha_check_ssh --conf=/etc/masterha/app1.cnf
Thu Nov 10 22:59:03 2016 - Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Nov 10 22:59:03 2016 - Reading application default configuration from /etc/masterha/app1.cnf..
Thu Nov 10 22:59:03 2016 - Reading server configuration from /etc/masterha/app1.cnf..
Thu Nov 10 22:59:03 2016 - Starting SSH connection tests..
Thu Nov 10 22:59:04 2016 - [debug]
Thu Nov 10 22:59:03 2016 - [debug] Connecting via SSH from root@node1(192.168.2.201:22) to root@node2(192.168.2.202:22)..
Thu Nov 10 22:59:03 2016 - [debug] ok.
Thu Nov 10 22:59:03 2016 - [debug] Connecting via SSH from root@node1(192.168.2.201:22) to root@node3(192.168.2.203:22)..
Thu Nov 10 22:59:03 2016 - [debug] ok.
Thu Nov 10 22:59:04 2016 - [debug]
Thu Nov 10 22:59:03 2016 - [debug] Connecting via SSH from root@node2(192.168.2.202:22) to root@node1(192.168.2.201:22)..
Thu Nov 10 22:59:04 2016 - [debug] ok.
Thu Nov 10 22:59:04 2016 - [debug] Connecting via SSH from root@node2(192.168.2.202:22) to root@node3(192.168.2.203:22)..
Thu Nov 10 22:59:04 2016 - [debug] ok.
Thu Nov 10 22:59:05 2016 - [debug]
Thu Nov 10 22:59:04 2016 - [debug] Connecting via SSH from root@node3(192.168.2.203:22) to root@node1(192.168.2.201:22)..
Thu Nov 10 22:59:04 2016 - [debug] ok.
Thu Nov 10 22:59:04 2016 - [debug] Connecting via SSH from root@node3(192.168.2.203:22) to root@node2(192.168.2.202:22)..
Thu Nov 10 22:59:05 2016 - [debug] ok.
Thu Nov 10 22:59:05 2016 - All SSH connection tests passed successfully.

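The output is long, but the last line alone tells you whether SSH is working.
Note that the command points at the app1 configuration file created above.

(2) Check that replication works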

[root@node4 mha4mysql-manager-0.56]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Thu Nov 10 23:07:35 2016 - Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Nov 10 23:07:35 2016 - Reading application default configuration from /etc/masterha/app1.cnf..
Thu Nov 10 23:07:35 2016 - Reading server configuration from /etc/masterha/app1.cnf..
Thu Nov 10 23:07:35 2016 - MHA::MasterMonitor version 0.56.
Thu Nov 10 23:07:35 2016 - GTID failover mode = 0
Thu Nov 10 23:07:35 2016 - Dead Servers:
Thu Nov 10 23:07:35 2016 - Alive Servers:
Thu Nov 10 23:07:35 2016 -   node1(192.168.2.201:3306)
Thu Nov 10 23:07:35 2016 -   node2(192.168.2.202:3306)
Thu Nov 10 23:07:35 2016 -   node3(192.168.2.203:3306)
Thu Nov 10 23:07:35 2016 - Alive Slaves:
Thu Nov 10 23:07:35 2016 -   node2(192.168.2.202:3306) Version=5.5.50-MariaDB (oldest major version between slaves) log-bin:enabled
Thu Nov 10 23:07:35 2016 -     Replicating from 192.168.2.201(192.168.2.201:3306)
Thu Nov 10 23:07:35 2016 -     Primary candidate for the new Master (candidate_master is set)
Thu Nov 10 23:07:35 2016 -   node3(192.168.2.203:3306) Version=5.5.50-MariaDB (oldest major version between slaves) log-bin:enabled
Thu Nov 10 23:07:35 2016 -     Replicating from 192.168.2.201(192.168.2.201:3306)
Thu Nov 10 23:07:35 2016 - Current Alive Master: node1(192.168.2.201:3306)
Thu Nov 10 23:07:35 2016 - Checking slave configurations..
Thu Nov 10 23:07:35 2016 - relay_log_purge=0 is not set on slave node3(192.168.2.203:3306).
Thu Nov 10 23:07:35 2016 - Checking replication filtering settings..
Thu Nov 10 23:07:35 2016 - binlog_do_db= , binlog_ignore_db=
Thu Nov 10 23:07:35 2016 - Replication filtering check ok.
Thu Nov 10 23:07:35 2016 - GTID (with auto-pos) is not supported
Thu Nov 10 23:07:35 2016 - Starting SSH connection tests..
Thu Nov 10 23:07:37 2016 - All SSH connection tests passed successfully.
Thu Nov 10 23:07:37 2016 - Checking MHA Node version..
Thu Nov 10 23:07:37 2016 - Version check ok.
Thu Nov 10 23:07:37 2016 - Checking SSH publickey authentication settings on the current master..
Thu Nov 10 23:07:37 2016 - HealthCheck: SSH to node1 is reachable.
Thu Nov 10 23:07:37 2016 - Master MHA Node version is 0.56.
Thu Nov 10 23:07:37 2016 - Checking recovery script configurations on node1(192.168.2.201:3306)..
Thu Nov 10 23:07:37 2016 - Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/var/lib/mysql,/var/log/mysql --output_file=/data/masterha/app1/save_binary_logs_test --manager_version=0.56 --start_file=Master-log.000006
Thu Nov 10 23:07:37 2016 - Connecting to root@192.168.2.201(node1:22)..
  Creating /data/masterha/app1 if not exists.. ok.
  Checking output directory is accessible or not.. ok.
  Binlog found at /var/lib/mysql, up to Master-log.000006
Thu Nov 10 23:07:38 2016 - Binlog setting check done.
Thu Nov 10 23:07:38 2016 - Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Thu Nov 10 23:07:38 2016 - Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=node2 --slave_ip=192.168.2.202 --slave_port=3306 --workdir=/data/masterha/app1 --target_version=5.5.50-MariaDB --manager_version=0.56 --relay_log_info=/var/lib/mysql/relay-log.info --relay_dir=/var/lib/mysql/ --slave_pass=xxx
Thu Nov 10 23:07:38 2016 - Connecting to root@192.168.2.202(node2:22)..
  Checking slave recovery environment settings..
    Opening /var/lib/mysql/relay-log.info ... ok.
    Relay log found at /var/lib/mysql, up to relay-log.000004
    Temporary relay log file is /var/lib/mysql/relay-log.000004
    Testing mysql connection and privileges.. done.
    Testing mysqlbinlog output.. done.
    Cleaning up test file(s).. done.
Thu Nov 10 23:07:38 2016 - Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=node3 --slave_ip=192.168.2.203 --slave_port=3306 --workdir=/data/masterha/app1 --target_version=5.5.50-MariaDB --manager_version=0.56 --relay_log_info=/var/lib/mysql/relay-log.info --relay_dir=/var/lib/mysql/ --slave_pass=xxx
Thu Nov 10 23:07:38 2016 - Connecting to root@192.168.2.203(node3:22)..
  Checking slave recovery environment settings..
    Opening /var/lib/mysql/relay-log.info ... ok.
    Relay log found at /var/lib/mysql, up to relay-log.000002
    Temporary relay log file is /var/lib/mysql/relay-log.000002
    Testing mysql connection and privileges.. done.
    Testing mysqlbinlog output.. done.
    Cleaning up test file(s).. done.
Thu Nov 10 23:07:38 2016 - Slaves settings check done.
Thu Nov 10 23:07:38 2016 -
node1(192.168.2.201:3306) (current master)
 +--node2(192.168.2.202:3306)
 +--node3(192.168.2.203:3306)
Thu Nov 10 23:07:38 2016 - Checking replication health on node2..
Thu Nov 10 23:07:38 2016 - ok.
Thu Nov 10 23:07:38 2016 - Checking replication health on node3..
Thu Nov 10 23:07:38 2016 - ok.
Thu Nov 10 23:07:38 2016 - master_ip_failover_script is not defined.
Thu Nov 10 23:07:38 2016 - shutdown_script is not defined.
Thu Nov 10 23:07:38 2016 - Got exit code 0 (Not master dead).
MySQL Replication Health is OK.

(3) The most exciting moment: start the service!

[root@node4 mha4mysql-manager-0.56]# nohup masterha_manager --conf=/etc/masterha/app1.cnf > /data/masterha/app1/manager.log 2>&1 &
[1] 8463

(4) Check that masterha is running properly and see which node is the master.

[root@node4 mha4mysql-manager-0.56]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:8463) is running(0:PING_OK), master:node1

Simulating a Master Failure

(1) Stop MariaDB on the Master node, node1

systemctl stop mariadb.service

(2) Check the state of the manager node

[root@node4 mha4mysql-manager-0.56]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
[1]+  Done                    nohup masterha_manager --conf=/etc/masterha/app1.cnf > /data/masterha/app1/manager.log 2>&1

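This shows that the MHA program masterha_manager has exited.
One more thing to note: a file named app1.failover.complete is created in the working directory /data/masterha/app1/.
It is best to delete this file before starting the manager again, otherwise startup may fail.

(3) Check the slave status on node3: node3 now points at the new Master.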

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.2.202
                  Master_User: repuser
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: bin_log.000002
          Read_Master_Log_Pos: 245
               Relay_Log_File: relay-log.000002
                Relay_Log_Pos: 527
        Relay_Master_Log_File: bin_log.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 245
              Relay_Log_Space: 815
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 2

(4) The read-only setting that node2 had as a slave has also been removed automatically.

MariaDB [(none)]> show variables like '%read_only%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | OFF   |
+---------------+-------+
1 row in set (0.00 sec)

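(5) Rebuilding after the failure
As we saw, when the original master fails, masterha_manager uses the state of the binary logs and relay logs to pick a new master, switches it from read-only to read-write, and then exits.
So what comes next?
a. Delete the failover.complete file in the working directory,
for example /data/masterha/app1/app1.failover.complete
b. Deal with the original master, node1:
wipe its database, take a full backup of node2, and restore it onto node1,
then configure node1 as a Slave pointing at the new master, node2.
c. Re-run the masterha_check tools to verify the environment, then restart MHA's main program, masterha_manager.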
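
Steps a to c above could look roughly like the following. This is a sketch under assumptions, not the exact commands used in the lab: the dump path is hypothetical, mysqldump with --master-data=1 is just one way to take the full backup, and --single-transaction only gives a consistent snapshot for InnoDB tables. The function prints the commands in order so they can be reviewed and run by hand.

```shell
#!/usr/bin/env bash
# Sketch: print the rebuild steps in the order they would be run.
NEW_MASTER="192.168.2.202"       # node2, the master promoted by MHA
APP_DIR="/data/masterha/app1"    # manager_workdir from app1.cnf
DUMP="/tmp/node2_full.sql"       # hypothetical dump location on node1

print_rebuild_steps() {
    cat <<EOF
# node4: remove the marker left by the previous failover
rm -f $APP_DIR/app1.failover.complete
# node1: set the replication source first, so the coordinates loaded from the dump are kept
mysql -uroot -p -e "CHANGE MASTER TO MASTER_HOST='$NEW_MASTER', MASTER_USER='repuser', MASTER_PASSWORD='repuser'"
# node1: full dump of node2; --master-data=1 embeds node2's binlog file and position
mysqldump -h $NEW_MASTER -uroot -p --all-databases --single-transaction --master-data=1 > $DUMP
mysql -uroot -p < $DUMP
mysql -uroot -p -e "START SLAVE; SET GLOBAL read_only=1"
# node4: re-check the environment, then restart the manager
masterha_check_ssh --conf=/etc/masterha/app1.cnf
masterha_check_repl --conf=/etc/masterha/app1.cnf
nohup masterha_manager --conf=/etc/masterha/app1.cnf > $APP_DIR/manager.log 2>&1 &
EOF
}

print_rebuild_steps
```

The CHANGE MASTER TO statement with MASTER_HOST comes before the restore on purpose: issuing it afterwards without explicit log coordinates would reset the position that the dump has just set. node1's configuration file should also gain read_only = 1 so the setting survives a restart.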

    Original author: 酱油菠菜
    Original URL: https://www.jianshu.com/p/eaf79591e719