NameNode HA Incident Investigation

【Cause】At 22:31:04 we were notified that the cluster was malfunctioning and could not process scheduled jobs; after logging in remotely, we found that the HDFS NameNodes had gone split-brain.

NameNode hosts: name21.hadoop, name22.hadoop
Hadoop version: 2.7.1

【Incident Report】

1.【Time】=> 22:31:04
【HA State】=> name21.hadoop:Active, name22.hadoop:Standby
【Log】=> name21.hadoop:/var/log/hadoop/hdfs/gc.log-xxxxxxxxxx

At this point the Active NameNode went through an abnormal JVM GC; a normal GC pause takes under 2s, but this one took 32.46s:

2018-02-11T22:31:04.707+0800: 801030.875: [GC2018-02-11T22:31:04.707+0800: 801030.875: [ParNew (promotion failed): 1550355K->1544189K(1710528K), 0.8542820 secs]
2018-02-11T22:31:05.562+0800: 801031.729: [CMS: 12100861K->3636381K(13303808K), 31.6119450 secs] 13650016K->3636381K(15014336K), [CMS Perm : 55251K->55008K(131072K)], 32.4666280 secs] [Times: user=33.04 sys=0.00, real=32.46 secs]
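The "ParNew (promotion failed)" above means the young generation could not promote survivors into the old generation, forcing a stop-the-world CMS collection. Below is a minimal hadoop-env.sh sketch of standard HotSpot CMS options aimed at this failure mode; the heap sizes are placeholders (the log implies a ~15 GB heap), not this cluster's actual settings:

# hadoop-env.sh (sketch): start CMS earlier and only at the occupancy
# threshold, so the old gen is less likely to fill/fragment before a
# promotion; keep GC logging on so pauses like the 32s one stay visible.
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
  -Xms15g -Xmx15g -Xmn2g \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hadoop/hdfs/gc.log-$(date +%s)"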

2.【Time】=> 22:31:37
【HA State】=> name21.hadoop:Active, name22.hadoop:Standby
【Log】=> name21.hadoop:/var/log/hadoop/hdfs/hadoop-hdfs-namenode-name21.hadoop.log

Because the preceding GC pause was so long, the Active NameNode's edit-log writes to the JournalNodes could not be acknowledged in time and timed out (the default timeout is 20s; the GC took 32s):

2018-02-11 22:31:37,201 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [xxx.xx.xx.xx:8485, xxx.xx.xx.xx:8485, xxx.xx.xx.xx:8485], stream=QuorumOutputStream starting at txid 3108262584)) java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond. 
2018-02-11 22:31:37,208 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(191)) - Detected pause in JVM or host machine (eg GC): pause of approximately 32158ms GC pool 'ParNew' had collection(s): count=1 time=855ms GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=31612ms 
2018-02-11 22:31:37,211 INFO namenode.FSNamesystem (FSNamesystem.java:rollEditLog(5937)) - Roll Edit Log from 172.21.54.22 
2018-02-11 22:31:37,211 INFO namenode.FSEditLog (FSEditLog.java:rollEditLog(1202)) - Rolling edit logs
2018-02-11 22:31:37,211 INFO namenode.FSEditLog (FSEditLog.java:endCurrentLogSegment(1258)) - Ending log segment 3108262584
2018-02-11 22:31:37,228 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1

At this point the Active NameNode was in a broken state: the process had not fully exited, yet the RPC port 8021 it had opened was no longer listening.
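The 20000ms in the FATAL line is the quorum write timeout, dfs.qjournal.write-txns.timeout.ms. A quick way to sanity-check the effective value on the NameNode host (a sketch; hdfs getconf resolves the loaded configuration):

# Default is 20000 (20s); the 32s GC pause comfortably exceeded it.
hdfs getconf -confKey dfs.qjournal.write-txns.timeout.ms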

3.【Time】=> 22:32:22
【HA State】=> name21.hadoop:Active, name22.hadoop:Standby
【Log】=> name21.hadoop:/var/log/hadoop/hdfs/hadoop-hdfs-zkfc-name21.hadoop.log

At this point the ZKFailoverController on name21.hadoop could no longer get the state of its local NameNode [so it could not transition it from Active to Standby] and kept retrying every 45s from then on; meanwhile its ZooKeeper session was closed:

2018-02-11 22:32:22,267 INFO ha.HealthMonitor (HealthMonitor.java:enterState(249)) - Entering state SERVICE_NOT_RESPONDING 
2018-02-11 22:32:22,267 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(852)) - Local service NameNode at name21.hadoop/xxx.xx.xx.xx:8021 entered state: SERVICE_NOT_RESPONDING 
2018-02-11 22:32:22,267 INFO ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(768)) - Quitting master election for NameNode at name21.hadoop2/ xxx.xx.xx.xx:8021 and marking that fencing is necessary 
2018-02-11 22:32:22,267 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(401)) - Yielding from election 
2018-02-11 22:32:22,268 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(830)) - Terminating ZK connection for elector id=660690290 appData=0a06636261734e4112036e6e311a0e6e616d6532312e6861646f6f703220d53e28d33e cb=Elector callbacks for NameNode at name21.hadoop2/172.21.54.21:8021 
2018-02-11 22:32:22,270 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x15f5451238a08b3 closed 
2018-02-11 22:32:22,271 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:isStaleClient(1066)) - Ignoring stale result from old client with sessionId 0x15f5451238a08b3 
2018-02-11 22:32:22,271 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(524)) - EventThread shut down  
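The 45s retry cadence above, and the 5s fencing timeout seen in the next step, are not magic numbers; they correspond to stock HA settings. A sketch for inspecting them (standard property names; the defaults are 45000ms and 5000ms respectively):

# Health-monitor RPC timeout drives the 45s retry interval above.
hdfs getconf -confKey ha.health-monitor.rpc-timeout.ms
# Graceful-fence RPC timeout explains the "5000 millis timeout" in step 4.
hdfs getconf -confKey ha.failover-controller.graceful-fence.rpc-timeout.ms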

4.【Time】=> 22:32:22
【HA State】=> name21.hadoop:Active, name22.hadoop:Standby
【Log】=> name22.hadoop:/var/log/hadoop/hdfs/hadoop-hdfs-zkfc-name22.hadoop.log

Meanwhile, name22.hadoop's side still held a live ZooKeeper session, and its ZKFailoverController tried to connect to the NameNode on name21.hadoop to demote it from Active to Standby, but the call failed with a connection timeout:

2018-02-11 22:32:22,286 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:fenceOldActive(931)) - Checking for any old active which needs to be fenced... 
2018-02-11 22:32:22,290 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:fenceOldActive(952)) - Old node exists: 0a06636261734e4112036e6e311a0e6e616d6532312e6861646f6f703220d53e28d33e
2018-02-11 22:32:22,294 INFO ha.ZKFailoverController (ZKFailoverController.java:doFence(513)) - Should fence: NameNode at name21.hadoop/ xxx.xx.xx.xx:8021 
2018-02-11 22:32:27,304 WARN ha.FailoverController (FailoverController.java:tryGracefulFence(178)) - Unable to gracefully make NameNode at name21.hadoop/ xxx.xx.xx.xx:8021 standby (unable to connect) java.net.SocketTimeoutException: Call From name22.hadoop/ xxx.xx.xx.xx to name21.hadoop:8021 failed on socket timeout exception: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/ xxx.xx.xx.xx:56511 remote=name21.hadoop/ xxx.xx.xx.xx:8021];
......
Caused by: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/ xxx.xx.xx.xx:56511 remote=name21.hadoop/ xxx.xx.xx.xx:8021]
......

It then ran the fencing step (the cluster had no real fencing configured; /bin/true simply returns success) and transitioned the NameNode on name22.hadoop from Standby to Active:

2018-02-11 22:32:27,323 INFO ha.NodeFencer (NodeFencer.java:fence(91)) - ====== Beginning Service Fencing Process... ====== 
2018-02-11 22:32:27,323 INFO ha.NodeFencer (NodeFencer.java:fence(94)) - Trying method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true) 
2018-02-11 22:32:27,339 INFO ha.ShellCommandFencer (ShellCommandFencer.java:tryFence(99)) - Launched fencing command '/bin/true' with pid 48839 
2018-02-11 22:32:27,359 INFO ha.NodeFencer (NodeFencer.java:fence(98)) - ====== Fencing successful by method org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ====== 
2018-02-11 22:32:27,360 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:writeBreadCrumbNode(878)) - Writing znode /hadoop-ha/cbasNA/ActiveBreadCrumb to indicate that the local node is the most recent active... 
2018-02-11 22:32:27,365 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(380)) - Trying to make NameNode at name22.hadoop/ xxx.xx.xx.xx :8021 active... 
2018-02-11 22:32:46,337 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(387)) - Successfully transitioned NameNode at name22.hadoop/ xxx.xx.xx.xx :8021 to active state
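Note that fencing "succeeded" only because the configured method was shell(/bin/true), which always exits 0 and never actually isolates the old Active. A sketch for checking what is configured; sshfence (together with dfs.ha.fencing.ssh.private-key-files) is the stock method that would really have killed the stuck NameNode process on name21.hadoop:

# Shows the configured fencing chain; here it would presumably
# print shell(/bin/true).
hdfs getconf -confKey dfs.ha.fencing.methods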

5.【Time】=> 22:32:46
【HA State】=> name21.hadoop:Active, name22.hadoop:Active
At this point the cluster had two Active NameNodes: split-brain.

6.【Recovery】
When the ops team later stepped in, they tried to restart the NameNode, but the restart stayed stuck at 35% progress and could not continue.

Cause: ops first tried to restart the NameNode on name22.hadoop; the restart script detected two Actives and kept re-polling the HA state, waiting until only one node was Active:

NameNode HA states: active_namenodes = [('nn1', 'name21.hadoop:50070'), ('nn2', 'name22.hadoop:50070')], standby_namenodes = [], unknown_namenodes = []

The situation at that point was:
the NameNode on name21.hadoop was broken (connections to port 8021 timed out), so it could not demote itself from Active to Standby;
the NameNode on name22.hadoop had already been elected Active by the ZKFailoverController, precisely because name21.hadoop's port 8021 was unreachable.

The fix, therefore, was to restart the NameNode on name21.hadoop (bringing port 8021 back); on startup it checks the state of the NameNode on name22.hadoop (via name22.hadoop's port 8021) and switches itself from Active to Standby. name22.hadoop needed no restart. The same check, and a forced demotion if it were ever needed, can be done with hdfs haadmin, as sketched below.
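A sketch using the stock CLI, once the NameNode on name21.hadoop is back up and port 8021 is reachable (the service IDs nn1/nn2 are taken from the restart log above):

# Query each NameNode's HA state.
hdfs haadmin -getServiceState nn1   # name21.hadoop
hdfs haadmin -getServiceState nn2   # name22.hadoop
# Last resort if both still report active: manually demote one, bypassing
# the failover controller's coordination (use with care).
hdfs haadmin -transitionToStandby --forcemanual nn1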

7.【Prevention】
a. Tune the GC parameters, add GC monitoring on the NameNodes, and find and fix the root cause of the abnormal GC (see the hadoop-env.sh sketch in step 1);
b. Raise the JournalNode write timeout (dfs.qjournal.write-txns.timeout.ms); the current default of 20s is too tight, and for production it is worth raising to 60s;
c. Configure a real fencing method instead of /bin/true;
d. Monitor the HA state of the cluster's NameNodes (a polling sketch follows this list).
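For item d, a minimal polling sketch: it reads each NameNode's HA state from the standard NameNodeStatus JMX bean over HTTP (hosts/ports are the ones in this report) and flags anything other than exactly one Active; the final echo is a placeholder for real alerting:

#!/bin/bash
# Split-brain watchdog: count Active NameNodes via the /jmx servlet.
NODES="name21.hadoop:50070 name22.hadoop:50070"
active=0
for n in $NODES; do
  state=$(curl -s --max-time 5 "http://$n/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus" \
    | grep -o '"State" : "[a-z]*"' | cut -d'"' -f4)
  echo "$n => ${state:-unreachable}"
  [ "$state" = "active" ] && active=$((active + 1))
done
# Exactly one Active is healthy; zero or two should page someone.
[ "$active" -eq 1 ] || echo "ALERT: $active active NameNode(s) detected"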


Original author: 伍柒大人的三言两语
Original article: https://www.jianshu.com/p/ecf5025302f9