NameNode HA Incident Investigation

【Cause】At 22:31:04 we were notified that the cluster was malfunctioning and could not process scheduled jobs; after logging in remotely, we found that the HDFS NameNodes had gone split-brain.

NameNode hosts: name21.hadoop, name22.hadoop
Hadoop version: 2.7.1

【Incident Report】

1.【Time】=> 22:31:04
【HA State】=> name21.hadoop:Active, name22.hadoop:Standby
【Log】=> name21.hadoop:/var/log/hadoop/hdfs/gc.log-xxxxxxxxxx

At this point the Active NameNode went through an abnormal JVM GC; a normal GC pause takes under 2s, but this one took 32.46s:

2018-02-11T22:31:04.707+0800: 801030.875: [GC2018-02-11T22:31:04.707+0800: 801030.875: [ParNew (promotion failed): 1550355K->1544189K(1710528K), 0.8542820 secs]
2018-02-11T22:31:05.562+0800: 801031.729: [CMS: 12100861K->3636381K(13303808K), 31.6119450 secs] 13650016K->3636381K(15014336K), [CMS Perm : 55251K->55008K(131072K)], 32.4666280 secs] [Times: user=33.04 sys=0.00, real=32.46 secs]
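The "ParNew (promotion failed)" above means the young generation could not promote survivors into the old generation, forcing a stop-the-world CMS collection. Below is a minimal hadoop-env.sh sketch of standard HotSpot CMS options aimed at this failure mode; the heap sizes are placeholders (the log implies a ~15 GB heap), not this cluster's actual settings:

# hadoop-env.sh (sketch): start CMS earlier and only at the occupancy
# threshold, so the old gen is less likely to fill/fragment before a
# promotion; keep GC logging on so pauses like the 32s one stay visible.
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
  -Xms15g -Xmx15g -Xmn2g \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hadoop/hdfs/gc.log-$(date +%s)"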

2.【Time】=> 22:31:37
【HA State】=> name21.hadoop:Active, name22.hadoop:Standby
【Log】=> name21.hadoop:/var/log/hadoop/hdfs/hadoop-hdfs-namenode-name21.hadoop.log

Because the preceding GC pause was so long, the Active NameNode's edit-log writes to the JournalNodes could not be acknowledged in time and timed out (the default timeout is 20s; the GC took 32s):

2018-02-11 22:31:37,201 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [xxx.xx.xx.xx:8485, xxx.xx.xx.xx:8485, xxx.xx.xx.xx:8485], stream=QuorumOutputStream starting at txid 3108262584)) java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond. 
2018-02-11 22:31:37,208 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(191)) - Detected pause in JVM or host machine (eg GC): pause of approximately 32158ms GC pool 'ParNew' had collection(s): count=1 time=855ms GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=31612ms 
2018-02-11 22:31:37,211 INFO namenode.FSNamesystem (FSNamesystem.java:rollEditLog(5937)) - Roll Edit Log from 172.21.54.22 
2018-02-11 22:31:37,211 INFO namenode.FSEditLog (FSEditLog.java:rollEditLog(1202)) - Rolling edit logs
2018-02-11 22:31:37,211 INFO namenode.FSEditLog (FSEditLog.java:endCurrentLogSegment(1258)) - Ending log segment 3108262584
2018-02-11 22:31:37,228 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1

At this point the Active NameNode was in a broken state: the process had not fully exited, yet the RPC port 8021 it had opened was no longer listening.
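The 20000ms in the FATAL line is the quorum write timeout, dfs.qjournal.write-txns.timeout.ms. A quick way to sanity-check the effective value on the NameNode host (a sketch; hdfs getconf resolves the loaded configuration):

# Default is 20000 (20s); the 32s GC pause comfortably exceeded it.
hdfs getconf -confKey dfs.qjournal.write-txns.timeout.ms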

3.【Time】=> 22:32:22
【HA State】=> name21.hadoop:Active, name22.hadoop:Standby
【Log】=> name21.hadoop:/var/log/hadoop/hdfs/hadoop-hdfs-zkfc-name21.hadoop.log

At this point the ZKFailoverController on name21.hadoop could no longer get the state of its local NameNode [so it could not transition it from Active to Standby] and kept retrying every 45s from then on; meanwhile its ZooKeeper session was closed:

2018-02-11 22:32:22,267 INFO ha.HealthMonitor (HealthMonitor.java:enterState(249)) - Entering state SERVICE_NOT_RESPONDING 
2018-02-11 22:32:22,267 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(852)) - Local service NameNode at name21.hadoop/xxx.xx.xx.xx:8021 entered state: SERVICE_NOT_RESPONDING 
2018-02-11 22:32:22,267 INFO ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(768)) - Quitting master election for NameNode at name21.hadoop2/ xxx.xx.xx.xx:8021 and marking that fencing is necessary 
2018-02-11 22:32:22,267 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(401)) - Yielding from election 
2018-02-11 22:32:22,268 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(830)) - Terminating ZK connection for elector id=660690290 appData=0a06636261734e4112036e6e311a0e6e616d6532312e6861646f6f703220d53e28d33e cb=Elector callbacks for NameNode at name21.hadoop2/172.21.54.21:8021 
2018-02-11 22:32:22,270 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x15f5451238a08b3 closed 
2018-02-11 22:32:22,271 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:isStaleClient(1066)) - Ignoring stale result from old client with sessionId 0x15f5451238a08b3 
2018-02-11 22:32:22,271 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(524)) - EventThread shut down  
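The 45s retry cadence above, and the 5s fencing timeout seen in the next step, are not magic numbers; they correspond to stock HA settings. A sketch for inspecting them (standard property names; the defaults are 45000ms and 5000ms respectively):

# Health-monitor RPC timeout drives the 45s retry interval above.
hdfs getconf -confKey ha.health-monitor.rpc-timeout.ms
# Graceful-fence RPC timeout explains the "5000 millis timeout" in step 4.
hdfs getconf -confKey ha.failover-controller.graceful-fence.rpc-timeout.ms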

4.【Time】=> 22:32:22
【HA State】=> name21.hadoop:Active, name22.hadoop:Standby
【Log】=> name22.hadoop:/var/log/hadoop/hdfs/hadoop-hdfs-zkfc-name22.hadoop.log

Meanwhile, name22.hadoop's side still held a live ZooKeeper session, and its ZKFailoverController tried to connect to the NameNode on name21.hadoop to demote it from Active to Standby, but the call failed with a connection timeout:

2018-02-11 22:32:22,286 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:fenceOldActive(931)) - Checking for any old active which needs to be fenced... 
2018-02-11 22:32:22,290 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:fenceOldActive(952)) - Old node exists: 0a06636261734e4112036e6e311a0e6e616d6532312e6861646f6f703220d53e28d33e
2018-02-11 22:32:22,294 INFO ha.ZKFailoverController (ZKFailoverController.java:doFence(513)) - Should fence: NameNode at name21.hadoop/ xxx.xx.xx.xx:8021 
2018-02-11 22:32:27,304 WARN ha.FailoverController (FailoverController.java:tryGracefulFence(178)) - Unable to gracefully make NameNode at name21.hadoop/ xxx.xx.xx.xx:8021 standby (unable to connect) java.net.SocketTimeoutException: Call From name22.hadoop/ xxx.xx.xx.xx to name21.hadoop:8021 failed on socket timeout exception: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/ xxx.xx.xx.xx:56511 remote=name21.hadoop/ xxx.xx.xx.xx:8021];
......
Caused by: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/ xxx.xx.xx.xx:56511 remote=name21.hadoop/ xxx.xx.xx.xx:8021]
......

It then ran the fencing step (the cluster had no real fencing configured; /bin/true simply returns success) and transitioned the NameNode on name22.hadoop from Standby to Active:

2018-02-11 22:32:27,323 INFO ha.NodeFencer (NodeFencer.java:fence(91)) - ====== Beginning Service Fencing Process... ====== 
2018-02-11 22:32:27,323 INFO ha.NodeFencer (NodeFencer.java:fence(94)) - Trying method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true) 
2018-02-11 22:32:27,339 INFO ha.ShellCommandFencer (ShellCommandFencer.java:tryFence(99)) - Launched fencing command '/bin/true' with pid 48839 
2018-02-11 22:32:27,359 INFO ha.NodeFencer (NodeFencer.java:fence(98)) - ====== Fencing successful by method org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ====== 
2018-02-11 22:32:27,360 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:writeBreadCrumbNode(878)) - Writing znode /hadoop-ha/cbasNA/ActiveBreadCrumb to indicate that the local node is the most recent active... 
2018-02-11 22:32:27,365 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(380)) - Trying to make NameNode at name22.hadoop/ xxx.xx.xx.xx :8021 active... 
2018-02-11 22:32:46,337 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(387)) - Successfully transitioned NameNode at name22.hadoop/ xxx.xx.xx.xx :8021 to active state
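Note that fencing "succeeded" only because the configured method was shell(/bin/true), which always exits 0 and never actually isolates the old Active. A sketch for checking what is configured; sshfence (together with dfs.ha.fencing.ssh.private-key-files) is the stock method that would really have killed the stuck NameNode process on name21.hadoop:

# Shows the configured fencing chain; here it would presumably
# print shell(/bin/true).
hdfs getconf -confKey dfs.ha.fencing.methods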

5.【Time】=> 22:32:46
【HA State】=> name21.hadoop:Active, name22.hadoop:Active
At this point the cluster had two Active NameNodes: split-brain.

6.【Recovery】
When the ops team later stepped in, they tried to restart the NameNode, but the restart stayed stuck at 35% progress and could not continue.

Cause: ops first tried to restart the NameNode on name22.hadoop; the restart script detected two Actives and kept re-polling the HA state, waiting until only one node was Active:

NameNode HA states: active_namenodes = [('nn1', 'name21.hadoop:50070'), ('nn2', 'name22.hadoop:50070')], standby_namenodes = [], unknown_namenodes = []

The situation at that point was:
the NameNode on name21.hadoop was broken (connections to port 8021 timed out), so it could not demote itself from Active to Standby;
the NameNode on name22.hadoop had already been elected Active by the ZKFailoverController, precisely because name21.hadoop's port 8021 was unreachable.

The fix, therefore, was to restart the NameNode on name21.hadoop (bringing port 8021 back); on startup it checks the state of the NameNode on name22.hadoop (via name22.hadoop's port 8021) and switches itself from Active to Standby. name22.hadoop needed no restart. The same check, and a forced demotion if it were ever needed, can be done with hdfs haadmin, as sketched below.
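A sketch using the stock CLI, once the NameNode on name21.hadoop is back up and port 8021 is reachable (the service IDs nn1/nn2 are taken from the restart log above):

# Query each NameNode's HA state.
hdfs haadmin -getServiceState nn1   # name21.hadoop
hdfs haadmin -getServiceState nn2   # name22.hadoop
# Last resort if both still report active: manually demote one, bypassing
# the failover controller's coordination (use with care).
hdfs haadmin -transitionToStandby --forcemanual nn1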

7.【Prevention】
a. Tune the GC parameters, add GC monitoring on the NameNodes, and find and fix the root cause of the abnormal GC (see the hadoop-env.sh sketch in step 1);
b. Raise the JournalNode write timeout (dfs.qjournal.write-txns.timeout.ms); the current default of 20s is too tight, and for production it is worth raising to 60s;
c. Configure a real fencing method instead of /bin/true;
d. Monitor the HA state of the cluster's NameNodes (a polling sketch follows this list).
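For item d, a minimal polling sketch: it reads each NameNode's HA state from the standard NameNodeStatus JMX bean over HTTP (hosts/ports are the ones in this report) and flags anything other than exactly one Active; the final echo is a placeholder for real alerting:

#!/bin/bash
# Split-brain watchdog: count Active NameNodes via the /jmx servlet.
NODES="name21.hadoop:50070 name22.hadoop:50070"
active=0
for n in $NODES; do
  state=$(curl -s --max-time 5 "http://$n/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus" \
    | grep -o '"State" : "[a-z]*"' | cut -d'"' -f4)
  echo "$n => ${state:-unreachable}"
  [ "$state" = "active" ] && active=$((active + 1))
done
# Exactly one Active is healthy; zero or two should page someone.
[ "$active" -eq 1 ] || echo "ALERT: $active active NameNode(s) detected"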


Original author: 伍柒大人的三言两语
Original article: https://www.jianshu.com/p/ecf5025302f9