Symptom: a Hive SQL statement fails
select count(*) from test_hive_table;
with the error:
Error: java.io.IOException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
Wed May 16 10:22:17 CST 2018, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68270: row '6495922803d09' on table 'test_hbase_table' at region=test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650., hostname=server121,16020,1526398855988, seqNum=47468
Direct cause 1: region state
The Hive table is an external table whose underlying data lives in HBase. State history of the affected region:
2018-05-16 03:18:16 onlined on RegionServer server121;
2018-05-16 06:44:13 RegionServer server121 died, region went offline; (offline for roughly 15 hours)
2018-05-16 21:25:33 onlined on RegionServer server92;
2018-05-17 04:10:44 RegionServer server92 died, region went offline; (offline for roughly 10 hours)
2018-05-17 14:27:33 onlined on RegionServer server114;
The query failed at 2018-05-16 10:22:17, inside the first offline window, so the direct cause is that one HBase region was unavailable at the time. That shifts the problem to HBase; the detailed HBase logs follow.
HMaster server132 log:
————————————–
2018-05-16 03:18:12,838 INFO [ProcedureExecutor-15] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OPEN, ts=1526370112734, server=server76,16020,1526343293931} to {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526411892838, server=server76,16020,1526343293931}
2018-05-16 03:18:15,861 INFO [server132,16000,1525800676562-GeneralBulkAssigner-9] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526411895787, server=server76,16020,1526343293931} to {6060407a4aa5e23c8df09614dd3fe650 state=PENDING_OPEN, ts=1526411895861, server=server121,16020,1526398855988}
2018-05-16 03:18:16,311 INFO [AM.ZK.Worker-pool2-t9606] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=PENDING_OPEN, ts=1526411895861, server=server121,16020,1526398855988} to {6060407a4aa5e23c8df09614dd3fe650 state=OPENING, ts=1526411896311, server=server121,16020,1526398855988}
2018-05-16 03:18:16,445 INFO [AM.ZK.Worker-pool2-t9610] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OPENING, ts=1526411896311, server=server121,16020,1526398855988} to {6060407a4aa5e23c8df09614dd3fe650 state=OPEN, ts=1526411896445, server=server121,16020,1526398855988}
2018-05-16 03:18:16,480 INFO [AM.ZK.Worker-pool2-t9618] master.RegionStates: Offlined 6060407a4aa5e23c8df09614dd3fe650 from server76,16020,1526343293931
2018-05-16 14:21:44,565 INFO [B.defaultRpcServer.handler=183,queue=3,port=16000] master.MasterRpcServices: Client=hadoop//client assign test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650.
2018-05-16 14:21:44,565 INFO [B.defaultRpcServer.handler=183,queue=3,port=16000] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OPEN, ts=1526411896480, server=server121,16020,1526398855988} to {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526451704565, server=server121,16020,1526398855988}
2018-05-16 14:21:44,566 INFO [B.defaultRpcServer.handler=183,queue=3,port=16000] master.AssignmentManager: Skip assigning test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650., it is on a dead but not processed yet server: server121,16020,1526398855988
2018-05-16 21:25:32,030 INFO [ProcedureExecutor-9] master.RegionStates: Transitioning {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526451704566, server=server121,16020,1526398855988} will be handled by ServerCrashProcedure for server121,16020,1526398855988
2018-05-16 21:25:32,489 INFO [ProcedureExecutor-9] procedure.ServerCrashProcedure: Reassigning region {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526451704566, server=server121,16020,1526398855988} and clearing zknode if exists
2018-05-16 21:25:32,938 INFO [server132,16000,1525800676562-GeneralBulkAssigner-9] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526477132893, server=server121,16020,1526398855988} to {6060407a4aa5e23c8df09614dd3fe650 state=PENDING_OPEN, ts=1526477132938, server=server92,16020,1526473170596}
2018-05-16 21:25:33,078 INFO [AM.ZK.Worker-pool2-t10955] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=PENDING_OPEN, ts=1526477132938, server=server92,16020,1526473170596} to {6060407a4aa5e23c8df09614dd3fe650 state=OPENING, ts=1526477133078, server=server92,16020,1526473170596}
2018-05-16 21:25:33,895 INFO [AM.ZK.Worker-pool2-t10956] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OPENING, ts=1526477133078, server=server92,16020,1526473170596} to {6060407a4aa5e23c8df09614dd3fe650 state=OPEN, ts=1526477133895, server=server92,16020,1526473170596}
2018-05-16 21:25:33,940 INFO [AM.ZK.Worker-pool2-t10956] master.RegionStates: Offlined 6060407a4aa5e23c8df09614dd3fe650 from server121,16020,1526398855988
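The master log above walks the region through HBase's assignment state machine (OFFLINE → PENDING_OPEN → OPENING → OPEN, and back to OFFLINE when the hosting server dies). A minimal sketch validating that sequence; the transition table here is a simplified assumption, not the full RegionStates implementation:

```python
# Simplified subset of HBase region assignment states; the real
# RegionStates class tracks more states (CLOSING, SPLITTING, ...).
ALLOWED = {
    "OFFLINE": {"PENDING_OPEN"},
    "PENDING_OPEN": {"OPENING", "OFFLINE"},
    "OPENING": {"OPEN", "OFFLINE"},
    "OPEN": {"OFFLINE"},  # e.g. when the hosting RegionServer dies
}

def is_valid_path(states):
    """Check that every consecutive pair is an allowed transition."""
    return all(b in ALLOWED.get(a, set()) for a, b in zip(states, states[1:]))

# The sequence observed for region 6060407a4aa5e23c8df09614dd3fe650:
observed = ["OPEN", "OFFLINE", "PENDING_OPEN", "OPENING", "OPEN"]
print(is_valid_path(observed))  # True
```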
RegionServer server121 log:
————————————–
2018-05-16 03:18:16,390 INFO [RS_OPEN_REGION-server121:16020-0] regionserver.HRegion: Onlined 6060407a4aa5e23c8df09614dd3fe650; next sequenceid=47468
2018-05-16 06:44:13,077 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526398855988: Failed log close in log roller
2018-05-16 06:44:13,218 INFO [StoreCloserThread-test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650.-1] regionserver.HStore: Closed cf_p
2018-05-16 06:44:13,218 INFO [RS_CLOSE_REGION-server121:16020-0] regionserver.HRegion: Closed test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650.
RegionServer server92 log:
————————————–
2018-05-16 21:25:33,837 INFO [RS_OPEN_REGION-server92:16020-1] regionserver.HRegion: Onlined 6060407a4aa5e23c8df09614dd3fe650; next sequenceid=48261
2018-05-17 04:10:44,590 FATAL [regionserver/server92/server92:16020.logRoller] regionserver.HRegionServer: ABORTING region server server92,16020,1526473170596: Failed log close in log roller
2018-05-17 04:10:44,716 INFO [StoreCloserThread-test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650.-1] regionserver.HStore: Closed cf_p
2018-05-17 04:10:44,716 INFO [RS_CLOSE_REGION-server92:16020-1] regionserver.HRegion: Closed test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650.
Direct cause 2: frequent RegionServer exits
The RegionServers have also been quite unstable recently. Taking server121 as an example, the process has exited repeatedly over the last few days.
RegionServer exit log:
————————————–
2018-05-15 00:09:26,412 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526299719028: Failed log close in log roller
2018-05-15 08:07:39,293 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526314169508: Failed log close in log roller
2018-05-15 12:58:12,082 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526342866588: Failed log close in log roller
2018-05-15 21:40:19,032 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526360301267: Failed log close in log roller
2018-05-15 23:40:47,297 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526391631046: Failed log close in log roller
2018-05-16 06:44:13,077 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526398855988: Failed log close in log roller
2018-05-16 16:16:13,501 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526449759526: Failed log close in log roller
2018-05-16 21:54:02,776 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526458576939: Failed log close in log roller
2018-05-17 03:27:20,017 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526478847956: Failed log close in log roller
2018-05-17 16:03:16,978 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526498846730: Failed log close in log roller
RegionServer startup log:
————————————–
Tue May 15 00:09:28 CST 2018 Starting regionserver on server121
Tue May 15 08:07:45 CST 2018 Starting regionserver on server121
Tue May 15 12:58:20 CST 2018 Starting regionserver on server121
Tue May 15 21:40:29 CST 2018 Starting regionserver on server121
Tue May 15 23:40:54 CST 2018 Starting regionserver on server121
Wed May 16 06:44:20 CST 2018 Starting regionserver on server121
Wed May 16 13:49:18 CST 2018 Starting regionserver on server121
Wed May 16 16:16:15 CST 2018 Starting regionserver on server121
Wed May 16 21:54:06 CST 2018 Starting regionserver on server121
Thu May 17 03:27:25 CST 2018 Starting regionserver on server121
Thu May 17 16:03:21 CST 2018 Starting regionserver on server121
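Pairing each FATAL abort with the following restart shows how quickly the process came back each time. A rough sketch (the timestamps are copied from the two logs above; the parsing is mine):

```python
from datetime import datetime

# First entries from the exit log and the startup log above.
aborts = ["2018-05-15 00:09:26", "2018-05-15 08:07:39", "2018-05-16 06:44:13"]
starts = ["2018-05-15 00:09:28", "2018-05-15 08:07:45", "2018-05-16 06:44:20"]

fmt = "%Y-%m-%d %H:%M:%S"
for a, s in zip(aborts, starts):
    down = datetime.strptime(s, fmt) - datetime.strptime(a, fmt)
    print(a, "->", s, "process down for", down.total_seconds(), "s")
```

The process itself restarts within seconds; what takes hours is the WAL split and region reassignment that follow, as Question 3 below shows.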
Question 1: why do the HBase RegionServers exit so frequently?
2018-05-16 06:44:13,077 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526398855988: Failed log close in log roller
org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988/server121%2C16020%2C1526398855988.default.1526421652184, unflushedEntries=133
at org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:907)
at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:709)
at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:148)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.regionserver.wal.FailedSyncBeforeLogCloseException: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=10739149, requesting roll of WAL
at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.checkIfSyncFailed(FSHLog.java:1650)
at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1669)
at org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:858)
... 3 more
Caused by: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=10739149, requesting roll of WAL
at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:1970)
at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1814)
at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1724)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[server82:50010], original=[server82:50010]).
The current failed datanode replacement policy is ALWAYS, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1004)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:548)
The error here means the client failed to obtain a replacement datanode; see https://www.cnblogs.com/barneywill/p/10114504.html for a walkthrough of the HDFS code.
Roughly: because there are too many racks, maxNodesPerRack evaluates to 1 (the computation is shown under Question 2), i.e. each rack may contribute at most one datanode to the pipeline. With effectively only one rack usable in the cluster, numOfAvailableNodes comes out as 0 and a NotEnoughReplicasException is thrown.
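The bottom of the stack trace points at the client-side pipeline recovery policy. If a write pipeline is down to a single datanode and no replacement can be found, one mitigation (at the cost of durability) is to relax the policy in hdfs-site.xml. The properties below are standard HDFS client options; they are shown only as an illustration, not as a recommendation over fixing the rack topology:

```xml
<!-- hdfs-site.xml: client behavior when a datanode in a write pipeline
     fails. DEFAULT only insists on a replacement when the pipeline is
     badly degraded; best-effort keeps writing even when no replacement
     datanode can be found. -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>DEFAULT</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
  <value>true</value>
</property>
```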
Question 2: why did the affected HDFS file have only one replica?
hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988/server121%2C16020%2C1526398855988.default.1526421652184
Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[server82:50010], original=[server82:50010]).
For a walkthrough of the HDFS code, see https://www.cnblogs.com/barneywill/p/10114504.html
Annotated excerpt:
int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 2;
// here: maxNodesPerRack = (2 - 1) / 5 + 2 = 2
if (maxNodesPerRack == totalNumOfReplicas) {
    maxNodesPerRack--;
}
// here: maxNodesPerRack = 1
Because the number of racks is so large, each rack may hold at most one datanode of the pipeline. Given the current rack topology, only one rack was actually usable, so the file was left with a single replica.
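Plugging in this cluster's numbers (totalNumOfReplicas=2 for the WAL, numOfRacks=5), a sketch of the computation from HDFS's default block placement policy:

```python
def max_nodes_per_rack(total_replicas, num_racks):
    # Mirrors the integer arithmetic in the excerpt above.
    m = (total_replicas - 1) // num_racks + 2
    if m == total_replicas:
        m -= 1
    return m

print(max_nodes_per_rack(2, 5))  # 1: at most one datanode per rack
```

With only one rack usable, the pipeline can never grow past one datanode, which is exactly the `current=[server82:50010], original=[server82:50010]` situation in the error above.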
Question 3: why does it take so long for the region to go from OFFLINE back to ONLINE?
Take the affected region's path from going offline at 2018-05-16 06:44:13 to coming back online at 2018-05-16 21:25:33.
HMaster log:
2018-05-16 06:44:13,437 INFO [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [server121,16020,1526398855988]
2018-05-16 06:44:13,633 INFO [ProcedureExecutor-2] procedure.ServerCrashProcedure: Start processing crashed server121,16020,1526398855988
2018-05-16 06:44:13,895 INFO [ProcedureExecutor-8] master.SplitLogManager: dead splitlog workers [server121,16020,1526398855988]
2018-05-16 06:44:13,896 INFO [ProcedureExecutor-8] master.SplitLogManager: Started splitting 3 logs in [hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting] for [server121,16020,1526398855988]
2018-05-16 06:44:13,911 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998 acquired by server94,16020,1526360475758
2018-05-16 06:44:13,919 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526424253042 acquired by server124,16020,1526398058280
2018-05-16 06:44:13,919 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526421652184 acquired by server119,16020,1526383578672
After the RegionServer died, the master started server-crash processing, beginning with splitting the WALs. There were 3 logs in total; 2 of them were split quickly:
2018-05-16 06:44:13,964 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526424253042
2018-05-16 06:46:01,875 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526421652184
But one of them kept being retried and failing, over and over:
2018-05-16 07:56:32,848 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998
2018-05-16 09:08:56,761 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998
2018-05-16 10:21:29,120 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998
2018-05-16 11:47:19,093 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998
2018-05-16 12:59:40,008 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998
2018-05-16 14:12:18,594 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998
2018-05-16 15:24:46,758 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998
2018-05-16 16:37:17,790 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998
2018-05-16 18:06:36,864 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998
2018-05-16 19:18:50,573 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998
2018-05-16 20:37:30,438 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998
2018-05-16 21:17:56,062 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998
The split of this log file was stuck in an error-and-retry loop from 06:44 until after 21:00. For the specific error, look at the log of the first split attempt:
2018-05-16 06:44:13,911 INFO [SplitLogWorker-server94:16020] coordination.ZkSplitLogWorkerCoordination: worker server94,16020,1526360475758 acquired task /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998
2018-05-16 06:44:13,932 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] wal.WALSplitter: Splitting wal: hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, length=83
2018-05-16 06:44:13,932 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] wal.WALSplitter: DistributedLogReplay = false
2018-05-16 06:44:13,932 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Recover lease on dfs file hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998
2018-05-16 06:44:13,934 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=0 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 1ms
2018-05-16 06:44:17,935 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=1 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 4002ms
2018-05-16 06:45:22,051 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=2 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 68118ms
2018-05-16 06:47:30,834 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=3 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 196901ms
2018-05-16 06:50:43,622 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=4 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 389689ms
2018-05-16 06:55:00,314 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=5 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 646381ms
2018-05-16 07:00:20,600 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=6 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 966667ms
2018-05-16 07:01:09,660 WARN [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Failed to connect to /server28:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:26354, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951
2018-05-16 07:01:59,868 WARN [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Failed to connect to /server28:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:28464, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951
2018-05-16 07:02:56,853 WARN [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Failed to connect to /server28:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:30766, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951
2018-05-16 07:03:56,370 WARN [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Failed to connect to /server28:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:33214, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951
So the reads of the HDFS file were failing; the specific error:
2018-05-16 07:01:09,660 WARN [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Failed to connect to /server28:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:26354, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951
java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:26354, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951
at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:445)
at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:410)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:819)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:697)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:354)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:299)
at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:267)
at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:853)
at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:777)
at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:298)
at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:236)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:104)
at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2018-05-16 07:01:09,661 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Could not obtain BP-436366437-server131-1493982655699:blk_1361314257_288238951 from any node: java.io.IOException: No live nodes contain block BP-436366437-server131-1493982655699:blk_1361314257_288238951 after checking nodes = [server127:50010, server28:50010], ignoredNodes = null No live nodes contain current block Block locations: server127:50010 server28:50010 Dead nodes: server127:50010 server28:50010. Will get new block locations from namenode and retry...
2018-05-16 07:01:09,661 WARN [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 1196.0361088580376 msec.
2018-05-16 07:01:14,861 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] ipc.Client: Retrying connect to server: server127/server127:50020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
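The lease-recovery attempts in the FSHDFSUtils lines above come at 4s, 68s, 197s, 390s, 646s, 967s, i.e. the gap between attempts grows by roughly 64s each time. A sketch reproducing that schedule, under the assumption (inferred from the log, not from the source) that the first pause is ~4s (hbase.lease.recovery.first.pause, default 4000ms) and each subsequent gap grows by ~64s (hbase.lease.recovery.dfs.timeout, default 64000ms):

```python
# Reconstructing the retry schedule seen in the "Failed to recover
# lease, attempt=N ... after Xms" log lines above.
FIRST_PAUSE = 4   # seconds, assumed from the attempt=0/1 timing
DFS_TIMEOUT = 64  # seconds, assumed from the growth of the gaps

def elapsed_at_attempt(n):
    """Approximate seconds elapsed when attempt n is logged."""
    if n == 0:
        return 0
    # Gap before attempt k (k >= 2) is about (k - 1) * DFS_TIMEOUT.
    return FIRST_PAUSE + DFS_TIMEOUT * (n - 1) * n // 2

for n in range(7):
    print(n, elapsed_at_attempt(n))
```

The quadratic growth explains why a lease that cannot be recovered keeps the split stuck for hours rather than failing fast.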
The split finally completed at 2018-05-16 21:25:31:
2018-05-16 21:25:31,858 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998 entered state: DONE server76,16020,1526424280100
2018-05-16 21:25:31,868 INFO [main-EventThread] wal.WALSplitter: Archived processed log hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 to hdfs://hdfs_name/hbase/oldWALs/server121%2C16020%2C1526398855988.default.1526418989998
2018-05-16 21:25:31,869 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998
2018-05-16 21:25:31,870 INFO [ProcedureExecutor-11] master.SplitLogManager: finished splitting (more than or equal to) 83 bytes in 1 log files in [hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting] in 455631ms
Many regions were then assigned, including the affected one, which came online on RegionServer server92 (see the logs earlier). For the assignment code path, see:
org.apache.hadoop.hbase.master.AssignmentManager.assign
Summary
At the root, all of these problems trace back to the HDFS rack configuration: some servers had rack information configured and some did not, which produced an unusually large number of racks with servers spread very unevenly across them. Under the HDFS placement policy this works out to at most one datanode per rack, and since the datanodes on many racks had no free space left, a lot of files ended up with a single replica. A single replica is a single point of failure: as soon as one datanode misbehaves, large numbers of files can become unavailable, which then cascades into further problems such as the HBase region above staying offline for many hours.
The problem surfaced in Hive, its direct cause was in HBase, its root cause was in HDFS, and at bottom it was a misconfigured rack topology.
The rack-related configuration in HDFS:
<property>
  <name>net.topology.script.file.name</name>
  <value>/export/App/hadoop-2.6.1/etc/hadoop/rack.py</value>
</property>
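The script named above maps datanode hosts to racks; any host missing from the mapping silently lands in /default-rack, which is one way a cluster ends up with many unevenly populated racks. A hypothetical minimal topology script in the spirit of rack.py (the host-to-rack table here is invented purely for illustration):

```python
#!/usr/bin/env python
# Hypothetical rack topology script: Hadoop invokes it with one or more
# hostnames/IPs as arguments and expects one rack path per argument on stdout.
import sys

# Invented mapping for illustration; in a real deployment every datanode
# must appear here, or it silently falls into the default rack.
RACKS = {
    "server82": "/rack1",
    "server121": "/rack1",
    "server92": "/rack2",
}

def rack_of(host):
    return RACKS.get(host, "/default-rack")

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_of(host))
```

Auditing this file so that every datanode resolves to a real, evenly populated rack is the actual fix for the replica and pipeline problems above.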