This started with a pre-split table whose regions split anyway. Three questions to answer:

- When did the split happen? Was it during the HFile bulk load?
- Why did it happen? Was a single HFile too large?
- How do we avoid it in the future?

The analysis below is based on the hbase branch-1.2 source.
1. Pinpointing when it happened

Comparing the time the HFiles were bulk-loaded with the time the split occurred makes it clear that the problem did not happen during the load. The regionserver log shows the following lines:
```
2018-12-04 01:43:50,505 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Completed compaction xxx
2018-12-04 01:43:50,571 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region xxx
2018-12-04 01:43:51,252 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Preparing to split 1 storefiles for region xxx
2018-12-04 01:43:52,669 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, hbase:meta updated, and report to master. Parent= xxx
```
As the log shows, completing a compaction triggered a split check, the check passed, and the split went ahead. The relevant source is below.

After receiving a compaction request, a `CompactionRunner` is submitted to a thread pool to run the actual compaction:
```java
// org.apache.hadoop.hbase.regionserver.CompactSplitThread
private synchronized CompactionRequest requestCompactionInternal(final Region r, final Store s,
    final String why, int priority, CompactionRequest request, boolean selectNow, User user) {
  // ....
  ThreadPoolExecutor pool = (selectNow && s.throttleCompaction(compaction.getRequest().getSize()))
      ? longCompactions : shortCompactions;
  pool.execute(new CompactionRunner(s, r, compaction, pool, user));
  // ....
}
```
`CompactionRunner.doCompaction` runs the compaction, then performs the split check and, if needed, requests the split:
```java
// org.apache.hadoop.hbase.regionserver.CompactSplitThread.CompactionRunner
private void doCompaction(User user) {
  // ....
  long start = EnvironmentEdgeManager.currentTime();
  boolean completed =
      region.compact(compaction, store, compactionThroughputController, user);
  long now = EnvironmentEdgeManager.currentTime();
  LOG.info(((completed) ? "Completed" : "Aborted") + " compaction: " +
      this + "; duration=" + StringUtils.formatTimeDiff(now, start));
  if (completed) {
    // degenerate case: blocked regions require recursive enqueues
    if (store.getCompactPriority() <= 0) {
      requestSystemCompaction(region, store, "Recursive enqueue");
    } else {
      // see if the compaction has caused us to exceed max region size
      requestSplit(region);
    }
  }
  // ....
}
```
The compaction thread then hands the split off to a separate thread pool:
```java
// org.apache.hadoop.hbase.regionserver.CompactSplitThread
public synchronized void requestSplit(final Region r, byte[] midKey, User user) {
  // ...
  this.splits.execute(new SplitRequest(r, midKey, this.server, user));
  // ...
}
```
2. Pinpointing why it happened

We now know the split was triggered right after a compaction. The next question is why the split check passed:
```java
// org.apache.hadoop.hbase.regionserver.CompactSplitThread
public synchronized boolean requestSplit(final Region r) {
  // don't split regions that are blocking
  if (shouldSplitRegion() && ((HRegion)r).getCompactPriority() >= Store.PRIORITY_USER) {
    byte[] midKey = ((HRegion)r).checkSplit();
    if (midKey != null) {
      requestSplit(r, midKey);
      return true;
    }
  }
  return false;
}
```
Reaching the inner `requestSplit(r, midKey)` requires two conditions to hold:

- `shouldSplitRegion() && ((HRegion)r).getCompactPriority() >= Store.PRIORITY_USER`
- `midKey != null`, where `midKey = ((HRegion)r).checkSplit()`

Since the split did happen, both conditions must have been satisfied.
1. Checking `shouldSplitRegion`
```java
// org.apache.hadoop.hbase.regionserver.CompactSplitThread
private boolean shouldSplitRegion() {
  if (server.getNumberOfOnlineRegions() > 0.9 * regionSplitLimit) {
    LOG.warn("Total number of regions is approaching the upper limit " + regionSplitLimit + ". "
        + "Please consider taking a look at http://hbase.apache.org/book.html#ops.regionmgt");
  }
  return (regionSplitLimit > server.getNumberOfOnlineRegions());
}
```
Two variables are involved:

- `server.getNumberOfOnlineRegions()`: the number of regions currently online on this regionserver
- `regionSplitLimit`: the per-regionserver limit on the number of regions, beyond which no more splitting should take place, read as `regionSplitLimit = conf.getInt(REGION_SERVER_REGION_SPLIT_LIMIT, DEFAULT_REGION_SERVER_REGION_SPLIT_LIMIT)`, i.e. `hbase.regionserver.regionSplitLimit`

Our cluster sets this parameter to a very large value, so this check passed.
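For reference, a minimal sketch that reads the same key from a client-side configuration; the 1000 fallback is what I believe `DEFAULT_REGION_SERVER_REGION_SPLIT_LIMIT` is on branch-1.2, so treat it as an assumption:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PrintSplitLimit {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Same key CompactSplitThread reads; 1000 is the assumed branch-1.2 default
    int regionSplitLimit = conf.getInt("hbase.regionserver.regionSplitLimit", 1000);
    System.out.println("regionSplitLimit = " + regionSplitLimit);
  }
}
```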
2. Checking `checkSplit` and `midKey`
```java
// org.apache.hadoop.hbase.regionserver.HRegion
public byte[] checkSplit() {
  // Can't split META
  if (this.getRegionInfo().isMetaTable() ||
      TableName.NAMESPACE_TABLE_NAME.equals(this.getRegionInfo().getTable())) {
    if (shouldForceSplit()) {
      LOG.warn("Cannot split meta region in HBase 0.20 and above");
    }
    return null;
  }

  // Can't split region which is in recovering state
  if (this.isRecovering()) {
    LOG.info("Cannot split region " + this.getRegionInfo().getEncodedName() + " in recovery.");
    return null;
  }

  if (!splitPolicy.shouldSplit()) {
    return null;
  }

  byte[] ret = splitPolicy.getSplitPoint();
  if (ret != null) {
    try {
      checkRow(ret, "calculated split");
    } catch (IOException e) {
      LOG.error("Ignoring invalid split", e);
      return null;
    }
  }
  return ret;
}
```
The crux is `splitPolicy.shouldSplit()`. Because no split policy was specified when the table was created, the default one applies: `IncreasingToUpperBoundRegionSplitPolicy`.
```java
// org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy
@Override
protected boolean shouldSplit() {
  boolean force = region.shouldForceSplit();
  boolean foundABigStore = false;
  // Get count of regions that have the same common table as this.region
  int tableRegionsCount = getCountOfCommonTableRegions();
  // Get size to check
  long sizeToCheck = getSizeToCheck(tableRegionsCount);

  for (Store store : region.getStores()) {
    // If any of the stores is unable to split (eg they contain reference files)
    // then don't split
    if (!store.canSplit()) {
      return false;
    }

    // Mark if any store is big enough
    long size = store.getSize();
    if (size > sizeToCheck) {
      LOG.debug("ShouldSplit because " + store.getColumnFamilyName() + " size=" + size
          + ", sizeToCheck=" + sizeToCheck + ", regionsWithCommonTable="
          + tableRegionsCount);
      foundABigStore = true;
    }
  }

  return foundABigStore | force;
}
```
This iterates over every store in the region to decide whether to split, using three variables:

- `tableRegionsCount`: the number of regions of this table on the current regionserver
- `sizeToCheck`: a size threshold that depends not only on the configured maximum file size (`hbase.hregion.max.filesize`) but also on `initialSize` and `tableRegionsCount`
- `size`: the total size of the current store (the sum of its storefiles)
```java
// org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy
protected long getSizeToCheck(final int tableRegionsCount) {
  // safety check for 100 to avoid numerical overflow in extreme cases
  return tableRegionsCount == 0 || tableRegionsCount > 100
      ? getDesiredMaxFileSize()
      : Math.min(getDesiredMaxFileSize(),
          initialSize * tableRegionsCount * tableRegionsCount * tableRegionsCount);
}
```
This method deserves close attention. Effectively it computes `sizeToCheck = min(getDesiredMaxFileSize(), initialSize * tableRegionsCount^3)`:

- `getDesiredMaxFileSize()`: normally `hbase.hregion.max.filesize`, default 10g
- `initialSize`: normally 2 x `hbase.hregion.memstore.flush.size`; the flush size defaults to 128M and is 256M in our environment, so `initialSize` = 512M here
- `tableRegionsCount`: the tricky one: its cube is 1 when `tableRegionsCount` = 1, but grows very fast once `tableRegionsCount` > 1 (8 at two regions, 27 at three)
Plugging in our cluster's parameters at the time:

- `getDesiredMaxFileSize()`: 10g
- `initialSize`: 512M
- `tableRegionsCount`: 2

That works out to `sizeToCheck` = min(10g, 512M x 2^3) = 4g, while the store size seen via the hdfs command was only about 1.4g, nowhere near the split threshold.
But I had made a mistake: the `tableRegionsCount` I used reflected the state after the split, when it should have been the state before it. A look at the HBase UI revealed the problem:
(figure: hbase_region_split_vs_nonsplit_v2.jpg)
As the UI shows, regionserver B, where the split happened, must previously have held only one region of this table. Redoing the math with `tableRegionsCount` = 1 gives `sizeToCheck` = min(10g, 512M x 1^3) = 512M, which a 1.4g store easily exceeds, so the region split.

Once a regionserver holds two or more regions of the table, splitting generally stops, because the threshold is then at least 4g. The sketch below runs through the same arithmetic.
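To sanity-check both numbers, here is a minimal standalone sketch of the `getSizeToCheck` arithmetic under our settings. The class name and constants (10g max file size, 512M `initialSize`) are assumptions for illustration; it mirrors the branch-1.2 logic quoted above rather than calling any HBase API:

```java
public class SizeToCheckSketch {

  // hbase.hregion.max.filesize = 10g (our cluster's setting)
  static final long DESIRED_MAX_FILE_SIZE = 10L * 1024 * 1024 * 1024;
  // initialSize = 2 x hbase.hregion.memstore.flush.size = 2 x 256M
  static final long INITIAL_SIZE = 512L * 1024 * 1024;

  // Mirrors IncreasingToUpperBoundRegionSplitPolicy.getSizeToCheck (branch-1.2)
  static long sizeToCheck(int tableRegionsCount) {
    return tableRegionsCount == 0 || tableRegionsCount > 100
        ? DESIRED_MAX_FILE_SIZE
        : Math.min(DESIRED_MAX_FILE_SIZE,
            INITIAL_SIZE * tableRegionsCount * tableRegionsCount * tableRegionsCount);
  }

  public static void main(String[] args) {
    for (int count = 1; count <= 4; count++) {
      System.out.printf("tableRegionsCount=%d -> sizeToCheck=%dM%n",
          count, sizeToCheck(count) / (1024 * 1024));
    }
    // Output: 512M, 4096M, 10240M (capped), 10240M (capped)
  }
}
```

With one region of the table on the server, a 1.4g store sails past the 512M threshold; with two or more, the 4g threshold would have kept it from splitting.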
3. How to avoid this in the future

The most worry-free option is to set the split policy to `DisabledRegionSplitPolicy` when the table is created; it works best in combination with pre-splitting:

```
alter 'table', {METADATA => {'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy'}}
```
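The same setup can be done from the Java client when creating the table; a sketch against the 1.x client API, with a hypothetical table name, column family, and split keys:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePreSplitTable {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("my_table")); // hypothetical name
      desc.addFamily(new HColumnDescriptor("cf")); // hypothetical family
      // Disable automatic splitting; rely entirely on the pre-split boundaries
      desc.setRegionSplitPolicyClassName(
          "org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy");
      // Illustrative split keys: 4 regions with boundaries at "1", "2", "3"
      byte[][] splitKeys = { Bytes.toBytes("1"), Bytes.toBytes("2"), Bytes.toBytes("3") };
      admin.createTable(desc, splitKeys);
    }
  }
}
```

The point of the combination is that the pre-split boundaries fix the region layout once and for all, and `DisabledRegionSplitPolicy` keeps compactions from ever second-guessing it.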
If the table receives a large daily increment of data, that approach does not fit; in that case, use the split analysis above to choose the number of regions and the split policy together. For example, keeping the default `IncreasingToUpperBoundRegionSplitPolicy` but pre-splitting so that every regionserver starts with at least two regions of the table already raises the effective threshold to 4g in our setup.
ref: https://issues.apache.org/jira/browse/HBASE-16076?attachmentOrder=asc