Custom Grouping Classes in MapReduce (Part 3)

January 19, 2024. Source: MapReduce

The Job class

/**
 * Define the comparator that controls which keys are grouped together
 * for a single call to
 * {@link Reducer#reduce(Object, Iterable,
 * org.apache.hadoop.mapreduce.Reducer.Context)}
 * @param cls the raw comparator to use
 * @throws IllegalStateException if the job is submitted
 * @see #setCombinerKeyGroupingComparatorClass(Class)
 */
public void setGroupingComparatorClass(Class<? extends RawComparator> cls
                                       ) throws IllegalStateException {
  ensureState(JobState.DEFINE);
  conf.setOutputValueGroupingComparator(cls);
}

The JobConf class
The setOutputValueGroupingComparator method in JobConf:

/**
 * Set the user defined {@link RawComparator} comparator for
 * grouping keys in the input to the reduce.
 *
 * This comparator should be provided if the equivalence rules for keys
 * for sorting the intermediates are different from those for grouping keys
 * before each call to
 * {@link Reducer#reduce(Object, java.util.Iterator, OutputCollector, Reporter)}.
 *
 * For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
 * in a single call to the reduce function if K1 and K2 compare as equal.
 *
 * Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
 * how keys are sorted, this can be used in conjunction to simulate
 * secondary sort on values.
 *
 * Note: This is not a guarantee of the reduce sort being
 * stable in any sense. (In any case, with the order of available
 * map-outputs to the reduce being non-deterministic, it wouldn't make
 * that much sense.)
 *
 * @param theClass the comparator class to be used for grouping keys.
 *                 It should implement <code>RawComparator</code>.
 * @see #setOutputKeyComparatorClass(Class)
 * @see #setCombinerKeyGroupingComparator(Class)
 */
public void setOutputValueGroupingComparator(
    Class<? extends RawComparator> theClass) {
  setClass(JobContext.GROUP_COMPARATOR_CLASS,
           theClass, RawComparator.class);
}

Press Ctrl+O and find getOutputValueGroupingComparator:

/**
 * Get the user defined {@link WritableComparable} comparator for
 * grouping keys of inputs to the reduce.
 *
 * @return comparator set by the user for grouping values.
 * @see #setOutputValueGroupingComparator(Class) for details.
 */
public RawComparator getOutputValueGroupingComparator() {
  Class<? extends RawComparator> theClass = getClass(
      JobContext.GROUP_COMPARATOR_CLASS, null, RawComparator.class);
  if (theClass == null) {
    return getOutputKeyComparator();
  }
  return ReflectionUtils.newInstance(theClass, this);
}
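The fallback in this getter can be sketched without Hadoop. The following is a minimal stand-in, not the real JobConf: the class ComparatorFallbackSketch, its method names, and the "name#number" key format are all assumptions made for illustration, and plain java.util.Comparator replaces RawComparator.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for JobConf: stores optional comparators under a name.
class ComparatorFallbackSketch {
    private final Map<String, Comparator<String>> settings = new HashMap<>();

    void setSortComparator(Comparator<String> c) { settings.put("sort", c); }

    void setGroupingComparator(Comparator<String> c) { settings.put("group", c); }

    // Plays the role of getOutputKeyComparator: custom sort comparator,
    // otherwise the natural order of the key type.
    Comparator<String> getSortComparator() {
        Comparator<String> c = settings.get("sort");
        return c != null ? c : Comparator.<String>naturalOrder();
    }

    // Plays the role of getOutputValueGroupingComparator: custom grouping
    // comparator, otherwise whatever the sort comparator is.
    Comparator<String> getGroupingComparator() {
        Comparator<String> c = settings.get("group");
        return c != null ? c : getSortComparator();
    }

    // Illustrative grouping rule: keys look like "name#number" (an assumption)
    // and only the part before '#' decides the group.
    static Comparator<String> byPrefix() {
        return Comparator.comparing((String k) -> k.substring(0, k.indexOf('#')));
    }

    boolean sameGroup(String a, String b) {
        return getGroupingComparator().compare(a, b) == 0;
    }

    public static void main(String[] args) {
        ComparatorFallbackSketch conf = new ComparatorFallbackSketch();
        System.out.println(conf.sameGroup("a#1", "a#2")); // prints false
        conf.setGroupingComparator(byPrefix());
        System.out.println(conf.sameGroup("a#1", "a#2")); // prints true
    }
}
```

With nothing set, two keys land in the same group only when the sort rule considers them equal; once a grouping rule is registered, it alone decides.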

So who calls getOutputValueGroupingComparator?
The ReduceTask class

In ReduceTask:
(No comparator field is declared here; the return value is simply received into a local variable.)

RawComparator comparator = job.getOutputValueGroupingComparator();

The comparator obtained here is our custom grouping class xxxG, provided one was set on the job.
Next, search for where comparator is used:

if (useNewApi) {
  runNewReducer(job, umbilical, reporter, rIter, comparator,
                keyClass, valueClass);
} else {
  runOldReducer(job, umbilical, reporter, rIter, comparator,
                keyClass, valueClass);
}

There are two branches because Hadoop has an old and a new API.
For the new API, look at the runNewReducer method:

private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runNewReducer(JobConf job,
                   final TaskUmbilicalProtocol umbilical,
                   final TaskReporter reporter,
                   RawKeyValueIterator rIter,
                   RawComparator<INKEY> comparator,
                   Class<INKEY> keyClass,
                   Class<INVALUE> valueClass
                   ) throws IOException, InterruptedException,
                            ClassNotFoundException {
  // wrap value iterator to report progress.
  final RawKeyValueIterator rawIter = rIter;
  rIter = new RawKeyValueIterator() {
    public void close() throws IOException {
      rawIter.close();
    }
    public DataInputBuffer getKey() throws IOException {
      return rawIter.getKey();
    }
    public Progress getProgress() {
      return rawIter.getProgress();
    }
    public DataInputBuffer getValue() throws IOException {
      return rawIter.getValue();
    }
    public boolean next() throws IOException {
      boolean ret = rawIter.next();
      reporter.setProgress(rawIter.getProgress().getProgress());
      return ret;
    }
  };
  // make a task context so we can get the classes
  org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
    new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
        getTaskID(), reporter);
  // make a reducer
  org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
    (org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>)
      ReflectionUtils.newInstance(taskContext.getReducerClass(), job);
  org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> trackedRW =
    new NewTrackingRecordWriter<OUTKEY, OUTVALUE>(this, taskContext);
  job.setBoolean("mapred.skip.on", isSkipping());
  job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
  org.apache.hadoop.mapreduce.Reducer.Context
       reducerContext = createReduceContext(reducer, job, getTaskID(),
                                             rIter, reduceInputKeyCounter,
                                             reduceInputValueCounter,
                                             trackedRW,
                                             committer,
                                             reporter, comparator, keyClass,
                                             valueClass);
  try {
    reducer.run(reducerContext);
  } finally {
    trackedRW.close(reducerContext);
  }
}

runNewReducer receives the comparator parameter and passes it on to createReduceContext.
The Task class
The createReduceContext method in Task:

@SuppressWarnings("unchecked")
protected static <INKEY,INVALUE,OUTKEY,OUTVALUE>
org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
createReduceContext(org.apache.hadoop.mapreduce.Reducer
                      <INKEY,INVALUE,OUTKEY,OUTVALUE> reducer,
                    Configuration job,
                    org.apache.hadoop.mapreduce.TaskAttemptID taskId,
                    RawKeyValueIterator rIter,
                    org.apache.hadoop.mapreduce.Counter inputKeyCounter,
                    org.apache.hadoop.mapreduce.Counter inputValueCounter,
                    org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> output,
                    org.apache.hadoop.mapreduce.OutputCommitter committer,
                    org.apache.hadoop.mapreduce.StatusReporter reporter,
                    RawComparator<INKEY> comparator,
                    Class<INKEY> keyClass, Class<INVALUE> valueClass
                    ) throws IOException, InterruptedException {
  org.apache.hadoop.mapreduce.ReduceContext<INKEY, INVALUE, OUTKEY, OUTVALUE>
  reduceContext =
    new ReduceContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, taskId,
                                                            rIter,
                                                            inputKeyCounter,
                                                            inputValueCounter,
                                                            output,
                                                            committer,
                                                            reporter,
                                                            comparator,
                                                            keyClass,
                                                            valueClass);
  // ... (the rest of the method is not shown in this excerpt)

The ReduceContextImpl class

In ReduceContextImpl, find the constructor:

public ReduceContextImpl(Configuration conf, TaskAttemptID taskid,
                         RawKeyValueIterator input,
                         Counter inputKeyCounter,
                         Counter inputValueCounter,
                         RecordWriter<KEYOUT,VALUEOUT> output,
                         OutputCommitter committer,
                         StatusReporter reporter,
                         RawComparator<KEYIN> comparator,
                         Class<KEYIN> keyClass,
                         Class<VALUEIN> valueClass
                         ) throws InterruptedException, IOException {
  super(conf, taskid, output, committer, reporter);
  this.input = input;
  this.inputKeyCounter = inputKeyCounter;
  this.inputValueCounter = inputValueCounter;
  this.comparator = comparator;
  this.serializationFactory = new SerializationFactory(conf);
  this.keyDeserializer = serializationFactory.getDeserializer(keyClass);
  this.keyDeserializer.open(buffer);
  this.valueDeserializer = serializationFactory.getDeserializer(valueClass);
  this.valueDeserializer.open(buffer);
  hasMore = input.next();
  this.keyClass = keyClass;
  this.valueClass = valueClass;
  this.conf = conf;
  this.taskid = taskid;
}

Search for comparator within the ReduceContextImpl class:

/**
 * Advance to the next key/value pair.
 */
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
  if (!hasMore) {
    key = null;
    value = null;
    return false;
  }
  firstValue = !nextKeyIsSame;
  DataInputBuffer nextKey = input.getKey();
  currentRawKey.set(nextKey.getData(), nextKey.getPosition(),
                    nextKey.getLength() - nextKey.getPosition());
  buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
  key = keyDeserializer.deserialize(key);
  DataInputBuffer nextVal = input.getValue();
  buffer.reset(nextVal.getData(), nextVal.getPosition(), nextVal.getLength()
      - nextVal.getPosition());
  value = valueDeserializer.deserialize(value);
  currentKeyLength = nextKey.getLength() - nextKey.getPosition();
  currentValueLength = nextVal.getLength() - nextVal.getPosition();
  if (isMarked) {
    backupStore.write(nextKey, nextVal);
  }
  hasMore = input.next();
  if (hasMore) {
    nextKey = input.getKey();
    nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                                       currentRawKey.getLength(),
                                       nextKey.getData(),
                                       nextKey.getPosition(),
                                       nextKey.getLength() - nextKey.getPosition()
                                       ) == 0;
  } else {
    nextKeyIsSame = false;
  }
  inputValueCounter.increment(1);
  return true;
}
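The look-ahead in nextKeyValue can be reproduced in a few lines. This is only a rough sketch under stated assumptions: keys are plain strings of the form "name#number" instead of serialized bytes, java.util.Comparator stands in for RawComparator, and the class LookAheadGroupingSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

class LookAheadGroupingSketch {
    // Splits an already sorted key stream into groups by comparing each key
    // with the key that follows it, as nextKeyValue does with nextKeyIsSame.
    static List<List<String>> group(List<String> sortedKeys, Comparator<String> grouping) {
        List<List<String>> groups = new ArrayList<>();
        Iterator<String> input = sortedKeys.iterator();
        boolean hasMore = input.hasNext();
        String next = hasMore ? input.next() : null;
        List<String> current = new ArrayList<>();
        while (hasMore) {
            String key = next;
            current.add(key);
            hasMore = input.hasNext();
            boolean nextKeyIsSame = false;
            if (hasMore) {
                next = input.next();
                nextKeyIsSame = grouping.compare(key, next) == 0;
            }
            if (!nextKeyIsSame) {   // group boundary: one reduce() call ends here
                groups.add(current);
                current = new ArrayList<>();
            }
        }
        return groups;
    }

    // Illustrative grouping rule: only the part before '#' matters (an assumption).
    static Comparator<String> byPrefix() {
        return Comparator.comparing((String k) -> k.substring(0, k.indexOf('#')));
    }

    public static void main(String[] args) {
        System.out.println(group(Arrays.asList("a#1", "a#2", "b#1"), byPrefix()));
        // prints [[a#1, a#2], [b#1]]
    }
}
```

Only neighbouring keys are ever compared, so keys that the grouping rule treats as equal but that are not adjacent in the sorted stream end up in separate groups. The sort order therefore has to keep each group contiguous.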

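This compare call goes to the method declared in the RawComparator interface:
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
Common key types such as Text and IntWritable come with comparators that implement this method.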
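To make the six parameters concrete, here is a small sketch of a byte-range comparison with the same parameter list: each key is described by an array, a start offset, and a length, so two keys can be compared in place without deserializing them. The class RawCompareSketch is a hypothetical name, and the sketch works on plain bytes only; any length prefix or other framing that a real serialized key may carry is not handled here.

```java
import java.nio.charset.StandardCharsets;

class RawCompareSketch {
    // Lexicographic comparison of two byte ranges, treating bytes as unsigned.
    // (b1, s1, l1) and (b2, s2, l2) are array, start offset, and length.
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        // All shared bytes are equal: the shorter range sorts first.
        return l1 - l2;
    }

    public static void main(String[] args) {
        byte[] buffer = "xxhelloyy".getBytes(StandardCharsets.UTF_8);
        byte[] other = "hello".getBytes(StandardCharsets.UTF_8);
        // Compare the 5 bytes starting at offset 2 of buffer with all of other.
        System.out.println(compareBytes(buffer, 2, 5, other, 0, 5)); // prints 0
    }
}
```

This is why nextKeyValue passes getPosition() and getLength() - getPosition(): the key of interest is a slice of a larger buffer.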
(1) No grouping comparator set

if (theClass == null) {
  return getOutputKeyComparator();
}

/**
 * Get the {@link RawComparator} comparator used to compare keys.
 *
 * @return the {@link RawComparator} comparator used to compare keys.
 */
public RawComparator getOutputKeyComparator() {
  Class<? extends RawComparator> theClass = getClass(
      JobContext.KEY_COMPARATOR, null, RawComparator.class);
  if (theClass != null)
    return ReflectionUtils.newInstance(theClass, this);
  return WritableComparator.get(getMapOutputKeyClass().asSubclass(WritableComparable.class), this);
}

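Without job.setGroupingComparatorClass(xxxG.class), the default applies: grouping falls back to the sort comparator. If no sort comparator was set either, the comparison defined for the map output key class is used, for example the one for Text.
So by default, grouping is done by the sort comparator (more precisely, by its compare method).
(There are two kinds of comparator here:
1 the compareTo method of the key's class
2 the compare method of a custom comparator class
)
Whichever of 1 or 2 we use, grouping and sorting then follow the same rule, which is not what we want when keys should be sorted on the whole key but grouped on only part of it.
So we usually define a custom grouping class alongside the custom comparator class.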
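The effect of using different rules for sorting and grouping can be simulated without Hadoop. The sketch below is only an illustration under assumptions: keys are strings of the form "name#score", sorting is by name and then by score in descending order, and grouping looks at the name only. SecondarySortSketch, SORT, GROUP, and reduceCalls are hypothetical names, and plain java.util.Comparator replaces RawComparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondarySortSketch {
    static String name(String key) { return key.substring(0, key.indexOf('#')); }

    static int score(String key) { return Integer.parseInt(key.substring(key.indexOf('#') + 1)); }

    // Sort rule: whole key (name ascending, then score descending).
    static final Comparator<String> SORT =
        Comparator.comparing((String k) -> name(k))
                  .thenComparing(Comparator.comparingInt((String k) -> score(k)).reversed());

    // Grouping rule: name only.
    static final Comparator<String> GROUP = Comparator.comparing((String k) -> name(k));

    // Returns, per simulated reduce call, the group name and the scores in arrival order.
    static Map<String, List<Integer>> reduceCalls(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> calls = new LinkedHashMap<>();
        String previous = null;
        List<Integer> current = null;
        for (String key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                current = new ArrayList<>();   // a new group starts here
                calls.put(name(key), current);
            }
            current.add(score(key));
            previous = key;
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(reduceCalls(Arrays.asList("bob#70", "alice#90", "bob#85", "alice#60")));
        // prints {alice=[90, 60], bob=[85, 70]}
    }
}
```

Each name gets a single simulated reduce call, and the scores inside it arrive already sorted. If GROUP were the same as SORT, every distinct "name#score" key would form its own group.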
(2) Grouping comparator set

return ReflectionUtils.newInstance(theClass, this);

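If we call job.setGroupingComparatorClass(xxxG.class), an instance of our custom grouping class xxxG is created.
xxxG has to implement RawComparator; the usual way is to extend WritableComparator and override the compare method.
For example:
public static class SelfGroupComparator extends WritableComparator {
  // override compare here
}
Overriding compare is all that is needed.
The call path is then the same as the compare call traced above.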
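As a rough idea of what such an overridden compare might do at the byte level, the sketch below groups a composite key by its first field only. It does not use any Hadoop classes: FirstFieldGroupingSketch is a hypothetical name, and the key layout (two big-endian 4-byte ints, a group id followed by an ordering field) is an assumption chosen for illustration. Only the parameter list mirrors RawComparator.compare.

```java
import java.nio.ByteBuffer;

class FirstFieldGroupingSketch {
    // Assumed key layout: groupId (4 bytes, big-endian) followed by order (4 bytes).
    static byte[] serialize(int groupId, int order) {
        return ByteBuffer.allocate(8).putInt(groupId).putInt(order).array();
    }

    // Reads a big-endian int starting at the given offset.
    static int readInt(byte[] b, int start) {
        return ByteBuffer.wrap(b, start, 4).getInt();
    }

    // Same parameter list as RawComparator.compare, but only the first field
    // is inspected, so keys that differ only in 'order' compare as equal.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    public static void main(String[] args) {
        byte[] k1 = serialize(7, 1);
        byte[] k2 = serialize(7, 99);
        byte[] k3 = serialize(8, 1);
        System.out.println(compare(k1, 0, k1.length, k2, 0, k2.length) == 0); // prints true
        System.out.println(compare(k1, 0, k1.length, k3, 0, k3.length) == 0); // prints false
    }
}
```

In a real job the sort comparator would still order keys by both fields, while a grouping rule like this one decides which neighbouring keys share a reduce call.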

I recommend approach 2.
Tip: Alt+Left Arrow returns to the previous location in the source you were viewing.


    Original author: MapReduce
    Original URL: https://www.cnblogs.com/xuanlvshu/p/5748428.html