A Source Code Walkthrough of Hive's GenericUDF Function DateDiff

Preface

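An earlier post introduced the two ways to implement a Hive UDF. Of the two, GenericUDF is the more complex, so to understand it better I read through the source of one of Hive's built-in functions and took the notes below. I am new to this, so corrections are welcome.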

Source Code Walkthrough

public class GenericUDFDateDiff extends GenericUDF{
    //import java.text.SimpleDateFormat; date formatter used to parse string arguments
    private transient SimpleDateFormat formatter=new SimpleDateFormat("yyyy-MM-dd");
    //import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters.Converter;
    //converters for the two arguments; each one turns the raw input value into the writable type used later
    private transient Converter inputConverter1;
    private transient Converter inputConverter2;
    //import org.apache.hadoop.io.IntWritable; holder for the return value. IntWritable is Hadoop's Writable wrapper around a Java int
    private IntWritable output=new IntWritable();
    //import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory;
    //primitive categories of the two arguments (the primitive data types Hive supports)
    private transient PrimitiveCategory inputType1;
    private transient PrimitiveCategory inputType2;
    //reusable object that the private evaluate method fills in
    private IntWritable result=new IntWritable();

    public GenericUDFDateDiff(){
        //import java.util.TimeZone;
        this.formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
    }
    //the methods shown in the following sections also belong to this class
}
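
The constructor pins the formatter to UTC. One plausible reason (my own reading, not something stated in the Hive source) is that UTC has no daylight saving transitions, so every calendar day is exactly 86,400,000 ms long and dividing a millisecond difference by that constant gives a whole number of days. The sketch below uses only the JDK; the class name UtcFormatterDemo and the method midnightGapMillis are made up for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Hypothetical demo class, not part of Hive.
final class UtcFormatterDemo {
    static final long MILLIS_PER_DAY = 86400000L;

    // Parses two yyyy-MM-dd strings in the given zone and returns end - start in milliseconds.
    static long midnightGapMillis(String zoneId, String start, String end) {
        SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd");
        formatter.setTimeZone(TimeZone.getTimeZone(zoneId));
        try {
            return formatter.parse(end).getTime() - formatter.parse(start).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("not a yyyy-MM-dd date", e);
        }
    }

    public static void main(String[] args) {
        // US daylight saving time began on 2020-03-08, so that local day had only 23 hours.
        long utc = midnightGapMillis("UTC", "2020-03-08", "2020-03-09");
        long newYork = midnightGapMillis("America/New_York", "2020-03-08", "2020-03-09");
        System.out.println(utc / MILLIS_PER_DAY);      // 1
        System.out.println(newYork / MILLIS_PER_DAY);  // 0
    }
}
```

With the New York zone the gap between the two midnights is 23 hours, so the integer division loses a day; with UTC it does not.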

The class extends GenericUDF and declares the fields that the later methods rely on. Next comes the overridden initialize method:

    //import org.apache.hadoop.hive.ql.exec.UDFArgumentException
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException{
        //import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
        //check the argument count and throw unless exactly two arguments are passed
        if(arguments.length!=2){
            throw new UDFArgumentLengthException("datediff() requires 2 argument,got "+arguments.length);
        }else{
            //validate each argument and build its converter
            this.inputConverter1=this.checkArguments(arguments,0);
            this.inputConverter2=this.checkArguments(arguments,1);
            //import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
            //import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory;
            //record the primitive category of each argument
            this.inputType1=((PrimitiveObjectInspector)arguments[0]).getPrimitiveCategory();
            this.inputType2=((PrimitiveObjectInspector)arguments[1]).getPrimitiveCategory();
            //the function returns an int, so report a writable int ObjectInspector
            ObjectInspector outputOI=PrimitiveObjectInspectorFactory.writableIntObjectInspector;
            return outputOI;
        }
    }
    

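The overridden initialize first checks the argument count and throws an exception unless exactly two arguments are passed. It then initializes the converter and type fields declared earlier (through checkArguments, shown next) and returns a writable int ObjectInspector as the function's return type.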

    private Converter checkArguments(ObjectInspector[] arguments,int i) throws UDFArgumentException{
        //import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException
        //import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category;
        //reject anything that is not a primitive type
        if(arguments[i].getCategory()!=Category.PRIMITIVE){
            throw new UDFArgumentTypeException(0,"Only primitive type arguments are accepted but "+arguments[i].getTypeName()+" is passed. as first arguments");
        }else {
            //get the primitive category of the argument
            PrimitiveCategory inputType=((PrimitiveObjectInspector)arguments[i]).getPrimitiveCategory();
            Converter converter;
            //pick the converter that matches the concrete type
            switch(inputType){
            case STRING:
            case VARCHAR:
            case CHAR:
                converter=ObjectInspectorConverters.getConverter((PrimitiveObjectInspector)arguments[i],PrimitiveObjectInspectorFactory.writableStringObjectInspector);
                break;
            case TIMESTAMP:
                converter=new TimestampConverter((PrimitiveObjectInspector)arguments[i],PrimitiveObjectInspectorFactory.writableTimestampObjectInspector);
                break;
            case DATE:
                converter=ObjectInspectorConverters.getConverter((PrimitiveObjectInspector)arguments[i],PrimitiveObjectInspectorFactory.writableDateObjectInspector);
                break;
            default:
                throw new UDFArgumentException("DATEDIFF() only take STRING/TIMESTAMP/DATEWRITABLE types as "+ (i+1) +"-th argument,got "+inputType);
            }
            return converter;
        }
    }

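checkArguments first verifies that the argument is a Hive primitive type and throws an exception otherwise. It then picks a converter that matches the concrete type: STRING, VARCHAR and CHAR share one string converter, while TIMESTAMP and DATE each get their own. Any other primitive type also results in an exception.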
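
The control flow of checkArguments (reject unsupported types, let the three text types share one branch through case fall-through, return a converter per type) can be modeled without Hive on the classpath. The sketch below is a simplified stand-in: the ArgType enum, the converterFor method and the use of java.util.function.Function in place of Hive's Converter are all assumptions made for illustration, not Hive APIs.

```java
import java.util.function.Function;

// Simplified model of the checkArguments dispatch; none of these names exist in Hive.
final class ConverterDispatchDemo {
    // Stand-in for Hive's PrimitiveCategory, reduced to a handful of values.
    enum ArgType { STRING, VARCHAR, CHAR, TIMESTAMP, DATE, INT }

    // Returns a converter for supported types and rejects everything else.
    static Function<Object, Object> converterFor(ArgType type, int position) {
        switch (type) {
        case STRING:
        case VARCHAR:
        case CHAR:
            // The three text types fall through to one shared converter.
            return value -> String.valueOf(value);
        case TIMESTAMP:
        case DATE:
            // The real method builds a dedicated converter per type; identity is enough for the model.
            return value -> value;
        default:
            throw new IllegalArgumentException(
                "only STRING/TIMESTAMP/DATE accepted as argument " + (position + 1) + ", got " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(converterFor(ArgType.VARCHAR, 0).apply("2020-01-01"));
        try {
            converterFor(ArgType.INT, 1);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing at this stage means a query with a bad argument type is rejected during initialization, before any row is evaluated.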

    //import java.util.Date; import java.text.ParseException; import java.sql.Timestamp;
    //import org.apache.hadoop.hive.ql.metadata.HiveException;
    //import org.apache.hadoop.hive.serde2.io.DateWritable; import org.apache.hadoop.hive.serde2.io.TimestampWritable;
    private Date convertToDate(PrimitiveCategory inputType,Converter converter,DeferredObject argument) throws HiveException{
        assert converter!=null;
        assert argument!=null;
    
        //a NULL input produces a NULL date
        if(argument.get()==null){
            return null;
        }else {
            Date date=new Date();
            switch(inputType){
            case STRING:
            case VARCHAR:
            case CHAR:
                String dateString=converter.convert(argument.get()).toString();
                try{
                    date=this.formatter.parse(dateString);
                    break;
                }catch(ParseException var8){
                    //an unparsable string is treated as NULL rather than as an error
                    return null;
                }
            case TIMESTAMP:
                Timestamp ts=((TimestampWritable)converter.convert(argument.get())).getTimestamp();
                date.setTime(ts.getTime());
                break;
            case DATE:
                DateWritable dw=(DateWritable)converter.convert(argument.get());
                date=dw.get();
                break;
            default:
                throw new UDFArgumentException("TO_DATE() only takes STRING/TIMESTAMP/DATEWRITABLE types,got "+ inputType);
            }
            return date;
        }
    }

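Given an argument's type, its converter and the deferred argument value, convertToDate returns a java.util.Date. String inputs are parsed with the yyyy-MM-dd pattern; a null input or an unparsable string yields null.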
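
The string branch is the one most likely to surprise. A failed parse yields null instead of an exception, and SimpleDateFormat.parse(String) is allowed to stop before the end of its input, so any text after a valid yyyy-MM-dd prefix is ignored. A JDK-only sketch of that behavior follows; the class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper mirroring the string branch of convertToDate.
final class StringToDateDemo {
    // Returns the parsed date, or null when the text does not start with a yyyy-MM-dd date.
    static Date parseOrNull(String text) {
        SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd");
        formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return formatter.parse(text);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Date plain = parseOrNull("2020-01-15");
        Date withTime = parseOrNull("2020-01-15 10:30:00");
        // The time-of-day suffix is not consumed, so both values are midnight UTC.
        System.out.println(plain.equals(withTime));     // true
        System.out.println(parseOrNull("not a date"));  // null
    }
}
```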
Next come getDisplayString and the evaluate methods:

    public String getDisplayString(String[] children) {
        return this.getStandardDisplayString("datediff", children);
    }
    //computes date - date2 in whole days; returns null if either date is null
    private IntWritable evaluate(Date date,Date date2){
        if(date!=null && date2!=null){
            long diffInMilliSeconds=date.getTime()-date2.getTime();
            this.result.set((int)(diffInMilliSeconds/86400000L));
            return this.result;
        }else{
            return null;
        }
    }
    //the method Hive calls with the argument values: convert both to Date, then delegate
    public IntWritable evaluate(DeferredObject[] arguments) throws HiveException{
        this.output=this.evaluate(this.convertToDate(this.inputType1,this.inputConverter1,arguments[0]),this.convertToDate(this.inputType2,this.inputConverter2,arguments[1]));
        return this.output;
    }

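The private evaluate computes the difference in days between two dates (first minus second) and returns null if either date is null. The public evaluate, which is the one that overrides GenericUDF, converts both arguments with convertToDate and delegates to the private method.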
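
The arithmetic in the private evaluate is an integer division of a millisecond difference by 86,400,000. Under that arithmetic as quoted above, the sign follows the argument order and a remainder of less than a full day is truncated toward zero. The sketch below reproduces only the arithmetic with JDK types; DayDiffDemo and daysBetween are names invented for this example.

```java
import java.util.Date;

// Hypothetical restatement of the arithmetic in the private evaluate method.
final class DayDiffDemo {
    static final long MILLIS_PER_DAY = 86400000L;

    // Returns first - second in whole days, or null if either input is null.
    static Integer daysBetween(Date first, Date second) {
        if (first == null || second == null) {
            return null;
        }
        long diffInMillis = first.getTime() - second.getTime();
        return (int) (diffInMillis / MILLIS_PER_DAY);
    }

    public static void main(String[] args) {
        Date feb1 = new Date(1580515200000L);  // 2020-02-01T00:00:00Z
        Date mar1 = new Date(1583020800000L);  // 2020-03-01T00:00:00Z
        System.out.println(daysBetween(mar1, feb1));  // 29 (2020 is a leap year)
        System.out.println(daysBetween(feb1, mar1));  // -29
        // 36 hours is one and a half days; integer division truncates to 1.
        System.out.println(daysBetween(new Date(36L * 3600000L), new Date(0L)));  // 1
        System.out.println(daysBetween(null, mar1));  // null
    }
}
```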

Summary

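After reading through the source, what stands out is how strictly it defines, checks and converts data types. That is a habit worth carrying into my own development work.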

    Original author: 风筝flying
    Original URL: https://www.jianshu.com/p/1f69a5574e2a