Background:
Our production Kylin deployment has more and more cubes, and data volume keeps growing over time, so build times drag out longer and longer (the more jobs run concurrently, the longer each MR job takes, so we cap the number of concurrent MR jobs).
This delays when data becomes available. The current requirement is to see data from within the last hour, no longer the earlier T-1.
To meet it, we made four optimizations:
First, we changed the automated build script: the first run of a day builds today's segment, and every later run rebuilds it (so that it includes the day's latest data); a sketch of the underlying calls follows this list.
Second, on the first build of each day we also rebuild yesterday's segment (in case the last few dozen minutes of yesterday's data missed yesterday's final build).
Third, we shortened the build interval to 10-30 minutes, so data becomes visible sooner.
Fourth, we switched Kylin's build engine from MR to Spark, which clearly improved build speed and shortened time to data visibility.
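As context for the first two points, here is a minimal sketch of the calls such a script can make via Kylin's REST API (PUT /kylin/api/cubes/{cube}/rebuild). The cube name my_cube, the localhost:7070 address, the default ADMIN:KYLIN credentials, and the concrete timestamps are placeholders; startTime/endTime are epoch milliseconds:
# first run of the day: build today's segment
curl -X PUT -H "Authorization: Basic QURNSU46S1lMSU4=" \
     -H "Content-Type: application/json" \
     -d '{"startTime": 1514736000000, "endTime": 1514822400000, "buildType": "BUILD"}' \
     http://localhost:7070/kylin/api/cubes/my_cube/rebuild
# every later run of the day: the same call with "buildType": "REFRESH",
# which rebuilds the existing segment so it picks up the newest data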
This post covers how to configure and implement the fourth point.
1. Create a hadoop-conf folder under the Kylin directory.
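For example, assuming $KYLIN_HOME points at the Kylin install directory:
mkdir -p $KYLIN_HOME/hadoop-conf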
2. Symlink the cluster's configuration files into that directory:
ln -s /etc/hadoop/conf/hdfs-site.xml $KYLIN_HOME/hadoop-conf/hdfs-site.xml
ln -s /etc/hadoop/conf/yarn-site.xml $KYLIN_HOME/hadoop-conf/yarn-site.xml
ln -s /etc/hadoop/conf/core-site.xml $KYLIN_HOME/hadoop-conf/core-site.xml
ln -s /etc/hbase/conf/hbase-site.xml $KYLIN_HOME/hadoop-conf/hbase-site.xml
ln -s /etc/hive/conf/hive-site.xml $KYLIN_HOME/hadoop-conf/hive-site.xml
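Optionally list the links and their targets to check nothing is missing:
ls -l $KYLIN_HOME/hadoop-conf/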
3. Modify Kylin's configuration file to point at that directory (use the absolute path of your own install; it must match the folder created in step 1):
## kylin.properties:
kylin.env.hadoop-conf-dir=/usr/local/apache-kylin-2.1.0-bin-hbase1x/hadoop-conf
4. Upload the jars that Spark jobs depend on to HDFS (so they are not re-uploaded on every run):
jar cv0f spark-libs.jar -C $KYLIN_HOME/spark/jars/ .
hadoop fs -mkdir -p /kylin/spark/
hadoop fs -put spark-libs.jar /kylin/spark/
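Optionally confirm the archive is in place:
hadoop fs -ls /kylin/spark/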
4.1 The corresponding setting (it also appears in the full listing in section 5):
## kylin.properties — after the steps above, the config becomes:
#kylin.engine.spark-conf.spark.yarn.archive=hdfs://sandbox.hortonworks.com:8020/kylin/spark/spark-libs.jar
kylin.engine.spark-conf.spark.yarn.archive=hdfs://nameservice1:8020/kylin/spark/spark-libs.jar
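Note that nameservice1 is the HDFS HA nameservice from our cluster's core-site.xml (fs.defaultFS); substitute your own namenode address or nameservice.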
5. The Spark engine section of kylin.properties (tuning parameters included):
#### SPARK ENGINE CONFIGS ###
#
## Hadoop conf folder, will export this as "HADOOP_CONF_DIR" to run spark-submit
## This must contain site xmls of core, yarn, hive, and hbase in one folder
##kylin.env.hadoop-conf-dir=/etc/hadoop/conf
kylin.env.hadoop-conf-dir=/usr/local/apps/apache-kylin-2.2.0-bin/hadoop-conf
#
## Estimate the RDD partition numbers
#kylin.engine.spark.rdd-partition-cut-mb=10
kylin.engine.spark.rdd-partition-cut-mb=100
#
## Minimal partition numbers of rdd
#kylin.engine.spark.min-partition=1
#
## Max partition numbers of rdd
#kylin.engine.spark.max-partition=5000
#
## Spark conf (default is in spark/conf/spark-defaults.conf)
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster
## extra off-heap headroom, reserved because YARN occasionally kills executors for exceeding memory limits
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.yarn.driver.memoryOverhead=256
kylin.engine.spark-conf.spark.yarn.queue=default
kylin.engine.spark-conf.spark.executor.memory=9G
#kylin.engine.spark-conf.spark.executor.cores=2
kylin.engine.spark-conf.spark.executor.cores=2
kylin.engine.spark-conf.spark.executor.instances=9
kylin.engine.spark-conf.spark.storage.memoryFraction=0.5
#kylin.engine.spark-conf.spark.shuffle.memoryFraction=0.3
#kylin.engine.spark-conf.spark.default.parallelism=9
kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.eventLog.dir=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs\:///kylin/spark-history
#kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
#
## manually upload spark-assembly jar to HDFS and then set this property will avoid repeatedly uploading jar at runtime
kylin.engine.spark-conf.spark.yarn.archive=hdfs://nameservice1:8020/kylin/spark/spark-libs.jar
##kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
#
## uncomment for HDP
##kylin.engine.spark-conf.spark.driver.extraJavaOptions=-Dhdp.version=current
##kylin.engine.spark-conf.spark.yarn.am.extraJavaOptions=-Dhdp.version=current
##kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=current
#
#
#### QUERY PUSH DOWN ###
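Two follow-ups. First, the Spark event-log/history directory configured above must exist in HDFS before the first build; create it if it is not there yet:
hadoop fs -mkdir -p /kylin/spark-history
Second, the properties alone do not switch an existing cube over: per the official cube_spark tutorial linked below, edit the cube in the designer and set "Cube Engine" to "Spark" on the Advanced Setting page; the next build will then run on Spark.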
6. For tuning the Spark-related parameters above, see the following two posts and adjust them to your own cluster and workload:
http://blog.csdn.net/u010936936/article/details/78095165
http://kylin.apache.org/docs21/tutorial/cube_spark.html
7. Actual results
For cubes built incrementally by day or every few hours, build speed improved roughly threefold.
For full builds, the speedup was minimal.