Installing Spark 2.3 on YARN

These are notes on setting up a Spark environment. The install guides I found online all felt a bit off: some put the configuration into spark-env.sh as environment variables, others configure YARN and then also start the Spark standalone services on top of it. I cannot promise my approach is the most standard, but I think it is at least more reasonable.

Installation reference

Unpack the archives

  1. Download the java, scala, hadoop, spark, hive, kafka, and python 3.x (needed for pyspark) packages
  2. Unpack them: tar -zxf xxx.tar.gz / tar -zxf xxx.tgz
  3. Add each {xxx}_HOME to the global environment variables
    add export {xxx}_HOME=<path> to /etc/profile
  4. Run source /etc/profile to make the variables take effect
  5. For reference, my configuration:
export JAVA_HOME=/opt/jdk1.8.0_161
export SCALA_HOME=/opt/scala-2.11.11
export HADOOP_HOME=/opt/hadoop-2.7.6
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_YARN_USER_ENV=${HADOOP_CONF_DIR}
export SPARK_HOME=/opt/spark-2.3.0-bin-hadoop2.7
export HIVE_HOME=/opt/hive-2.3.3-bin
export HIVE_CONF_DIR=${HIVE_HOME}/conf
export PYTHON_HOME=/usr/local/python-3.6.5

export PATH=${JAVA_HOME}/bin:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${SPARK_HOME}/bin:${HIVE_HOME}/bin:${PYTHON_HOME}/bin:$PATH
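A quick way to confirm the variables are picked up is to open a fresh login shell (or re-run source /etc/profile) and check that each tool resolves on the PATH; a minimal sketch (the exact version output depends on your downloads):

    java -version
    scala -version
    hadoop version
    hive --version
    ${SPARK_HOME}/bin/spark-submit --version
    python3 --version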

Hadoop installation

Set up passwordless SSH between the cluster nodes

  1. Log in to each node
  2. Generate an RSA key pair
    $ mkdir ~/.ssh
    $ chmod 700 ~/.ssh
    $ cd ~/.ssh
    $ ssh-keygen -t rsa  # press Enter through every prompt
  3. Collect the public keys
    ssh <host-ip> <command...> connects to the host and runs the given command
    $ ssh node1 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    ...
    $ ssh nodeN cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    $ chmod 600 ~/.ssh/authorized_keys
  4. Distribute the merged key file (a shorter alternative using ssh-copy-id is sketched after this list)
    scp copies files between hosts
    $ scp ~/.ssh/authorized_keys node1:~/.ssh/
    ...
    $ scp ~/.ssh/authorized_keys nodeN:~/.ssh/
  5. Test: if the date command runs without asking for a password, the setup works
    ssh node1 date
    ...
    ssh nodeN date
  6. Note: do this even if you only have one node, otherwise you will be asked for the password over and over.
    ssh to the node's own address to test; if no password is required, the setup is correct.
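Where the ssh-copy-id helper is available, the collect-and-distribute steps above can also be done in one pass per host; a rough sketch, with node1..nodeN standing in for your own host names:

    # Appends the local public key to each host's authorized_keys (prompts for the password once per host)
    for host in node1 node2 nodeN; do
        ssh-copy-id "$host"
    done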

Configure core-site.xml

    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://<hostname>:9000/</value>
    </property>
    <property>
         <name>hadoop.tmp.dir</name>
         <value>file:/data/hdfs/tmp</value>
    </property>
    <property> 
        <name>fs.trash.interval</name>    
        <value>1440</value>    
        <description>Number of minutes between trash checkpoints.    
        If zero, the trash feature is disabled.
        </description>    
    </property>
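To double-check that clients actually pick this file up, hdfs getconf prints the effective values without any daemon running; for example:

    hdfs getconf -confKey fs.defaultFS        # should print hdfs://<hostname>:9000/
    hdfs getconf -confKey fs.trash.interval   # should print 1440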

Configure hdfs-site.xml

    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data/hdfs/data</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value><hostname>:9001</value>
    </property> 
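The name, data, and tmp paths above live on the local filesystem of each node; creating them up front with the right ownership avoids permission surprises later. A sketch, assuming the layout above and the user that will run the daemons:

    # On every node (the name dir matters on the namenode, the data dir on the datanodes)
    mkdir -p /data/hdfs/tmp /data/hdfs/name /data/hdfs/data
    chown -R $(whoami) /data/hdfs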

Configure mapred-site.xml

Note: mapred.job.tracker is a Hadoop 1.x (MRv1) property; if you want MapReduce jobs (for example those launched by Hive) to run on YARN, you would normally also set mapreduce.framework.name to yarn here.

    <property>
        <name>mapred.job.tracker</name>
        <value><hostname>:9001</value>
    </property>

Configure yarn-site.xml

    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
            <name>yarn.resourcemanager.address</name>
            <value><hostname>:8032</value>
        </property>
        <property>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value><hostname>:8030</value>
        </property>
        <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value><hostname>:8035</value>
        </property>
        <property>
            <name>yarn.resourcemanager.admin.address</name>
            <value><hostname>:8033</value>
        </property>
        <property>
            <name>yarn.resourcemanager.webapp.address</name>
            <value><hostname>:8088</value>
        </property>
        <property>
            <name>yarn.nodemanager.pmem-check-enabled</name>
            <value>false</value>
        </property>
        <property>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
        </property>
        <!-- Adjust the values below to your own environment; YARN does not detect the machine's resources automatically -->
        <property>
            <description>Amount of physical memory, in MB, that can be allocated for containers.</description>
            <name>yarn.nodemanager.resource.memory-mb</name>
            <value>3036</value>
        </property>
        <property>
            <description>The minimum allocation for every container request at the RM,
                         in MBs. Memory requests lower than this won't take effect,
                         and the specified value will get allocated at minimum.</description>
            <name>yarn.scheduler.minimum-allocation-mb</name>
            <value>128</value>
        </property>
        <property>
            <description>The maximum allocation for every container request at the RM,
                         in MBs. Memory requests higher than this won't take effect,
                         and will get capped to this value.</description>
            <name>yarn.scheduler.maximum-allocation-mb</name>
            <value>2560</value>
        </property>
    </configuration>
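Once the ResourceManager is up (the start commands are further down), the web UI on port 8088 and the REST API report the memory and vcore totals these settings produce, which makes a quick sanity check; a sketch, with <hostname> being the ResourceManager host:

    # Cluster-wide memory/vcore totals as seen by the scheduler
    curl http://<hostname>:8088/ws/v1/cluster/metrics
    # Per-node view
    yarn node -list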

Configure yarn-env.sh / hadoop-env.sh

At the end of yarn-env.sh and hadoop-env.sh, set JAVA_HOME:
export JAVA_HOME=/opt/jdk1.8.0_161
This is needed even though JAVA_HOME is already exported in /etc/profile, because the start scripts launch the daemons over ssh in non-login shells, which do not source /etc/profile.

Configure the slaves file

Note: use the nodes' internal (intranet) addresses

    slave1
    ...
    slaveN

Format the NameNode

hadoop namenode -format
(on Hadoop 2.x this is a deprecated alias for hdfs namenode -format; both do the same thing)

Start the daemons

${HADOOP_HOME}/sbin/start-yarn.sh
${HADOOP_HOME}/sbin/start-dfs.sh
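A quick check that everything came up (process names and counts will vary with your cluster layout):

    # Daemons that should be running somewhere in the cluster:
    # NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager
    jps
    # HDFS should report the expected number of live datanodes
    hdfs dfsadmin -report
    # YARN should list the node managers
    yarn node -list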

Hive installation

Install MySQL

See the MySQL installation document.

Download the MySQL JDBC driver and copy it into ${HIVE_HOME}/lib

Make sure the driver matches the version of MySQL you installed.

Configure hive-site.xml

  1. Copy hive-default.xml.template to hive-site.xml; note that the file name changes
  2. Add the following properties (a schema-initialization alternative is sketched after this list):
    <property>
        <name>datanucleus.fixedDatastore</name>
        <value>false</value>
    </property>
    <property>
        <name>datanucleus.autoCreateSchema</name>
        <value>true</value>
    </property>
    <property>
        <name>datanucleus.autoCreateTables</name>
        <value>true</value>
    </property>
    <property>
        <name>datanucleus.autoCreateColumns</name>
        <value>true</value>
    </property>
    <property>
        <name>system:java.io.tmpdir</name>
        <value>/tmp</value>
    </property>
    <property>
        <name>system:user.name</name>
        <value>localadmin</value>
    </property>
  3. Modify the following existing properties:
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://<hostname>:9083</value>
        <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
    </property>
    <!-- Replace <your-mysql-password> with your MySQL password -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value><your-mysql-password></value>
        <description>password to use against metastore database</description>
    </property>
     <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>mysqladmin</value>
        <!-- <value>APP</value> -->
        <description>Username to use against metastore database</description>
      </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://<hostname>:3306/hive?createDatabaseIfNotExist=true</value>
        <description>
          JDBC connect string for a JDBC metastore.
          To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
          For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
        </description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <!-- <value>org.apache.derby.jdbc.EmbeddedDriver</value> -->
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <!-- Used to work around some errors -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
        <description>
          Enforce metastore schema version consistency.
          True: Verify that version information stored in is compatible with one from Hive jars.  Also disable automatic
                schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
                proper metastore schema migration. (Default)
          False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
        </description>
    </property>
    <property>
        <name>hive.default.fileformat</name>
        <value>Orc</value>
        <!-- <value>TextFile</value> -->
        <description>
          Expects one of [textfile, sequencefile, rcfile, orc].
          Default file format for CREATE TABLE statement. Users can explicitly override it by CREATE TABLE ... STORED AS [FORMAT]
        </description>
    </property>
    <property>
        <name>hive.merge.mapredfiles</name>
        <value>true</value>
        <description>Merge small files at the end of a map-reduce job</description>
    </property>  
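The datanucleus.autoCreate* settings above let Hive create the metastore tables on demand. Hive 2.x also ships a schematool that initializes the metastore schema explicitly up front, which some people prefer; a minimal sketch, assuming MySQL as the backing database and the connection settings from hive-site.xml:

    # One-off initialization of the metastore schema in MySQL
    ${HIVE_HOME}/bin/schematool -dbType mysql -initSchema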

Start the Hive metastore

hive --service metastore &
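A quick way to confirm the metastore came up (9083 is whatever port you put in hive.metastore.uris):

    # The metastore listens on the thrift port from hive.metastore.uris
    netstat -tln | grep 9083
    # And the Hive CLI should be able to reach it
    hive -e "show databases;"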

Spark installation

Configuration reference: http://spark.apache.org/docs/2.3.0/configuration.html

Add the following to spark-defaults.conf and adjust as needed

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://<hostname>:9000/eventLogs
spark.eventLog.compress          true

spark.serializer                 org.apache.spark.serializer.KryoSerializer

spark.master                    yarn 
spark.driver.cores              1
spark.driver.memory             800m 
spark.executor.cores            1
spark.executor.memory           1000m
spark.executor.instances        1

spark.sql.warehouse.dir         hdfs://<hostname>:9000/user/hive/warehouse
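spark.eventLog.dir points at an HDFS path that Spark expects to exist; if it is missing, applications tend to fail at startup with a missing-directory error. A quick sketch to create it (and the warehouse path) once HDFS is running:

    hdfs dfs -mkdir -p /eventLogs
    hdfs dfs -mkdir -p /user/hive/warehouse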

Configure spark-env.sh

# pyspark needs Python 3.x; skip this if you do not use Python
export PYSPARK_PYTHON=/usr/local/python-3.6.5/bin/python
export PYSPARK_DRIVER_PYTHON=python

Configure Spark to read and write Hive

Copy ${HIVE_HOME}/conf/hive-site.xml into ${SPARK_HOME}/conf/.
Otherwise the Hive warehouse that Spark reads and writes will be separate from the one Hive itself uses.
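Concretely, with the HOME variables defined earlier:

    cp ${HIVE_HOME}/conf/hive-site.xml ${SPARK_HOME}/conf/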

Start spark-shell to verify the installation

${SPARK_HOME}/bin/spark-shell
If the spark and sc objects are created without errors after startup, the configuration is working.
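For an end-to-end check that jobs really run on YARN, submitting the bundled SparkPi example works well; a sketch, where the examples jar name matches the Spark 2.3.0 / Scala 2.11 distribution used here:

    ${SPARK_HOME}/bin/spark-submit \
        --master yarn \
        --class org.apache.spark.examples.SparkPi \
        ${SPARK_HOME}/examples/jars/spark-examples_2.11-2.3.0.jar 10
    # The application should appear in the YARN web UI (port 8088) and print "Pi is roughly 3.14..."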

Kafka installation (skip this if you do not use Kafka)

Configure config/server.properties

  • log.dirs=/tmp/kafka-logs : where Kafka persists topics, messages, and other state; better to move it out of /tmp

Configure config/zookeeper.properties

  • dataDir=/tmp/zookeeper : the directory where the snapshot is stored

Multi-broker setup

https://kafka.apache.org/quickstart#quickstart_multibroker

Start Kafka

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
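A quick smoke test once both processes are up (the topic name, partition count, and replication factor are just placeholders; on newer Kafka releases --zookeeper has been replaced by --bootstrap-server):

    bin/kafka-topics.sh --create --zookeeper localhost:2181 \
        --replication-factor 1 --partitions 1 --topic test
    bin/kafka-topics.sh --list --zookeeper localhost:2181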

Python 3 installation

See the Python 3 installation document.

Summary

  1. spark-env.sh and spark-defaults.conf have many equivalent settings, but the options documented on the Spark site are mostly in the spark-defaults style, so when changing configuration I try to touch only spark-defaults.conf.
  2. I am writing this down so I can reuse it the next time I set up an environment; if it also helps someone else, all the better.
  3. If you run into problems, leave a comment. I will fix any mistakes and share what I know.
    Original author: 祗談風月
    Original article: https://www.jianshu.com/p/a4ef73428097
    This article is reposted from the web for knowledge sharing only; if it infringes any rights, please contact the blogger to have it removed.