This is an installation guide for a Spark environment. The instructions I found online always felt a bit off: some export the environment variables in spark-env.sh, others configure YARN and then also start the Spark standalone services. I cannot claim my approach is the most standard one, but I think it is at least reasonably sensible.
Installation references
- Spark on YARN installation: http://wuchong.me/blog/2015/04/04/spark-on-yarn-cluster-deploy/
- Hive installation: http://dblab.xmu.edu.cn/blog/install-hive/
Download and extract
- Download the archives for Java, Scala, Hadoop, Spark, Hive, Kafka, and Python 3.x (needed for pyspark)
- Extract them:
tar -zxf xxx.tar.gz
tar -zxf xxx.tgz
- Add each {xxx}_HOME as a global environment variable by appending export {xxx}_HOME=<path> to /etc/profile
- Run source /etc/profile to make the variables take effect
- For reference, my configuration:
export JAVA_HOME=/opt/jdk1.8.0_161
export SCALA_HOME=/opt/scala-2.11.11
export HADOOP_HOME=/opt/hadoop-2.7.6
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_YARN_USER_ENV=${HADOOP_CONF_DIR}
export SPARK_HOME=/opt/spark-2.3.0-bin-hadoop2.7
export HIVE_HOME=/opt/hive-2.3.3-bin
export HIVE_CONF_DIR=${HIVE_HOME}/conf
export PYTHON_HOME=/usr/local/python-3.6.5
export PATH=${JAVA_HOME}/bin:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${SPARK_HOME}/bin:${HIVE_HOME}/bin:${PYTHON_HOME}/bin:$PATH
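A quick sanity check after sourcing /etc/profile (a minimal sketch; the versions in the comments are simply what my installation prints):
java -version           # 1.8.0_161
scala -version          # 2.11.11
hadoop version          # 2.7.6
spark-submit --version  # 2.3.0
hive --version          # 2.3.3
python3 --version       # 3.6.5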
Hadoop installation
Set up passwordless SSH between the nodes
- Log in to each node
- Generate an RSA key pair
$ mkdir ~/.ssh
$ chmod 700 ~/.ssh
$ cd ~/.ssh
$ ssh-keygen -t rsa # press Enter through all the prompts
- Merge the public keys
ssh <host-ip> <command...> means: SSH into <host-ip> and run the given command there
$ ssh node-1 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
...
$ ssh node-N cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
- Distribute the merged key file
scp copies files between hosts
$ scp ~/.ssh/authorized_keys node-1:~/.ssh/
...
$ scp ~/.ssh/authorized_keys node-N:~/.ssh/
- Test: if the date command runs without a password prompt, the setup works
ssh node-1 date
...
ssh node-N date
- Note: even a single-node setup needs this, otherwise you will be asked for the password over and over.
SSH to the machine's own address to test; if no password is required, the setup is correct.
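As an alternative to merging and copying authorized_keys by hand, ssh-copy-id does the same append for you (a sketch; run it on each node, once per target host, following the node names above):
$ ssh-copy-id node-1
...
$ ssh-copy-id node-N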
Configure core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://<hostname>:9000/</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/data/hdfs/tmp</value>
</property>
<property>
<name>fs.trash.interval</name>
<value>1440</value>
<description>Number of minutes between trash checkpoints.
If zero, the trash feature is disabled.
</description>
</property>
Configure hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/data/hdfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/data/hdfs/data</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value><hostname>:9001</value>
</property>
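The local paths referenced in core-site.xml and hdfs-site.xml are not created automatically if their parent directory is not writable; a minimal sketch, run on every node (adjust ownership to the user that runs Hadoop):
sudo mkdir -p /data/hdfs/name /data/hdfs/data /data/hdfs/tmp
sudo chown -R $(id -un):$(id -gn) /data/hdfs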
Configure mapred-site.xml
<!-- mapred.job.tracker is a Hadoop 1.x (MRv1) setting; on Hadoop 2.x, run MapReduce on YARN instead -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Configure yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value><hostname>:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value><hostname>:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value><hostname>:8035</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value><hostname>:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value><hostname>:8088</value>
</property>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- Adjust the values below to your own machines; YARN does not detect available resources automatically -->
<property>
<description>Amount of physical memory, in MB, that can be allocated for containers.</description>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3036</value>
</property>
<property>
<description>The minimum allocation for every container request at the RM,
in MBs. Memory requests lower than this won't take effect,
and the specified value will get allocated at minimum.</description>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<description>The maximum allocation for every container request at the RM,
in MBs. Memory requests higher than this won't take effect,
and will get capped to this value.</description>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2560</value>
</property>
</configuration>
Configure yarn-env.sh / hadoop-env.sh
Add JAVA_HOME at the end of both yarn-env.sh and hadoop-env.sh:
export JAVA_HOME=/opt/jdk1.8.0_161
This is needed even though JAVA_HOME is already exported globally, presumably because the Hadoop scripts launch the daemons over non-interactive SSH sessions that do not source /etc/profile.
Configure the slaves file
Note: use the internal (private) network IPs
slave1
...
slaveN
Format the NameNode
hdfs namenode -format # 'hadoop namenode -format' also works but is deprecated in Hadoop 2.x
Start the daemons
${hadoop_home}/sbin/start-yarn.sh
${hadoop_home}/sbin/start-dfs.sh
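To confirm the daemons came up (a quick check, not exhaustive):
jps                     # on a single node this should list NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager
hdfs dfsadmin -report   # live DataNodes
yarn node -list         # registered NodeManagers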
Hive installation
Install MySQL
See the MySQL installation guide
Download the MySQL JDBC driver and copy it into ${HIVE_HOME}/lib
Make sure the driver matches the MySQL version you installed
Configure hive-site.xml
- Copy hive-default.xml.template to hive-site.xml (note that the file name changes)
- Add these properties:
<property>
<name>datanucleus.fixedDatastore</name>
<value>false</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoCreateTables</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoCreateColumns</name>
<value>true</value>
</property>
<property>
<name>system:java.io.tmpdir</name>
<value>/tmp</value>
</property>
<property>
<name>system:user.name</name>
<value>localadmin</value>
</property>
- Modify these properties:
<property>
<name>hive.metastore.uris</name>
<value>thrift://<hostname>:9083</value>
<description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
<!-- replace <your-mysql-password> with your MySQL password -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value><your-mysql-password></value>
<description>password to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>mysqladmin</value>
<!-- <value>APP</value> -->
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://<hostname>:3306/hive?createDatabaseIfNotExist=true</value>
<description>
JDBC connect string for a JDBC metastore.
To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<!-- <value>org.apache.derby.jdbc.EmbeddedDriver</value> -->
<description>Driver class name for a JDBC metastore</description>
</property>
<!-- disabling schema verification works around some version-mismatch errors -->
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
<description>
Enforce metastore schema version consistency.
True: Verify that version information stored in is compatible with one from Hive jars. Also disable automatic
schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
proper metastore schema migration. (Default)
False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
</description>
</property>
<property>
<name>hive.default.fileformat</name>
<value>Orc</value>
<!-- <value>TextFile</value> -->
<description>
Expects one of [textfile, sequencefile, rcfile, orc].
Default file format for CREATE TABLE statement. Users can explicitly override it by CREATE TABLE ... STORED AS [FORMAT]
</description>
</property>
<property>
<name>hive.merge.mapredfiles</name>
<value>true</value>
<description>Merge small files at the end of a map-reduce job</description>
</property>
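Instead of relying on the datanucleus auto-create settings above, the metastore schema can also be initialized explicitly with schematool, which ships with Hive 2.x (a sketch, assuming the MySQL connection settings from hive-site.xml):
${HIVE_HOME}/bin/schematool -dbType mysql -initSchema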
Start the Hive metastore
hive --service metastore &
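A quick smoke test, assuming the metastore listens on port 9083 as configured above:
ss -lnt | grep 9083          # the metastore should be listening
hive -e "show databases;"    # should print at least 'default'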
Spark installation
Configuration reference: http://spark.apache.org/docs/2.3.0/configuration.html
Add the following to spark-defaults.conf and adjust as needed
spark.eventLog.enabled true
spark.eventLog.dir hdfs://<hostname>:9000/eventLogs
spark.eventLog.compress true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.master yarn
spark.driver.cores 1
spark.driver.memory 800m
spark.executor.cores 1
spark.executor.memory 1000m
spark.executor.instances 1
spark.sql.warehouse.dir hdfs://<hostname>:9000/user/hive/warehouse
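spark.eventLog.dir must already exist in HDFS, otherwise applications fail at startup; a one-time step matching the path above:
hdfs dfs -mkdir -p /eventLogs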
Configure spark-env.sh
# pyspark needs Python 3.x; skip this if you do not use Python
export PYSPARK_PYTHON=/usr/local/python-3.6.5/bin/python
export PYSPARK_DRIVER_PYTHON=python
Configure Spark to read and write Hive
Copy ${HIVE_HOME}/conf/hive-site.xml into ${SPARK_HOME}/conf/.
Otherwise the Hive warehouse that Spark reads and writes is separate from the one Hive itself uses.
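Concretely (paths taken from the environment variables configured earlier):
cp ${HIVE_HOME}/conf/hive-site.xml ${SPARK_HOME}/conf/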
Start spark-shell to check that the installation works
${spark_home}/bin/spark-shell
If the spark and sc objects are created without errors after startup, the configuration is working.
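To also exercise the YARN side, submitting the bundled SparkPi example makes a reasonable smoke test (a sketch; the examples jar name matches the Spark 2.3.0 / Scala 2.11 distribution and may differ for other builds):
${SPARK_HOME}/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  ${SPARK_HOME}/examples/jars/spark-examples_2.11-2.3.0.jar 100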
Kafka installation (skip this if you do not use Kafka)
Configure config/server.properties
- log.dirs=/tmp/kafka-logs : where Kafka persists topics, messages, and other state; better moved somewhere other than /tmp
Configure config/zookeeper.properties
- dataDir=/tmp/zookeeper : the directory where the ZooKeeper snapshot is stored
Multi-broker setup
https://kafka.apache.org/quickstart#quickstart_multibroker
Start Kafka
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
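A quick produce/consume round trip to verify the broker (a sketch using the quickstart CLI flags; older kafka-topics.sh releases take --zookeeper, 2.2+ take --bootstrap-server):
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
# in another terminal:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning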
Python 3 installation
See the Python 3 installation guide
Summary
- spark-env.sh and spark-defaults.conf have many equivalent settings, but the options documented on the Spark website are mostly given in the spark-defaults style, so when changing configuration I try to touch only spark-defaults.conf.
- I wrote this down so it is handy the next time I have to set up an environment; if it helps someone else too, even better.
- If you run into any problems, leave a comment; I will fix mistakes and answer what I can.