Setup for cluster
Add User
sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
sudo usermod -a -G sudo hadoop (add the hadoop user to the sudo group)
Set env
export JAVA_HOME=jdk_path (eg: /usr/lib/jvm/java-6-sun)
export PATH=${JAVA_HOME}/bin:${JAVA_HOME}/jre/bin:${PATH}
export HADOOP_HOME=hadoop_root (eg: export HADOOP_HOME=/usr/local/hadoop)
export PATH=$PATH:$HADOOP_HOME/bin
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
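To make these settings persistent for the hadoop user, they can be appended to ~/.bashrc (a minimal sketch, assuming bash and the example paths above; adjust JAVA_HOME/HADOOP_HOME to your installation):
cat >> ~/.bashrc <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export PATH=${JAVA_HOME}/bin:${JAVA_HOME}/jre/bin:${PATH}
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
EOF
source ~/.bashrc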
Hadoop config
- hadoop-env.sh
vi etc/hadoop/hadoop-env.sh
change JAVA_HOME
export JAVA_HOME=jdk_path (eg: /usr/lib/jvm/java-6-sun)
- yarn-env.sh
vi etc/hadoop/yarn-env.sh
change JAVA_HOME
export JAVA_HOME=jdk_path (eg: /usr/lib/jvm/java-6-sun)
Configuring all machines
- configure all machines
su hadoop
ssh-keygen -t rsa -P ""
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
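With the default StrictModes setting, sshd ignores authorized_keys files whose permissions are too open, so it is worth tightening them on every machine:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys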
- modify /etc/hosts on all machines
192.168.202.92 master(hostname)
192.168.202.13 slave
Attention: 1. master/slave must be the actual hostnames, because MapReduce
uses the hostname; 2. remove any other bindings for the master/slave hostnames.
127.0.0.1 localhost
#127.0.1.1 sh030 (attention: with this binding in place, the slave cannot connect to the master at sh030:54310)
192.168.202.92 sh030
192.168.202.13 zxx-desktop
192.168.0.62 jack-desktop
- copy the master's id_rsa.pub into the slave's authorized_keys
cat id_rsa.pub | ssh hadoop@slave "cat >> /home/hadoop/.ssh/authorized_keys"
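Before starting any daemons, verify that the hadoop user can log in everywhere without a password prompt (using the names from the hosts file; each command should print the remote hostname without asking for anything):
ssh hadoop@slave hostname
ssh hadoop@master hostname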
- configure master only
cat etc/hadoop/masters
master (hostname)
cat etc/hadoop/slaves (used only by the helper scripts such as sbin/start-dfs.sh)
master
slave
Attention: the master/slave entries must match the hostnames in the hosts file
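With the example hosts file above (master sh030, slave zxx-desktop), the two files would look roughly like this:
cat etc/hadoop/masters
sh030
cat etc/hadoop/slaves
sh030
zxx-desktop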
etc/hadoop/*-site.xml for all machines
- core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/data/hadoop</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
- hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>testHadoop-162:50090</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/hdfs/name</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>file:///data/hdfs/checkpoint</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/hdfs/data</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.support.append</name>
<value>true</value>
</property>
<property>
<name>dfs.support.broken.append</name>
<value>true</value>
</property>
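The local directories referenced above (hadoop.tmp.dir and the dfs.*.dir paths) should exist and be owned by the hadoop user on every node before the first format; a sketch, assuming the example paths:
sudo mkdir -p /data/hadoop /data/hdfs/name /data/hdfs/checkpoint /data/hdfs/data
sudo chown -R hadoop:hadoop /data/hadoop /data/hdfs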
- yarn-site.xml
<property>
<name>yarn.resourcemanager.address</name>
<value>sh030:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>sh030:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>sh030:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>sh030:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>sh030:8088</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
- mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>master-hadoop:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master-hadoop:19888</value>
</property>
</configuration>
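The same *-site.xml files are needed on every machine; one simple way to distribute them is to copy them from the master (a sketch, assuming $HADOOP_HOME is the same path on all nodes):
scp $HADOOP_HOME/etc/hadoop/*-site.xml hadoop@slave:$HADOOP_HOME/etc/hadoop/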
Formatting the HDFS filesystem
bin/hadoop namenode -format
Attention: if moving from a single-node setup to a cluster, delete everything
under /data/hadoop first; otherwise the slave DataNodes cannot start.
rm -fr /data/hadoop/*
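The stale DataNode data lives on the slaves as well, so a full reset clears both nodes before re-formatting from the master; a sketch, assuming the /data/hadoop and /data/hdfs paths configured above:
rm -fr /data/hadoop/* /data/hdfs/* (on the master)
ssh hadoop@slave 'rm -fr /data/hadoop/* /data/hdfs/*'
bin/hadoop namenode -format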
Launch HDFS
./sbin/start-dfs.sh
will launch the NameNode, SecondaryNameNode and DataNodes (the master also runs a DataNode)
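To confirm that both DataNodes registered with the NameNode (the report should show 2 live datanodes for this setup):
bin/hdfs dfsadmin -report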
MapReduce
./sbin/start-yarn.sh
jps on Master
29252 DataNode (master also acts as a slave)
29940 NodeManager (master also acts as a slave)
29051 NameNode
29732 ResourceManager
29515 SecondaryNameNode
jps on Slave
27858 DataNode
28116 NodeManager
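Besides jps, the web UIs are a quick health check: the ResourceManager UI at http://sh030:8088 (from yarn-site.xml) and the NameNode UI at http://master:50070 (50070 is the default NameNode HTTP port in Hadoop 2.x). The JobHistory web UI configured in mapred-site.xml only comes up after the history server is started separately:
./sbin/mr-jobhistory-daemon.sh start historyserver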
Stopping the cluster
- MapReduce
./sbin/stop-yarn.sh
- HDFS
./sbin/stop-dfs.sh
Test
- put data
hadoop fs -mkdir /testdata
hadoop fs -put -f ./*.txt /testdata
- mapreduce
hadoop jar ./share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.3.0-sources.jar org.apache.hadoop.examples.WordCount /testdata /testdata-output
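If the sources jar fails with a ClassNotFoundException, the compiled examples jar shipped one directory up can be used instead (an assumed alternative, using the wordcount driver name):
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar wordcount /testdata /testdata-output
- check output
hadoop fs -ls /testdata-output
hadoop fs -cat /testdata-output/part-r-00000 (the reducer output is typically named part-r-00000)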