Environment
An Ubuntu 14.04 virtual machine.
Hadoop version: 2.6.0.
Creating a User
To isolate Hadoop from other software, create a dedicated user hduser in a new group hadoop to run Hadoop:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
Configuring Passwordless SSH Login
Hadoop uses SSH to manage its nodes, so passwordless SSH login must be set up for the relevant remote machines as well as for the local machine.
First, generate an SSH key pair. The public key is written to ~/.ssh/id_rsa.pub by default:
ssh-keygen -t rsa -P ""
Append the public key to ~/.ssh/authorized_keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Try logging in to localhost over ssh; it should no longer ask for a password:
ssh localhost
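If ssh still prompts for a password, the usual culprit is overly permissive permissions on the SSH files. A quick fix, assuming the default ~/.ssh layout:
chmod 700 ~/.ssh                    # only the owner may access the SSH directory
chmod 600 ~/.ssh/authorized_keys    # only the owner may read/write the key list
Then try ssh localhost again.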
Installing Dependencies
Make sure a JDK is installed on the machine. To install a JDK:
sudo apt-get update
sudo apt-get install default-jdk
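To confirm that the JDK was installed and is on the PATH (the exact version string will vary):
java -version
javac -version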
Make sure ssh is installed on the node and that sshd is running. To install ssh:
sudo apt-get install ssh
Install rsync:
sudo apt-get install rsync
Installing Hadoop
Download the Hadoop release:
wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
Extract the archive:
tar xfz hadoop-2.6.0.tar.gz
Set an environment variable for the Hadoop root path. This installation uses /usr/local/hadoop, and several later commands rely on this path:
export HADOOP_HOME=/usr/local/hadoop
Move the extracted directory to the Hadoop root path:
sudo mv hadoop-2.6.0 $HADOOP_HOME
Change the owner of the installation directory to hduser:
sudo chown -R hduser $HADOOP_HOME
Configuring Hadoop
1. Set the JAVA_HOME environment variable
The value for the JAVA_HOME environment variable can be found with the following command:
update-alternatives --config java
On this machine the output is:
There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
Nothing to configure.
So JAVA_HOME should be the part before /jre/bin/java, namely:
/usr/lib/jvm/java-7-openjdk-amd64
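Alternatively, the same path can be derived in one line by resolving the java symlink; this is just a convenience sketch that assumes the /jre/bin/java suffix shown above:
readlink -f /usr/bin/java | sed 's:/jre/bin/java::'   # strip the suffix to get JAVA_HOME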
2. Configure .bashrc
I will put the relevant paths in .bashrc so that they are loaded automatically when the user logs in. Edit .bashrc with Vim:
vim ~/.bashrc
Add the following environment variables to .bashrc:
#HADOOP
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
Apply the environment variables:
source ~/.bashrc
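To verify that the variables took effect, the following should print the two paths and the Hadoop version (output will vary with your setup):
echo $JAVA_HOME
echo $HADOOP_HOME
hadoop version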
3. Configure $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file to set JAVA_HOME:
sudo vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Find the JAVA_HOME variable in it and change it to:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
4. Configure $HADOOP_HOME/etc/hadoop/core-site.xml
Edit the $HADOOP_HOME/etc/hadoop/core-site.xml file:
sudo vim $HADOOP_HOME/etc/hadoop/core-site.xml
Add the HDFS configuration between <configuration> and </configuration> (HDFS is configured to listen on port 9000):
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
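Note that fs.default.name is the deprecated spelling of fs.defaultFS in Hadoop 2.x; both keys resolve to the same setting. As a quick sanity check after editing, the effective value can be read back with:
hdfs getconf -confKey fs.defaultFS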
5. Configure $HADOOP_HOME/etc/hadoop/yarn-site.xml
Edit the $HADOOP_HOME/etc/hadoop/yarn-site.xml file:
sudo vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
Add the following between <configuration> and </configuration>:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
6. Configure $HADOOP_HOME/etc/hadoop/mapred-site.xml
The distribution ships a configuration template at $HADOOP_HOME/etc/hadoop/mapred-site.xml.template; first copy it to $HADOOP_HOME/etc/hadoop/mapred-site.xml:
cp $HADOOP_HOME/etc/hadoop/mapred-site.xml{.template,}
Edit the $HADOOP_HOME/etc/hadoop/mapred-site.xml file:
sudo vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following between <configuration> and </configuration>:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
7. Prepare the data storage directories
Suppose the data is to be stored under /mnt/hdfs. For convenience, define it as an environment variable:
export HADOOP_DATA_DIR=/mnt/hdfs
Create the storage directories for the NameNode and DataNode, and change the owner of both directories to hduser:
sudo mkdir -p $HADOOP_DATA_DIR/namenode
sudo mkdir -p $HADOOP_DATA_DIR/datanode
sudo chown hduser $HADOOP_DATA_DIR/namenode
sudo chown hduser $HADOOP_DATA_DIR/datanode
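A quick listing confirms that both directories exist and are owned by hduser:
ls -ld $HADOOP_DATA_DIR/namenode $HADOOP_DATA_DIR/datanode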
8. Configure $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Edit the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file:
sudo vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the NameNode and DataNode settings between <configuration> and </configuration>, as follows:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/mnt/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/mnt/hdfs/datanode</value>
</property>
9. Format the HDFS filesystem
Format the HDFS filesystem with the following command:
hdfs namenode -format
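If the format succeeds, the NameNode directory should now contain a current/ subdirectory holding the initial filesystem metadata (a VERSION file and an fsimage). A quick check, assuming the /mnt/hdfs layout above:
ls $HADOOP_DATA_DIR/namenode/current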
Starting Hadoop
Start HDFS:
start-dfs.sh
Start YARN:
start-yarn.sh
The web consoles for HDFS and YARN listen on ports 50070 and 8088 by default.
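Once the daemons are up, the consoles can be opened in a browser at http://localhost:50070 and http://localhost:8088, or probed from the shell (assuming curl is installed):
curl -sI http://localhost:50070 | head -n 1   # should print an HTTP status line
curl -sI http://localhost:8088 | head -n 1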
If everything is working, jps shows the running Hadoop services; on my machine the output is:
29117 NameNode
29675 ResourceManager
29278 DataNode
30002 NodeManager
30123 Jps
29469 SecondaryNameNode
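When you are finished, the services can be stopped with the matching scripts in $HADOOP_HOME/sbin:
stop-yarn.sh
stop-dfs.sh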
Running a Hadoop Job
Below, the well-known WordCount example is used to show how to use Hadoop.
1. Prepare the program package
Here is the WordCount source code.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts for each word.
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] is the input path in HDFS, args[1] is the output path.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Compile the code and package it into a jar:
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
wc.jar is the packaged Hadoop MapReduce program.
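To double-check what went into the package, list the jar contents; it should include WordCount.class along with the inner TokenizerMapper and IntSumReducer classes:
jar tf wc.jar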
2. Prepare the input files
Our Hadoop MapReduce program reads its input from HDFS and writes its output back to HDFS. This article uses wordcount/input and wordcount/output as the test program's input and output directories (relative HDFS paths resolve under the user's home directory, /user/hduser).
Create the input directory on HDFS:
hdfs dfs -mkdir -p wordcount/input
Prepare some text files as test data; the two files used in this article are as follows:
File 1: input1
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
File 2: input2
Apache Hadoop 2.6.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.4.1.
Here is a short overview of the major features and improvements.
Common
Authentication improvements when using an HTTP proxy server. This is useful when accessing WebHDFS via a proxy server.
A new Hadoop metrics sink that allows writing directly to Graphite.
Specification work related to the Hadoop Compatible Filesystem (HCFS) effort.
HDFS
Support for POSIX-style filesystem extended attributes. See the user documentation for more details.
Using the OfflineImageViewer, clients can now browse an fsimage via the WebHDFS API.
The NFS gateway received a number of supportability improvements and bug fixes. The Hadoop portmapper is no longer required to run the gateway, and the gateway is now able to reject connections from unprivileged ports.
The SecondaryNameNode, JournalNode, and DataNode web UIs have been modernized with HTML5 and Javascript.
YARN
YARN’s REST APIs now support write/modify operations. Users can submit and kill applications through REST APIs.
The timeline store in YARN, used for storing generic and application-specific information for applications, supports authentication through Kerberos.
The Fair Scheduler supports dynamic hierarchical user queues, user queues are created dynamically at runtime under any specified parent-queue.
Copy these two files to wordcount/input:
hdfs dfs -copyFromLocal input* wordcount/input/
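Confirm that the files actually landed in HDFS:
hdfs dfs -ls wordcount/input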
3. Run the program
Execute the program on Hadoop:
hadoop jar wc.jar WordCount wordcount/input wordcount/output
The results are placed in wordcount/output; list the output directory:
hdfs dfs -ls wordcount/output
View the output:
hdfs dfs -cat wordcount/output/part-r-00000
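WordCount's output is sorted by word rather than by count, so a small pipeline on top of the HDFS shell is handy for seeing the most frequent words:
hdfs dfs -cat wordcount/output/part-r-00000 | sort -k2,2nr | head   # top words by count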