HA: Principles and Setup

Today's main topics

  • Understanding how HDFS High Availability Using the Quorum Journal Manager works

  • Setting up HA

  • A summary of common ZooKeeper use cases

    • HDFS high availability

    • ResourceManager (RM) high availability

    • HBase's hard dependency on ZooKeeper

    • Kafka's hard dependency on ZooKeeper

I. HDFS High Availability

An introduction to HA

1. Background

The single point of failure (SPOF) problem
  • Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster.

  • Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.

How the SPOF is solved
  • The HDFS High Availability feature addresses the above problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. (Hadoop removes the SPOF by running an active/standby NameNode pair in a single cluster, with the standby acting as a hot backup.)

  • This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.

2. Architecture

  • In a typical HA cluster, two separate machines are configured as NameNodes.

  • At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state.

  • The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

  • In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs).

  • When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. (Sharing the edit log through the JNs is what keeps the two NameNodes' metadata in sync; note that the JNs hold only edits, not the fsimage.)

  • Note that, in an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. (The Standby takes over the checkpointing role: it periodically merges the edits it has read from the JournalNodes into a new fsimage.)

  • The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace.

  • In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.

  • In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.

  • It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results.

  • In order to ensure this property and prevent the so-called “split-brain scenario,” the JournalNodes will only ever allow a single NameNode to be a writer at a time.

  • During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.

3. Hardware resources

  • NameNode machines

    • The machines on which you run the Active and Standby NameNodes should have equivalent hardware to each other, and equivalent hardware to what would be used in a non-HA cluster.

  • JournalNode machines

    • The machines on which you run the JournalNodes.

    • The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager. (Because JournalNodes are lightweight, they can share nodes with other processes.)

    • Note: There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs.

    • This will allow the system to tolerate the failure of a single machine. You may also run more than 3 JournalNodes, but in order to actually increase the number of failures the system can tolerate, you should run an odd number of JNs (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the system can tolerate at most (N - 1) / 2 failures and continue to function normally (e.g., 3 JNs tolerate 1 failure, 5 JNs tolerate 2).

4. HA from the perspective of NameNode state control

[figure: NameNode state control via ZKFC and ZooKeeper]

  • Automatic failover is implemented with ZooKeeper: ZooKeeper stores the state of the two NameNodes (Active or Standby), while the actual metadata lives on the JournalNodes.

  • A ZKFailoverController (ZKFC) process runs on each NameNode host. It continuously monitors the health of its local NameNode and can switch that NameNode's state, calling Hadoop's APIs to perform the transition.

  • Because the ZKFC runs locally on the NameNode host, its health checks involve no network I/O, so failures can be detected and acted on promptly.

  • When the Active NameNode fails, the ZKFC on that host detects it through its health monitor and gives up the lock it holds in ZooKeeper; this marks the Active as down, and the Standby's ZKFC wins the election and promotes its NameNode.

  • The above is how the NameNode state transition is carried out.
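
As a quick way to observe these state transitions, the hdfs haadmin tool can query and switch NameNode states. A minimal sketch, assuming the nn1/nn2 NameNode IDs configured in the setup section below:

    # Query the current HA state of each NameNode (prints "active" or "standby")
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2

    # Initiate a coordinated manual failover from nn1 to nn2
    # (may be used even when automatic failover is enabled)
    hdfs haadmin -failover nn1 nn2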

5. HA from the perspective of data synchronization between the NameNodes

[figure: metadata synchronization between the NameNodes through the JournalNodes]

  • Once automatic failover is in place, states can be switched successfully, but for the new Active to actually serve requests, one more guarantee is needed: the two NameNodes must hold identical data.

Data synchronization serves two purposes

① Speeding up the next NameNode startup: the Standby also takes on the SecondaryNameNode role, periodically merging accumulated edits into a fresh fsimage (a fresh fsimage means fewer edits to replay on startup)

② Achieving high availability: enabling fast failover

  • The metadata of the two NameNodes must be synchronized promptly

    • JournalNodes are introduced to synchronize the namespace state between NN1 and NN2: they share the edit log between the NameNodes (the fsimage itself is produced by checkpointing, not stored on the JNs)

    • There must be at least three JournalNodes, preferably an odd number

    • Whenever the Standby produces a newer fsimage through checkpointing, it is transferred to the Active NN1, which replaces its old fsimage with it

    • The Standby NN2 carries the checkpointing duty (see the sketch after this list)

      • A checkpoint is triggered by:

        • elapsed time

        • the number of accumulated transactions

      • When a checkpoint is triggered, NN2 merges the edits it has read from the JournalNodes into its namespace and saves a new fsimage

  • Both NameNodes must hold the state of all DataNodes at the same time

    • This requires that DataNodes report their state to both NN1 and NN2, while accepting commands only from the Active, never from the Standby
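
The two checkpoint triggers above map to standard HDFS settings, dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns. A minimal sketch of inspecting them on a running cluster:

    # Time-based trigger: seconds between checkpoints (default 3600)
    hdfs getconf -confKey dfs.namenode.checkpoint.period

    # Transaction-based trigger: uncheckpointed transactions that force a checkpoint (default 1000000)
    hdfs getconf -confKey dfs.namenode.checkpoint.txns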

II. Setting up HA

hdfs-site.xml configuration

1. dfs.nameservices
  • the logical name for this new nameservice

  • The name you choose is arbitrary. It will be used both for configuration and as the authority component of absolute HDFS paths in the cluster. (Any name works; here we use mycluster.)

      <property>
        <name>dfs.nameservices</name>
        <value>mycluster</value>
      </property>
    
2. dfs.ha.namenodes.[nameservice ID]
  • unique identifiers for each NameNode in the nameservice

  • This will be used by DataNodes to determine all the NameNodes in the cluster. (Give each of the two NameNodes its own ID.)

      <property>
        <name>dfs.ha.namenodes.mycluster</name>
        <value>nn1,nn2</value>
      </property>
    
  • Here the two NameNodes are named nn1 and nn2

3. dfs.namenode.rpc-address.[nameservice ID].[name node ID]
  • the fully-qualified RPC address for each NameNode to listen on (the RPC address and port of each NameNode)

      <property>
        <name>dfs.namenode.rpc-address.mycluster.nn1</name>
        <value>master:9000</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.mycluster.nn2</name>
        <value>slave1:9000</value>
      </property>
    
4. dfs.namenode.http-address.[nameservice ID].[name node ID]
  • the fully-qualified HTTP address for each NameNode to listen on (the web UI address of each NameNode)

      <property>
        <name>dfs.namenode.http-address.mycluster.nn1</name>
        <value>master:50070</value>
      </property>
      <property>
        <name>dfs.namenode.http-address.mycluster.nn2</name>
        <value>slave1:50070</value>
      </property>
    
5. dfs.namenode.shared.edits.dir
  • the URI which identifies the group of JNs where the NameNodes will write/read edits (identifies the JournalNode group)

      <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://slave1:8485;slave2:8485;slave3:8485/mycluster</value>
      </property>
    
  • Here the JournalNodes are slave1, slave2, and slave3

  • the default port for the JournalNode is 8485

6. dfs.client.failover.proxy.provider.[nameservice ID]
  • the Java class that HDFS clients use to contact the Active NameNode

      <property>
        <name>dfs.client.failover.proxy.provider.mycluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>
    
  • The two implementations which currently ship with Hadoop are the ConfiguredFailoverProxyProvider and the RequestHedgingProxyProvider (which, for the first call, concurrently invokes all namenodes to determine the active one, and on subsequent requests, invokes the active namenode until a fail-over happens), so use one of these unless you are using a custom proxy provider. (Choose one of these two classes as the value.)
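
In practice this means clients address the cluster only by its logical name, never by a specific NameNode host; the proxy provider resolves the current Active for them. A quick check, assuming the mycluster nameservice defined above:

    # List the root of the nameservice; works regardless of which NameNode is Active
    hdfs dfs -ls hdfs://mycluster/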

7. dfs.ha.fencing.methods
  • a list of scripts or Java classes which will be used to fence the Active NameNode during a failover

  • It is desirable for correctness of the system that only one NameNode be in the Active state at any given time. (Only one NameNode may be Active at a time.)

  • Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario. (Only one NameNode may write to the JournalNodes.)

  • There are two methods which ship with Hadoop: shell and sshfence. (These are the only two fencing methods bundled with Hadoop.)

  • Here we choose sshfence

    • sshfence – SSH to the Active NameNode and kill the process

    • Because sshfence is used, the dfs.ha.fencing.ssh.private-key-files property must also be configured, pointing to a private key that allows passwordless SSH to the NameNode hosts (see the check after the configuration below)

        <property>
          <name>dfs.ha.fencing.methods</name>
          <value>sshfence</value>
        </property>
        
        <property>
          <name>dfs.ha.fencing.ssh.private-key-files</name>
          <value>/home/exampleuser/.ssh/id_rsa</value>
        </property>
      
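
A quick prerequisite check for sshfence: the NameNode that wins the election must be able to SSH to the other NameNode host without a passphrase, using the configured key. A minimal sketch with the hosts and key path used above:

    # From master, verify passwordless SSH to the other NameNode host (and vice versa)
    ssh -i /home/exampleuser/.ssh/id_rsa slave1 hostname
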
8. dfs.journalnode.edits.dir
  • the path where the JournalNode daemon will store its local state

      <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/home/journal/node/local/data</value>
      </property>
    
9. dfs.ha.automatic-failover.enabled
  • enables automatic failover (this also requires the ZooKeeper quorum configured in core-site.xml below)

      <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
      </property>
    

core-site.xml configuration

1. fs.defaultFS
  • the default path prefix used by the Hadoop FS client when none is given

  • Set the default filesystem to the logical nameservice mycluster

      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://mycluster</value>
      </property>
    
2. ha.zookeeper.quorum
  • the host:port list of the machines providing the ZooKeeper service (used by the ZKFCs for automatic failover)

      <property>
          <name>ha.zookeeper.quorum</name>
          <value>master:2181,slave1:2181,slave2:2181,slave3:2181,standbymaster:2181</value>
      </property>
    
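
With the configuration in place, the cluster can be brought up. Order matters: the JournalNodes must be running before the NameNode is formatted. A minimal sketch of a first-time bring-up, assuming the hosts used above and Hadoop 2.x script names (your distribution's paths may differ):

    # 1. On each JournalNode host (slave1, slave2, slave3): start the JournalNode
    hadoop-daemon.sh start journalnode

    # 2. On nn1 (master): format the namespace and start the NameNode
    hdfs namenode -format
    hadoop-daemon.sh start namenode

    # 3. On nn2 (slave1): copy nn1's metadata, then start the second NameNode
    hdfs namenode -bootstrapStandby
    hadoop-daemon.sh start namenode

    # 4. On one NameNode host: initialize the HA state znode in ZooKeeper
    hdfs zkfc -formatZK

    # 5. From one node: start the remaining daemons (DataNodes and ZKFCs)
    start-dfs.sh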

III. ResourceManager (RM) High Availability

1. How RM HA works

[figure: ResourceManager Active/Standby failover via ZooKeeper]

  • ResourceManager HA is realized through an Active/Standby architecture

  • at any point of time, one of the RMs is Active, and one or more RMs are in Standby mode waiting to take over should anything happen to the Active. The trigger to transition-to-active comes from either the admin (through CLI) or through the integrated failover-controller when automatic-failover is enabled. (There may be one or more Standby RMs.)

  • The RMs have an option to embed the Zookeeper-based ActiveStandbyElector to decide which RM should be the Active.

  • When the Active goes down or becomes unresponsive, another RM is automatically elected to be the Active which then takes over.

  • Note that there is no need to run a separate ZKFC daemon as is the case for HDFS, because the ActiveStandbyElector embedded in the RMs acts as both the failure detector and the leader elector.

2. Setting up RM HA

The following properties go in yarn-site.xml:

      <!-- Enable RM HA -->
      <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
      </property>

      <!-- Cluster ID -->
      <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>cluster1</value>
      </property>

      <!-- Logical IDs for the two RMs -->
      <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2</value>
      </property>

      <!-- Hostname of resourceManager1 -->
      <property>
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>master1</value>
      </property>

      <!-- Hostname of resourceManager2 -->
      <property>
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>master2</value>
      </property>

      <!-- Web UI address of resourceManager1 -->
      <property>
        <name>yarn.resourcemanager.webapp.address.rm1</name>
        <value>master1:8088</value>
      </property>

      <!-- Web UI address of resourceManager2 -->
      <property>
        <name>yarn.resourcemanager.webapp.address.rm2</name>
        <value>master2:8088</value>
      </property>

      <!-- ZooKeeper quorum -->
      <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>zk1:2181,zk2:2181,zk3:2181</value>
      </property>
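
Once both ResourceManagers are up, their HA state can be verified with yarn rmadmin. A minimal sketch, assuming the rm1/rm2 IDs configured above:

    # Query the HA state of each ResourceManager (prints "active" or "standby")
    yarn rmadmin -getServiceState rm1
    yarn rmadmin -getServiceState rm2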

IV. Summary: common ZooKeeper use cases

Systems that depend on ZooKeeper today include:

① HBase depends strongly on ZooKeeper
  • ZooKeeper stores the state of the HMaster and the RegionServers

  • ZooKeeper also stores the entry point for locating HBase metadata: which RegionServer hosts the root catalog table (-ROOT- in older HBase versions; hbase:meta in newer ones)

② HDFS high availability
  • ZooKeeper stores the state (Active/Standby) of the two NameNodes

③ RM high availability
  • ZooKeeper stores the state (Active/Standby) of the two ResourceManagers

④ Kafka depends strongly on ZooKeeper
  • What ZooKeeper does in a Kafka cluster

    • Tracks broker state

      • When a broker goes down, a follower replica of each affected partition is promoted to leader; this election is coordinated through ZooKeeper

    • Tracks consumer progress (for the old ZooKeeper-based consumers)

      • i.e., the offset each consumer has reached within each partition
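
To make these use cases concrete, the znodes each system registers can be listed with the ZooKeeper CLI. A minimal sketch, assuming the mycluster and cluster1 IDs used earlier; the exact paths vary with version and configuration:

    # HDFS HA: lock and breadcrumb znodes maintained by the ZKFCs
    zkCli.sh -server master:2181 ls /hadoop-ha/mycluster

    # YARN RM HA: leader-election znodes maintained by the embedded elector
    zkCli.sh -server master:2181 ls /yarn-leader-election/cluster1

    # Kafka: ephemeral registrations of live brokers
    zkCli.sh -server master:2181 ls /brokers/ids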