ZooKeeper Source Code Analysis: Leader/Follower Startup, Leader Election, and Leader/Follower Establishment (based on 3.4.6)

1. Overview: ZooKeeper Leader/Follower Startup, Leader Election, and Leader/Follower Establishment

First, take a look at the following diagram:

[Figure: zookeeper启动.png, the overall flow of startup, leader election, and Leader/Follower establishment]

The image above is rather large; for a clear preview it is best to download the full-size version from 百度云. We will now walk through the flow step by step.

2. QuorumPeerMain Parses the Configuration File and Builds the QuorumPeer

The following code mainly reads settings from the configuration file and builds the QuorumPeer.

// Start the QuorumPeer according to the given QuorumPeerConfig
public void runFromConfig(QuorumPeerConfig config) throws IOException {
    LOG.info("QuorumPeerConfig : " + config);
  try {
      ManagedUtil.registerLog4jMBeans();
  } catch (JMException e) {
      LOG.warn("Unable to register log4j JMX control", e);
  }

  LOG.info("Starting quorum peer");
  try {                                                                         // 1. In a ZooKeeper cluster, each QuorumPeer represents one server
      ServerCnxnFactory cnxnFactory = ServerCnxnFactory.createFactory();
      cnxnFactory.configure(config.getClientPortAddress(), config.getMaxClientCnxns());

      quorumPeer = new QuorumPeer();
      quorumPeer.setClientPortAddress(config.getClientPortAddress());
      quorumPeer.setTxnFactory(new FileTxnSnapLog(                              // 2. Set the FileTxnSnapLog (this class wraps TxnLog and SnapShot)
              new File(config.getDataLogDir()),
              new File(config.getDataDir())));
      quorumPeer.setQuorumPeers(config.getServers());                           // 3. All servers in the cluster
      quorumPeer.setElectionType(config.getElectionAlg());                      // 4. The leader election algorithm (default 3, i.e. FastLeaderElection)
      quorumPeer.setMyid(config.getServerId());                                 // 5. Each QuorumPeer gets a myid to distinguish the nodes in the cluster
      quorumPeer.setTickTime(config.getTickTime());
      quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());           // 6. Minimum client session timeout (defaults to tickTime * 2 if unset)
      quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());           // 7. Maximum client session timeout (defaults to tickTime * 20 if unset)
      quorumPeer.setInitLimit(config.getInitLimit());                           // 8. Mostly used as initLimit * tickTime, the deadline for getEpochToPropose (waiting for epoch values from a quorum of nodes), waitForEpochAck (during leader establishment the Leader sends LEADERINFO to all nodes and Followers reply with ACKEPOCH), and waitForNewLeaderAck (the Leader sends NEWLEADER to Followers and waits for the corresponding ACKs)
      quorumPeer.setSyncLimit(config.getSyncLimit());                           // 9. Commonly used as self.tickTime * self.syncLimit to bound the soTimeout of the sockets between cluster nodes
      quorumPeer.setQuorumVerifier(config.getQuorumVerifier());                 // 10. The quorum verifier: by default a vote passes once more than half agree (default QuorumMaj)
      quorumPeer.setCnxnFactory(cnxnFactory);                                   // 11. The ServerCnxnFactory used to accept client connections (plain Java NIO or Netty) (note: the plain NIO class does not work around the well-known Java NIO 100% CPU bug)
      quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));     // 12. Set the ZKDatabase
      quorumPeer.setLearnerType(config.getPeerType());                          // 13. The node's learner type (participant / observer)
      quorumPeer.setSyncEnabled(config.getSyncEnabled());                       // 14. "Enables/Disables sync request processor. This option is enabled by default and is to be used with observers." I.e. whether an Observer uses the SyncRequestProcessor
      quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());

      quorumPeer.start();                                                       // 15. Start the service
      LOG.info("quorumPeer.join begin");
      quorumPeer.join();                                                        // 16. Block until the quorumPeer thread finishes; see the Javadoc of Thread.join() ("Waits for this thread to die.")
      LOG.info("quorumPeer.join end");
  } catch (InterruptedException e) {
      // warn, but generally this is ok
      LOG.warn("Quorum Peer interrupted", e);
  }
}
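Step 10 sets the QuorumVerifier, which decides when a set of votes or ACKs counts as "enough". The default QuorumMaj boils down to a simple-majority check; a minimal sketch of the idea (not the verbatim 3.4.6 source):

// Minimal sketch of the majority-based quorum check (after QuorumMaj in 3.4.6).
// 'half' is the number of voting members divided by two; any set of more than
// 'half' members constitutes a quorum.
public class SimpleMajorityVerifier {
    private final int half;

    public SimpleMajorityVerifier(int votingMembers) {
        this.half = votingMembers / 2;
    }

    // e.g. 5 voting members -> half = 2 -> any set of 3 or more is a quorum
    public boolean containsQuorum(java.util.Set<Long> ackSet) {
        return ackSet.size() > half;
    }
}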
3. QuorumPeer Startup

The main steps: load data into the DataTree, start listening for client connections, and start leader election; after that the program stays inside the while loop of QuorumPeer.run().

public synchronized void start() {
    loadDataBase();           // Load data from the SnapShot / TxnLog files into the DataTree
    cnxnFactory.start();      // Start listening on the server's client port
    startLeaderElection();    // Start the leader election machinery
    super.start();            // Start the thread, entering QuorumPeer.run()
}
4. QuorumPeer.loadDataBase

Load data from the snapshot / TxnLog into the DataTree.

  // After this runs, the currentEpoch and acceptedEpoch files exist, and the DataTree has been loaded
  private void loadDataBase() {
      File updating = new File(getTxnFactory().getSnapDir(),                // 1. The updatingEpoch marker file lives in the snapshot directory
                               UPDATING_EPOCH_FILENAME);
      try {
          zkDb.loadDataBase();                                              // 2. Load the dataTree and sessionsWithTimeouts from the snapshot and TxnLog

          // load the epochs
          long lastProcessedZxid = zkDb.getDataTree().lastProcessedZxid;    // 3. The latest zxid this zkDb has processed
          long epochOfZxid = ZxidUtils.getEpochFromZxid(lastProcessedZxid); // 4. The high 32 bits of the zxid are the epoch; the low 32 bits are the counter
          try {
              currentEpoch = readLongFromFile(CURRENT_EPOCH_FILENAME);      // 5. Load the epoch from file (if the currentEpoch file is missing, execution jumps into the catch below; on a fresh install that is the normal path)
              if (epochOfZxid > currentEpoch && updating.exists()) {        // 6. This means the QuorumPeer crashed after takeSnapShot but before updating currentEpoch
                  LOG.info("{} found. The server was terminated after " +
                           "taking a snapshot but before updating current " +
                           "epoch. Setting current epoch to {}.",
                           UPDATING_EPOCH_FILENAME, epochOfZxid);
                  setCurrentEpoch(epochOfZxid);
                  if (!updating.delete()) {
                      throw new IOException("Failed to delete " +
                                            updating.toString());
                  }
              }
          } catch(FileNotFoundException e) {
            // pick a reasonable epoch number
            // this should only happen once when moving to a
            // new code version
            currentEpoch = epochOfZxid;                                    // 7. The currentEpoch file does not exist, so we land here
            LOG.info(CURRENT_EPOCH_FILENAME
                    + " not found! Creating with a reasonable default of {}. This should only happen when you are upgrading your installation",
                    currentEpoch);
            writeLongToFile(CURRENT_EPOCH_FILENAME, currentEpoch);
          }
          if (epochOfZxid > currentEpoch) {
            throw new IOException("The current epoch, " + ZxidUtils.zxidToString(currentEpoch) + ", is older than the last zxid, " + lastProcessedZxid);
          }
          try {
            acceptedEpoch = readLongFromFile(ACCEPTED_EPOCH_FILENAME);     // 8. Read the last accepted epoch from file
          } catch(FileNotFoundException e) {
            // pick a reasonable epoch number
            // this should only happen once when moving to a
            // new code version
            acceptedEpoch = epochOfZxid;                                   // 9. Reading the acceptedEpoch file failed, so run this instead
            LOG.info(ACCEPTED_EPOCH_FILENAME
                    + " not found! Creating with a reasonable default of {}. This should only happen when you are upgrading your installation",
                    acceptedEpoch);
            writeLongToFile(ACCEPTED_EPOCH_FILENAME, acceptedEpoch);       // 10. Write the acceptedEpoch value straight to its file
          }
          if (acceptedEpoch < currentEpoch) {
            throw new IOException("The current epoch, " + ZxidUtils.zxidToString(currentEpoch) + " is less than the accepted epoch, " + ZxidUtils.zxidToString(acceptedEpoch));
          }
      } catch(IOException ie) {
          LOG.error("Unable to load database on disk", ie);
          throw new RuntimeException("Unable to run quorum server ", ie);
      }
}
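Step 4 above depends on the zxid layout: the high 32 bits hold the epoch and the low 32 bits a per-epoch counter. A minimal sketch of the bit arithmetic, mirroring what ZxidUtils.getEpochFromZxid / ZxidUtils.makeZxid do:

// Sketch of the zxid <-> (epoch, counter) bit arithmetic used by ZxidUtils.
public final class ZxidBits {
    // high 32 bits: the epoch
    static long epochOf(long zxid) {
        return zxid >> 32L;
    }
    // low 32 bits: the counter within the epoch
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }
    // combine an epoch and a counter into a single zxid
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32L) | (counter & 0xffffffffL);
    }
}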
5. QuorumPeer.startLeaderElection

Create the leader election algorithm and start the QuorumCnxManager.Listener, which listens for connections from the other nodes in the cluster.

// Start the leader election
synchronized public void startLeaderElection() {
  try {
    currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());  // 1. Create a vote for ourselves
  } catch(IOException e) {
    RuntimeException re = new RuntimeException(e.getMessage());
    re.setStackTrace(e.getStackTrace());
    throw re;
  }
    for (QuorumServer p : getView().values()) {                                // 2. Look ourselves up among all servers in the cluster
        if (p.id == myid) {
            myQuorumAddr = p.addr;
            break;
        }
    }
    if (myQuorumAddr == null) {
        throw new RuntimeException("My id " + myid + " not in the peer list");
    }
    if (electionType == 0) {                                                   // 3. Only the legacy type 0 uses this UDP responder; the default election type is 3 (FastLeaderElection)
        try {
            udpSocket = new DatagramSocket(myQuorumAddr.getPort());
            responder = new ResponderThread();
            responder.start();
        } catch (SocketException e) {
            throw new RuntimeException(e);
        }
    }
    this.electionAlg = createElectionAlgorithm(electionType);                  // 4. Create the Election
}

protected Election createElectionAlgorithm(int electionAlgorithm){
    Election le=null;
            
    //TODO: use a factory rather than a switch
    switch (electionAlgorithm) {
    case 0:
        le = new LeaderElection(this);
        break;
    case 1:
        le = new AuthFastLeaderElection(this);
        break;
    case 2:
        le = new AuthFastLeaderElection(this, true);
        break;
    case 3:                                                                 // 1. The default leader election algorithm
        qcm = new QuorumCnxManager(this);
        QuorumCnxManager.Listener listener = qcm.listener;                  // 2. The listener waits for the other cluster members to connect
        if(listener != null){
            listener.start();
            le = new FastLeaderElection(this, qcm);
        } else {
            LOG.error("Null listener when initializing cnx manager");
        }
        break;
    default:
        assert false;
    }
    return le;
}

Whenever the Listener accepts a connection from another node, it handles it as follows.

6. QuorumCnxManager.Listener
/**
 * Sleeps on accept().
 */
@Override
public void run() {
    int numRetries = 0;
    InetSocketAddress addr;
    while((!shutdown) && (numRetries < 3)){       // 1. An open question: what if numRetries really reaches 3 and the loop exits? (See the log at the bottom: the node can no longer take part in elections)
        try {
            ss = new ServerSocket();
            ss.setReuseAddress(true);
            if (self.getQuorumListenOnAllIPs()) { // 2. quorumListenOnAllIPs defaults to false
                int port = self.quorumPeers.get(self.getId()).electionAddr.getPort();
                addr = new InetSocketAddress(port);
            } else {
                addr = self.quorumPeers.get(self.getId()).electionAddr;
            }
            LOG.info("My election bind port: " + addr.toString());
            setName(self.quorumPeers.get(self.getId()).electionAddr
                    .toString());
            ss.bind(addr);
            while (!shutdown) {
                Socket client = ss.accept();     // 3. Blocks here until a connection arrives
                setSockOpts(client);             // 4. Set the socket's connection options
                LOG.info("Received connection request " + client.getRemoteSocketAddress());
                receiveConnection(client);
                numRetries = 0;
            }
        } catch (IOException e) {
            LOG.error("Exception while listening", e);
            numRetries++;
            try {
                ss.close();
                Thread.sleep(1000);
            } catch (IOException ie) {
                LOG.error("Error closing server socket", ie);
            } catch (InterruptedException ie) {
                LOG.error("Interrupted while sleeping. " +
                          "Ignoring exception", ie);
            }
        }
    }
    LOG.info("Leaving listener");
    if (!shutdown) {
        LOG.error("As I'm leaving the listener thread, "
                + "I won't be able to participate in leader "
                + "election any longer: "
                + self.quorumPeers.get(self.getId()).electionAddr);
    }
}
7. QuorumCnxManager.Listener.receiveConnection

To avoid duplicate connections, between any two nodes only the peer with the larger myid may keep a connection to the one with the smaller myid. Once a connection is established, a SendWorker and a RecvWorker handle sending and receiving messages on it.

/**
 * If this server receives a connection request, then it gives up on the new
 * connection if it wins. Notice that it checks whether it has a connection
 * to this server already or not. If it does, then it sends the smallest
 * possible long value to lose the challenge.
 * 
 */
public boolean receiveConnection(Socket sock) {                 // 1. Accept a connection from another node in the cluster
    Long sid = null;
    LOG.info("sock:"+sock);
    try {
        // Read server id
        DataInputStream din = new DataInputStream(sock.getInputStream());
        sid = din.readLong();                                   // 2. Read the peer's myid (this first long may instead be a protocol version)
        LOG.info("sid:"+sid);
        if (sid < 0) { // this is not a server id but a protocol version (see ZOOKEEPER-1633)
            sid = din.readLong();
            LOG.info("sid:"+sid);
            // next comes the #bytes in the remainder of the message
            int num_remaining_bytes = din.readInt();            // 3. Read the length of the rest of the message
            byte[] b = new byte[num_remaining_bytes];           // 4. Allocate a buffer of that length
            // remove the remainder of the message from din
            int num_read = din.read(b);                         // 5. Read the message body (one might wonder whether a short read / split packet can occur here)
            if (num_read != num_remaining_bytes) {              // 6. The buffer was not filled; log it
                LOG.error("Read only " + num_read + " bytes out of " + num_remaining_bytes + " sent by server " + sid);
            }
        }
        if (sid == QuorumPeer.OBSERVER_ID) {                    // 7. The connecting peer is an observer
            /*
             * Choose identifier at random. We need a value to identify
             * the connection.
             */
            
            sid = observerCounter--;
            LOG.info("Setting arbitrary identifier to observer: " + sid);
        }
    } catch (IOException e) {                                   // 8. An EOFException here means the peer has disconnected and there is nothing left to read, so just close the socket
        closeSocket(sock);
        LOG.warn("Exception reading or writing challenge: " + e.toString() + ", sock:"+sock);
        return false;
    }
    
    //If wins the challenge, then close the new connection.
    if (sid < self.getId()) {                                   // 9. To avoid duplicate connections, only the larger myid may connect to the smaller one
        /*
         * This replica might still believe that the connection to sid is
         * up, so we have to shut down the workers before trying to open a
         * new connection.
         */
        SendWorker sw = senderWorkerMap.get(sid);               // 10. If a SendWorker for this sid already exists, shut it down
        if (sw != null) {
            sw.finish();
        }

        /*
         * Now we start a new connection
         */
        LOG.debug("Create new connection to server: " + sid);
        closeSocket(sock);                                      // 11. Close this socket
        connectOne(sid);                                        // 12. Our myid is larger than the peer's, so we connect actively instead

        // Otherwise start worker threads to receive data.
    } else {                                                    // 13. Our myid is smaller than the peer's
        SendWorker sw = new SendWorker(sock, sid);              // 14. Create the SendWorker
        RecvWorker rw = new RecvWorker(sock, sid, sw);          // 15. Create the RecvWorker
        sw.setRecv(rw); 

        SendWorker vsw = senderWorkerMap.get(sid);              // 16. If an older SendWorker exists, shut it down
        
        if(vsw != null)
            vsw.finish();
        
        senderWorkerMap.put(sid, sw);
        
        if (!queueSendMap.containsKey(sid)) {                   // 17. If no send queue exists yet for this myid, create one
            queueSendMap.put(sid, new ArrayBlockingQueue<ByteBuffer>(
                    SEND_CAPACITY));
        }
        
        sw.start();                                             // 18. Start the send and receive threads
        rw.start();
        
        return true;    
    }
    return false;
}
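For completeness: the initiating side applies the same rule in reverse. After sending its own sid, it gives up the connection if the remote sid is larger, expecting that peer to dial back. A simplified sketch of the idea behind QuorumCnxManager.initiateConnection (not the verbatim 3.4.6 code):

// Sketch of the initiator's side of the myid "challenge": a peer must never
// keep a connection it opened to a peer with a larger myid.
void initiateConnectionSketch(Socket sock, long remoteSid, long mySid) throws IOException {
    DataOutputStream dout = new DataOutputStream(sock.getOutputStream());
    dout.writeLong(mySid);      // identify ourselves first
    dout.flush();

    if (remoteSid > mySid) {
        // We lose the challenge: close this socket and let the peer with the
        // larger myid connect back to us instead.
        sock.close();
    } else {
        // We win: keep the socket and start the SendWorker/RecvWorker pair,
        // exactly as receiveConnection() does on the accepting side.
    }
}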
8. QuorumPeer.run

The program ultimately stays inside QuorumPeer.run(), cycling through the states LOOKING -> LEADING/FOLLOWING -> LOOKING -> ... over and over.

@Override
public void run() {
    setName("QuorumPeer" + "[myid=" + getId() + "]" +                    // 1. 设置当前线程的名称
            cnxnFactory.getLocalAddress());

    LOG.debug("Starting quorum peer");
    try {
        jmxQuorumBean = new QuorumBean(this);
        MBeanRegistry.getInstance().register(jmxQuorumBean, null);       // 2. Wrap this QuorumPeer in a QuorumBean and register it with JMX
        for(QuorumServer s: getView().values()){                         // 3. Iterate over every server in the cluster view
            ZKMBeanInfo p;
            if (getId() == s.id) {
                p = jmxLocalPeerBean = new LocalPeerBean(this);
                try {                                                    // 4. Register the LocalPeerBean with JMX
                    MBeanRegistry.getInstance().register(p, jmxQuorumBean);
                } catch (Exception e) {
                    LOG.warn("Failed to register with JMX", e);
                    jmxLocalPeerBean = null;
                }
            } else {                                                     // 5. Peers other than this machine are registered with JMX too (as RemotePeerBean)
                p = new RemotePeerBean(s);
                try {
                    MBeanRegistry.getInstance().register(p, jmxQuorumBean);
                } catch (Exception e) {
                    LOG.warn("Failed to register with JMX", e);
                }
            }
        }
    } catch (Exception e) {
        LOG.warn("Failed to register with JMX", e);
        jmxQuorumBean = null;
    }

    try {
        /*
         * Main loop
         */
        while (running) {                                               // 6. The QuorumPeer stays in this while loop (typically LOOKING first, then LEADING/FOLLOWING)
            switch (getPeerState()) {
            case LOOKING:                                               // 7. The QuorumPeer is in the LOOKING state, searching for the Leader
                LOG.info("LOOKING, and myid is " + myid);

                if (Boolean.getBoolean("readonlymode.enabled")) {       // 8. Check whether the server runs in read-only mode
                    LOG.info("Attempting to start ReadOnlyZooKeeperServer");

                    // Create read-only server but don't start it immediately
                    final ReadOnlyZooKeeperServer roZk = new ReadOnlyZooKeeperServer(
                            logFactory, this,
                            new ZooKeeperServer.BasicDataTreeBuilder(),
                            this.zkDb);

                    // Instead of starting roZk immediately, wait some grace
                    // period before we decide we're partitioned.
                    //
                    // Thread is used here because otherwise it would require
                    // changes in each of election strategy classes which is
                    // unnecessary code coupling.
                    Thread roZkMgr = new Thread() {
                        public void run() {
                            try {
                                // lower-bound grace period to 2 secs
                                sleep(Math.max(2000, tickTime));
                                if (ServerState.LOOKING.equals(getPeerState())) {
                                    roZk.startup();
                                }
                            } catch (InterruptedException e) {
                                LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
                            } catch (Exception e) {
                                LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
                            }
                        }
                    };
                    try {
                        roZkMgr.start();                                 // 9. Two steps here: (a) the election algorithm was created via QuorumPeer.start() -> startLeaderElection() -> createElectionAlgorithm();
                        setBCVote(null);                                 //    (b) Election.lookForLeader() runs the election until it succeeds or an exception occurs
                        setCurrentVote(makeLEStrategy().lookForLeader());// 10. Run the leader election; this call may take a while
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception",e);
                        setPeerState(ServerState.LOOKING);
                    } finally {
                        // If the thread is in the the grace period, interrupt
                        // to come out of waiting.
                        roZkMgr.interrupt();
                        roZk.shutdown();
                    }
                } else {
                    try {                                                // 11. Two steps here: (a) the election algorithm was created via QuorumPeer.start() -> startLeaderElection() -> createElectionAlgorithm();
                        setBCVote(null);                                 //     (b) Election.lookForLeader() runs the election until it succeeds or an exception occurs
                        setCurrentVote(makeLEStrategy().lookForLeader());// 12. Run the leader election; this call may take a while
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception", e);
                        setPeerState(ServerState.LOOKING);
                    }
                }
                break;
            case OBSERVING:
                try {
                    LOG.info("OBSERVING, and myid is " + myid);
                    setObserver(makeObserver(logFactory));
                    observer.observeLeader();
                } catch (Exception e) {
                    LOG.warn("Unexpected exception",e );                        
                } finally {
                    observer.shutdown();
                    setObserver(null);
                    setPeerState(ServerState.LOOKING);
                }
                break;
            case FOLLOWING:
                try {
                    LOG.info("FOLLOWING, and myid is " + myid);         // 13. 最上层还是 QuorumPeer
                    setFollower(makeFollower(logFactory));              // 14. 初始化 follower, 在 Follower 里面引用 FollowerZooKeeperServer
                    follower.followLeader();                            // 15. 带用 follower.followLeader, 程序会阻塞在这里
                } catch (Exception e) {
                    LOG.warn("Unexpected exception",e);
                } finally {
                    follower.shutdown();
                    setFollower(null);
                    setPeerState(ServerState.LOOKING);
                }
                break;
            case LEADING:
                LOG.info("LEADING, and myid is " + myid);
                try {
                    setLeader(makeLeader(logFactory));                  // 16. Initialize the Leader object
                    leader.lead();                                      // 17. The Leader thread blocks inside lead()
                    setLeader(null);
                } catch (Exception e) {
                    LOG.warn("Unexpected exception",e);
                } finally {
                    if (leader != null) {
                        leader.shutdown("Forcing shutdown");
                        setLeader(null);
                    }
                    setPeerState(ServerState.LOOKING);
                }
                break;
            }
        }
    } finally {
        LOG.warn("QuorumPeer main thread exited");
        try {
            MBeanRegistry.getInstance().unregisterAll();
        } catch (Exception e) {
            LOG.warn("Failed to unregister with JMX", e);
        }
        jmxQuorumBean = null;
        jmxLocalPeerBean = null;
    }
}
9. FastLeaderElection.lookForLeader

ZooKeeper uses FastLeaderElection as the default leader election algorithm. Let's go straight to the code.

/**
 * Starts a new round of leader election. Whenever our QuorumPeer
 * changes its state to LOOKING, this method is invoked, and it
 * sends notifications to all other peers.
 */
// Every QuorumPeer calls this method on startup; leader election is completed through this call
public Vote lookForLeader() throws InterruptedException {
    LOG.info("QuorumPeer {" + self  + "} is LOOKING !");
    try {
        self.jmxLeaderElectionBean = new LeaderElectionBean();
        MBeanRegistry.getInstance().register(                               // 1. Register the jmxLeaderElectionBean with JMX (side note: with custom ClassLoaders / hot redeployment, remember to unregister things such as SQL Drivers registered into the JDK core)
                self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
    } catch (Exception e) {
        LOG.warn("Failed to register with JMX", e);
        self.jmxLeaderElectionBean = null;
    }
    if (self.start_fle == 0) {
       self.start_fle = System.currentTimeMillis();
    }
    try {
        HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();                    // 2. The votes received in this election round

        HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

        int notTimeout = finalizeWait;

        synchronized(this){
            LOG.info("logicalclock :" + logicalclock);
            logicalclock++;                                               // 3. Bump the logical clock (election round) and propose our own myid, zxid, and epoch
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
        }

        LOG.info("New election. My id =  " + self.getId() + ", proposed zxid=0x " + Long.toHexString(proposedZxid));
        LOG.info("sendNotifications to QuorumPeers ");
        sendNotifications();                                              // 4. First send our election notification to the nodes in the cluster (including ourselves)

        /*
         * Loop in which we exchange notifications until we find a leader
         */

        while ((self.getPeerState() == ServerState.LOOKING) && (!stop)){ // 5. As long as this QuorumPeer is LOOKING, keep looping until the election succeeds
            /*
             * Remove next notification from queue, times out after 2 times
             * the termination time
             */                                                          // 6. Fetch the next vote notification (sent to us by the other peers)
            Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);
            LOG.info("Notification:"+n);

            /*
             * Sends more notifications if haven't received enough.
             * Otherwise processes new notification.
             */
            if(n == null){                                               // 7. n == null may mean the cluster nodes are not yet fully connected to each other
                if(manager.haveDelivered()){
                    sendNotifications();
                } else {
                    manager.connectAll();                                // 8. Connect to every machine in the cluster; each successful connection gets a SendWorker/RecvWorker pair
                }

                /*
                 * Exponential backoff
                 */
                int tmpTimeOut = notTimeout*2;
                notTimeout = (tmpTimeOut < maxNotificationInterval?
                        tmpTimeOut : maxNotificationInterval);
                LOG.info("Notification time out: " + notTimeout);
            }
            else if(self.getVotingView().containsKey(n.sid)) {           // 9. Handle an election notification sent by a node in the voting view
                /*
                 * Only proceed if the vote comes from a replica in the
                 * voting view.
                 */
                switch (n.state) {
                case LOOKING:                                            // 10. The sender is also LOOKING for a Leader
                    // If notification > current, replace and send messages out
                    if (n.electionEpoch > logicalclock) {                // 11. The notification's electionEpoch (election round) is newer than ours
                        logicalclock = n.electionEpoch;
                        recvset.clear();                                 // 12. totalOrderPredicate compares the received vote with our own (in order: epoch, then zxid, then myid)
                        boolean totalOrderPredicate = totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
                        LOG.info("n.leader:" + n.leader + ", n.zxid:"+ n.zxid +", n.peerEpoch:"+n.peerEpoch +", getInitId():"+getInitId() +", getInitLastLoggedZxid():"+getInitLastLoggedZxid() + ", getPeerEpoch():"+getPeerEpoch());
                        LOG.info("totalOrderPredicate:"+totalOrderPredicate);
                        if(totalOrderPredicate) {                        // 13. The received vote wins; it overwrites our local proposal
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                        } else {
                            updateProposal(getInitId(),
                                    getInitLastLoggedZxid(),
                                    getPeerEpoch());
                        }
                        sendNotifications();                             // 14. Our proposal has changed, so broadcast the updated election notification
                    } else if (n.electionEpoch < logicalclock) {         // 15. The notification's election round is older than ours; drop it
                        LOG.info("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                + Long.toHexString(n.electionEpoch)
                                + ", logicalclock=0x" + Long.toHexString(logicalclock));
                        break;                                           // 16. Same election round as ours: totalOrderPredicate compares the received vote with our own (epoch, then zxid, then myid)
                    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                            proposedLeader, proposedZxid, proposedEpoch)) {
                        updateProposal(n.leader, n.zxid, n.peerEpoch);   // 17. The received notification wins; overwrite our local proposal with its contents
                        sendNotifications();                             // 18. Our proposal has changed, so broadcast it again
                    }                                                    // 19. Record the received vote in recvset; it is used later for the final "majority" check
                    Vote vote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
                    LOG.info("Receive Notification: " + n);
                    LOG.info("Adding vote: " + vote);
                    recvset.put(n.sid, vote);

                                                                         // 20. Build our current Vote and use termPredicate to decide whether the election can end
                    Vote selfVote = new Vote(proposedLeader, proposedZxid, logicalclock, proposedEpoch);
                    boolean termPredicate = termPredicate(recvset,selfVote );    // 21. Has the election terminated? (by default, the simple-majority rule)
                    LOG.info("recvset:"+recvset +", || selfVote: " + selfVote);
                    LOG.info("termPredicate:"+termPredicate);
                    if (termPredicate) {                                         // 22. A quorum agrees; the election has effectively succeeded

                        // Verify if there is any change in the proposed leader
                        while((n = recvqueue.poll(finalizeWait,
                                TimeUnit.MILLISECONDS)) != null){                // 23. Drain recvqueue once more for any late notification
                            boolean totalOrderPredicate2 = totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    proposedLeader, proposedZxid, proposedEpoch);// 24. Check whether our proposal would need updating
                            LOG.info("totalOrderPredicate2:"+totalOrderPredicate2);
                            if(totalOrderPredicate2){                            // 25. A better vote arrived, so the election is not settled yet
                                recvqueue.put(n);                                // 26. Put the notification back into recvqueue and go around the outer loop again
                                break;
                            }
                        }

                        /*
                         * This predicate is true once we don't read any new
                         * relevant message from the reception queue
                         */
                        if (n == null) {                                        // 27. n == null: the leader can now be settled; update our state
                            self.setPeerState((proposedLeader == self.getId()) ?
                                    ServerState.LEADING: learningState());      // 28. If the elected leader is this node, set our state to LEADING, otherwise FOLLOWING/OBSERVING

                            Vote endVote = new Vote(proposedLeader,             // 29. Assemble the final vote of this election
                                                    proposedZxid,
                                                    logicalclock,
                                                    proposedEpoch);
                            leaveInstance(endVote);                             // 30. The election is over; clear recvqueue
                            return endVote;                                     // 31. Return to the caller, which then runs follower.followLeader() / leader.lead()
                        }
                    }
                    break;
                case OBSERVING:                                                // 32. OBSERVING peers do not take part in leader elections
                    LOG.debug("Notification from observer: " + n.sid);
                    break;
                case FOLLOWING:
                case LEADING:
                    /*
                     * Consider all notifications from the same epoch
                     * together.
                     */
                    if(n.electionEpoch == logicalclock){
                        recvset.put(n.sid, new Vote(n.leader,                  // 33. Still record the vote in the set
                                                      n.zxid,
                                                      n.electionEpoch,
                                                      n.peerEpoch));
                       
                        if(ooePredicate(recvset, outofelection, n)) {          // 34. Check whether the election is over and the Leader confirmed
                            self.setPeerState((n.leader == self.getId()) ?
                                    ServerState.LEADING: learningState());     // 35. Update this QuorumPeer's state here (LEADING / FOLLOWING)

                            Vote endVote = new Vote(n.leader, 
                                    n.zxid, 
                                    n.electionEpoch, 
                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }

                    /*
                     * Before joining an established ensemble, verify
                     * a majority is following the same leader.
                     */
                    outofelection.put(n.sid, new Vote(n.version,
                                                        n.leader,
                                                        n.zxid,
                                                        n.electionEpoch,
                                                        n.peerEpoch,
                                                        n.state));
       
                    if(ooePredicate(outofelection, outofelection, n)) {
                        synchronized(this){
                            logicalclock = n.electionEpoch;
                            self.setPeerState((n.leader == self.getId()) ?
                                    ServerState.LEADING: learningState());
                        }
                        Vote endVote = new Vote(n.leader,
                                                n.zxid,
                                                n.electionEpoch,
                                                n.peerEpoch);
                        leaveInstance(endVote);
                        return endVote;
                    }
                    break;
                default:
                    LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                            n.state, n.sid);
                    break;
                }
            } else {
                LOG.warn("Ignoring notification from non-cluster member " + n.sid);
            }
        }
        return null;
    } finally {
        try {
            if(self.jmxLeaderElectionBean != null){
                MBeanRegistry.getInstance().unregister(
                        self.jmxLeaderElectionBean);
            }
        } catch (Exception e) {
            LOG.warn("Failed to unregister with JMX", e);
        }
        self.jmxLeaderElectionBean = null;
    }
}
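Two helpers carry the core of the algorithm above: totalOrderPredicate decides which of two votes wins (comparing epoch first, then zxid, then myid), and termPredicate checks whether a quorum of the received votes agrees with the current proposal. A condensed sketch of both (the 3.4.6 versions additionally check the vote's weight and delegate the quorum test to the configured QuorumVerifier):

// Which vote wins? Compare peer epoch first, then zxid, then server id.
boolean totalOrderPredicateSketch(long newId, long newZxid, long newEpoch,
                                  long curId, long curZxid, long curEpoch) {
    return (newEpoch > curEpoch)
        || (newEpoch == curEpoch
            && (newZxid > curZxid
                || (newZxid == curZxid && newId > curId)));
}

// Has the election terminated? Count the received votes that match our
// current proposal and apply the simple-majority rule.
boolean termPredicateSketch(java.util.Map<Long, Vote> votes, Vote proposal, int votingMembers) {
    int agreeing = 0;
    for (Vote v : votes.values()) {
        if (proposal.equals(v)) {
            agreeing++;
        }
    }
    return agreeing > votingMembers / 2;
}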

Once FastLeaderElection has determined the Leader, the Leader and the Followers go their separate ways; the two paths are covered separately below.

10. Leader.lead(), Part 1

On the Leader side, lead() drives the interaction with the Followers (the process mainly involves Follower, Learner, LearnerHandler, and Leader, and it blocks at several points).

self.tick = 0;
zk.loadData();                                                      // 2. Restore the zxid from the snapshot and txn log
                                                                    // 3. Build the Leader's state summary
leaderStateSummary = new StateSummary(self.getCurrentEpoch(), zk.getLastProcessedZxid());
LOG.info("leaderStateSummary:" + leaderStateSummary);
// Start thread that waits for connection requests from 
// new followers.
cnxAcceptor = new LearnerCnxAcceptor();                             // 4. The LearnerCnxAcceptor listens on the leader port; whenever a follower connects, it spawns a LearnerHandler to serve it
LOG.info("cnxAcceptor start");
cnxAcceptor.start();

readyToStart = true;                                                // 5. getAcceptedEpoch() was restored from the acceptedEpoch file at startup
LOG.info("self.getId() :" + self.getId() + ",  self.getAcceptedEpoch():" +  self.getAcceptedEpoch());
                                                                    // 6. Wait until enough Followers have checked in to prove we really are the leader; the lead thread may block in a wait loop here
                                                                    // 7. Meanwhile each LearnerHandler receives a FOLLOWERINFO packet containing that follower's acceptedEpoch
long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());

On the Leader side, a LearnerCnxAcceptor accepts connections from the other nodes in the cluster; each established connection is served by its own LearnerHandler. Meanwhile the Leader blocks inside getEpochToPropose() until more than half of the Followers have checked in; the calls that the LearnerHandlers make into Leader.getEpochToPropose() are what release the wait.
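A condensed, self-contained sketch of the wait-for-quorum pattern inside getEpochToPropose (the real method also enforces the initLimit * tickTime deadline, requires the leader's own id to be in the set, and uses the configured QuorumVerifier):

// Each LearnerHandler (and the leader itself) checks in with its acceptedEpoch;
// the first quorum fixes the new epoch and releases every waiting thread.
class EpochProposer {
    private final java.util.Set<Long> connectingFollowers = new java.util.HashSet<Long>();
    private final int quorumSize;      // e.g. (ensemble size / 2) + 1
    private long epoch = 0;
    private boolean waitingForNewEpoch = true;

    EpochProposer(int quorumSize) { this.quorumSize = quorumSize; }

    synchronized long getEpochToPropose(long sid, long lastAcceptedEpoch)
            throws InterruptedException {
        if (waitingForNewEpoch) {
            if (lastAcceptedEpoch >= epoch) {
                epoch = lastAcceptedEpoch + 1;   // one past the largest epoch seen so far
            }
            connectingFollowers.add(sid);
            if (connectingFollowers.size() >= quorumSize) {
                waitingForNewEpoch = false;      // quorum reached: the epoch is now fixed
                notifyAll();                     // release every blocked caller
            } else {
                while (waitingForNewEpoch) {     // the real code bounds this wait with a deadline
                    wait();
                }
            }
        }
        return epoch;
    }
}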

11. The Follower Connects to the Leader

The program first locates the Leader via findLeader().

/**
 * Returns the address of the node we think is the leader.
 */
// Return the Leader's network address
protected InetSocketAddress findLeader() {
    InetSocketAddress addr = null;
    // Find the leader by id
    Vote current = self.getCurrentVote();            // The QuorumPeer's current vote records which node we elected as Leader
    for (QuorumServer s : self.getView().values()) {
        if (s.id == current.getId()) {
            addr = s.addr;                          // Found the Leader's addr
            break;
        }
    }
    if (addr == null) {
        LOG.warn("Couldn't find the leader with id = "
                + current.getId());
    }
    return addr;
}   

With the Leader's address in hand, the Follower can establish the connection.

/**
 * Establish a connection with the Leader found by findLeader. Retries
 * 5 times before giving up. 
 * @param addr - the address of the Leader to connect to.
 * @throws IOException - if the socket connection fails on the 5th attempt
 * @throws ConnectException
 * @throws InterruptedException
 */
// Connect to the leader; once the connection is up, a LearnerHandler on the Leader side handles the communication
protected void connectToLeader(InetSocketAddress addr) throws IOException, ConnectException, InterruptedException {

    sock = new Socket();        
    sock.setSoTimeout(self.tickTime * self.initLimit);          // 1. This SoTimeout matters: if InputStream.read blocks longer than this, a SocketTimeoutException is thrown
    for (int tries = 0; tries < 5; tries++) {                   // 2. Try to connect to the Leader up to 5 times; if it still fails, the exception propagates all the way up to QuorumPeer.run(), which starts a new leader election
        try {
            sock.connect(addr, self.tickTime * self.syncLimit); // 3. Connect to the leader
            sock.setTcpNoDelay(nodelay);                        // 4. Set TCP_NODELAY, i.e. disable Nagle's algorithm (no batching of small packets at the TCP layer)
            break;
        } catch (IOException e) {
            if (tries == 4) {
                LOG.error("Unexpected exception",e);
                throw e;
            } else {
                LOG.warn("Unexpected exception, tries="+tries+
                        ", connecting to " + addr,e);
                sock = new Socket();
                sock.setSoTimeout(self.tickTime * self.initLimit);
            }
        }
        Thread.sleep(1000);
    }                                                           // 5. Wrap the socket streams
    leaderIs = BinaryInputArchive.getArchive(new BufferedInputStream(sock.getInputStream()));
    bufferedOutput = new BufferedOutputStream(sock.getOutputStream());
    leaderOs = BinaryOutputArchive.getArchive(bufferedOutput); //  6. Wrap the output stream
} 
} 

After connecting to the Leader, the Follower calls registerWithLeader() to negotiate and confirm the epoch with the Leader.

12. Learner.registerWithLeader, Part 1

registerWithLeader(Leader.FOLLOWERINFO) packages the Follower's zxid, myid, and related info and sends it to the Leader.

LOG.info("registerWithLeader:" + pktType);
/*
 * Send follower info, including last zxid and sid
 */
long lastLoggedZxid = self.getLastLoggedZxid();                     // 获取 Follower 的最后处理的 zxid
QuorumPacket qp = new QuorumPacket();                
qp.setType(pktType);                                                // 若是 Follower ,则当前的角色是  Leader.FOLLOWERINFO
qp.setZxid(ZxidUtils.makeZxid(self.getAcceptedEpoch(), 0));         // Follower 的 lastZxid 的值

/*
 * Add sid to payload
 */
LearnerInfo li = new LearnerInfo(self.getId(), 0x10000);            // 将 Follower 的信息封装成 LearnerInfo
LOG.info("li:" + li);

ByteArrayOutputStream bsid = new ByteArrayOutputStream();
BinaryOutputArchive boa = BinaryOutputArchive.getArchive(bsid);
boa.writeRecord(li, "LearnerInfo");
qp.setData(bsid.toByteArray());                                     // 在 QuorumPacket 里面添加 Follower 的信息
LOG.info("qp:" + qp);

writePacket(qp, true);                                              // 发送 QuorumPacket 包括 learnerInfo 与 Leader.FOLLOWERINFO, 通过 self.getAcceptedEpoch() 构成的 zxid
13. LearnerHandler.run, Part 1

Handling the Leader.FOLLOWERINFO packet sent by the Follower.

tickOfNextAckDeadline = leader.self.tick + leader.self.initLimit + leader.self.syncLimit;
LOG.info("tickOfNextAckDeadline : " + tickOfNextAckDeadline);
                                                                            // 1. Wrap the socket to the Follower in I/O archives
ia = BinaryInputArchive.getArchive(new BufferedInputStream(sock.getInputStream()));
bufferedOutput = new BufferedOutputStream(sock.getOutputStream());
oa = BinaryOutputArchive.getArchive(bufferedOutput);
                                                                            // 2. Wait for the Follower to send a packet
QuorumPacket qp = new QuorumPacket();
long a1 = System.currentTimeMillis();
ia.readRecord(qp, "packet");                                                // 3. Read the FOLLOWERINFO packet sent by the Follower
LOG.info("System.currentTimeMillis() - a1 : " + (System.currentTimeMillis() - a1));
LOG.info("qp:" + qp);
                                                                            // 4. Anything other than FOLLOWERINFO/OBSERVERINFO should not occur here
if(qp.getType() != Leader.FOLLOWERINFO && qp.getType() != Leader.OBSERVERINFO){
    LOG.error("First packet " + qp.toString() + " is not FOLLOWERINFO or OBSERVERINFO!");
    return;
}
byte learnerInfoData[] = qp.getData();                                      // 5. Read the learner's payload

LOG.info("learnerInfoData :" + Arrays.toString(learnerInfoData));           // 6. learnerInfoData carries the Follower/Observer's info
if (learnerInfoData != null) {
    if (learnerInfoData.length == 8) {
        ByteBuffer bbsid = ByteBuffer.wrap(learnerInfoData);
        this.sid = bbsid.getLong();
    } else {
        LearnerInfo li = new LearnerInfo();                                 // 7. Deserialize the LearnerInfo
        ByteBufferInputStream.byteBuffer2Record(ByteBuffer.wrap(learnerInfoData), li);
        LOG.info("li :" + li);
        this.sid = li.getServerid();                                        // 8. Extract the Follower's myid
        this.version = li.getProtocolVersion();                             // 9. ... and the protocol version
    }
} else {
    this.sid = leader.followerCounter.getAndDecrement();
}

LOG.info("Follower sid: " + sid + " : info : " + leader.self.quorumPeers.get(sid));
            
if (qp.getType() == Leader.OBSERVERINFO) {
      learnerType = LearnerType.OBSERVER;
}            

long lastAcceptedEpoch = ZxidUtils.getEpochFromZxid(qp.getZxid());          // 10. Extract the Follower's accepted election epoch from the zxid

LOG.info("qp : " + qp + ", lastAcceptedEpoch : " + lastAcceptedEpoch);

long peerLastZxid;
StateSummary ss = null;
long zxid = qp.getZxid();
long newEpoch = leader.getEpochToPropose(this.getSid(), lastAcceptedEpoch); // 11. Add the Follower's epoch to connectingFollowers; this blocks until a quorum of participants has called getEpochToPropose

The handler ends by calling leader.getEpochToPropose(); once more than half of the cluster has checked in, the call unblocks.
Once unblocked, the Leader sends a Leader.LEADERINFO packet to the Follower.

byte ver[] = new byte[4];
ByteBuffer.wrap(ver).putInt(0x10000);                                   // 14. Build the packet describing the Leader (LEADERINFO)
QuorumPacket newEpochPacket = new QuorumPacket(Leader.LEADERINFO, ZxidUtils.makeZxid(newEpoch, 0), ver, null);
LOG.info("newEpochPacket:" + newEpochPacket);
oa.writeRecord(newEpochPacket, "packet");                               // 15. Send the Leader's info to the Follower / Observer
bufferedOutput.flush();

On receiving Leader.LEADERINFO, the Follower replies with a Leader.ACKEPOCH packet.

// we are connected to a 1.0 server so accept the new epoch and read the next packet
leaderProtocolVersion = ByteBuffer.wrap(qp.getData()).getInt();
LOG.info("leaderProtocolVersion:" + leaderProtocolVersion);
byte epochBytes[] = new byte[4];
final ByteBuffer wrappedEpochBytes = ByteBuffer.wrap(epochBytes);

LOG.info("newEpoch:" + newEpoch + ", self.getAcceptedEpoch():" + self.getAcceptedEpoch());
if (newEpoch > self.getAcceptedEpoch()) {                       // If the Follower's accepted epoch is behind the Leader's, adopt the Leader's
    wrappedEpochBytes.putInt((int)self.getCurrentEpoch());
    self.setAcceptedEpoch(newEpoch);
} else if (newEpoch == self.getAcceptedEpoch()) {
    // since we have already acked an epoch equal to the leaders, we cannot ack
    // again, but we still need to send our lastZxid to the leader so that we can
    // sync with it if it does assume leadership of the epoch.
    // the -1 indicates that this reply should not count as an ack for the new epoch
    wrappedEpochBytes.putInt(-1);
} else {                                                         // If Follower.acceptedEpoch > Leader's newEpoch, the preceding election must have gone wrong
    throw new IOException("Leaders epoch, " + newEpoch + " is less than accepted epoch, " + self.getAcceptedEpoch());
}                                                                // Reply to the Leader.LEADERINFO packet with Leader.ACKEPOCH, carrying our lastLoggedZxid
QuorumPacket ackNewEpoch = new QuorumPacket(Leader.ACKEPOCH, lastLoggedZxid, epochBytes, null);

LOG.info("ackNewEpoch:" + ackNewEpoch);
writePacket(ackNewEpoch, true);                                  // Send the ACKEPOCH packet in answer to the Leader's LEADERINFO
return ZxidUtils.makeZxid(newEpoch, 0);

Next comes the LearnerHandler's handling of Leader.ACKEPOCH.

QuorumPacket ackEpochPacket = new QuorumPacket();
ia.readRecord(ackEpochPacket, "packet");                                // 16. The Leader reads the ACKEPOCH packet from the Follower

LOG.info("ackEpochPacket:" +ackEpochPacket);                            // 17. Having just sent the Leader's info, now read the confirming ack

if (ackEpochPacket.getType() != Leader.ACKEPOCH) {
    LOG.error(ackEpochPacket.toString()
            + " is not ACKEPOCH");
    return;
}
ByteBuffer bbepoch = ByteBuffer.wrap(ackEpochPacket.getData());
ss = new StateSummary(bbepoch.getInt(), ackEpochPacket.getZxid());
LOG.info("ss : " + ss);
leader.waitForEpochAck(this.getSid(), ss);                              // 18. Wait here for the Followers to reply with ACKEPOCH (again, a quorum suffices)

Next comes the data synchronization between the Leader and the Follower.

14. Synchronizing Data from the Leader to the Follower

This involves the following concepts (summarized in the sketch after the list):

1. committedLog on the Leader caches the most recent (up to 500) committed Proposals
2. If the Follower has processed Proposals beyond maxCommittedLog, the Follower must TRUNC its log back to maxCommittedLog
3. If the Follower's last Proposal is below maxCommittedLog but at or above minCommittedLog, the Leader sends the Follower the Proposals it is missing (a DIFF)
4. If the Follower's last Proposal is below minCommittedLog, the Leader sends Leader.SNAP and streams its entire serialized database to the Follower
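Condensed into a sketch, the decision looks roughly like this (the real code below also handles the empty-committedLog case and in-flight proposals):

// Rough sketch of how the LearnerHandler picks the sync mode from peerLastZxid
// and the committedLog window [minCommittedLog, maxCommittedLog].
int chooseSyncMode(long peerLastZxid, long minCommittedLog, long maxCommittedLog) {
    if (peerLastZxid > maxCommittedLog) {
        return Leader.TRUNC;   // the follower is ahead: roll it back to maxCommittedLog
    } else if (peerLastZxid >= minCommittedLog) {
        return Leader.DIFF;    // inside the window: replay the missing proposals
    } else {
        return Leader.SNAP;    // too far behind: transfer a full snapshot
    }
}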

Now let's look at the code itself.

/* we are sending the diff check if we have proposals in memory to be able to 
 * send a diff to the 
 */ 
ReentrantReadWriteLock lock = leader.zk.getZKDatabase().getLogLock();
ReadLock rl = lock.readLock();
try {
    rl.lock();                                                             // 20. The Leader caches recently committed Requests in ZKDatabase.committedLog (done in FinalRequestProcessor.processRequest); their zxids lie between minCommittedLog and maxCommittedLog
    final long maxCommittedLog = leader.zk.getZKDatabase().getmaxCommittedLog();
    final long minCommittedLog = leader.zk.getZKDatabase().getminCommittedLog();

    LOG.info("sid:" + sid + ", maxCommittedLog:" + Long.toHexString(maxCommittedLog)
                            + ", minCommittedLog:" +Long.toHexString(minCommittedLog)
                            + " peerLastZxid=0x"+Long.toHexString(peerLastZxid)
    );


    /**
     * http://www.jianshu.com/p/4cc1040b6a14
     * Replaying the Leader's committed Requests:
     * 1) If peerLastZxid lies between min and max, loop over the proposals:
     *      a) if a proposal's zxid <= peerLastZxid, the peer has already applied it; skip it
     *      b) once a proposal's zxid > peerLastZxid, the remaining proposals must be committed on the peer;
     *         if the peer holds proposals the leader has not seen, first send a TRUNC to roll the peer back,
     *         then send the needed proposals and commits
     * 2) If peerLastZxid is greater than max, the peer TRUNCs everything beyond max
     * 3) Anything else (e.g. a freshly added node) is handled with a SNAP: synchronize by shipping a full snapshot
     */

    LinkedList<Proposal> proposals = leader.zk.getZKDatabase().getCommittedLog();           // 21. Get the recently committed Requests on the Leader
    LOG.info("proposals:"+proposals);
    if (proposals.size() != 0) {                                                            // 22. proposals holds the already-committed Proposals
        LOG.debug("proposal size is {}", proposals.size());

        if ((maxCommittedLog >= peerLastZxid) && (minCommittedLog <= peerLastZxid)) {       // 23. If this holds, the Follower is within the committedLog window (fewer than ~500 proposals behind the Leader)
            LOG.info("sid:" + sid + ", maxCommittedLog:" + Long.toHexString(maxCommittedLog)
                    + ", minCommittedLog:" +Long.toHexString(minCommittedLog)
                    + " peerLastZxid=0x"+Long.toHexString(peerLastZxid)
            );
            LOG.debug("Sending proposals to follower");

            // as we look through proposals, this variable keeps track of previous
            // proposal Id.
            long prevProposalZxid = minCommittedLog;

            // Keep track of whether we are about to send the first packet.
            // Before sending the first packet, we have to tell the learner
            // whether to expect a trunc or a diff
            boolean firstPacket=true;

            // If we are here, we can use committedLog to sync with
            // follower. Then we only need to decide whether to
            // send trunc or not
            packetToSend = Leader.DIFF;
            zxidToSend = maxCommittedLog;

            for (Proposal propose: proposals) {
                // skip the proposals the peer already has                                 // 24. This Proposal has already been applied by the peer; continue
                if (propose.packet.getZxid() <= peerLastZxid) {
                    prevProposalZxid = propose.packet.getZxid();                           // 25. Already applied; remember it as prevProposalZxid and move to the next Proposal
                    continue;
                } else {
                    // If we are sending the first packet, figure out whether to trunc
                    // in case the follower has some proposals that the leader doesn't
                    if (firstPacket) {                                                     // 26. Before sending the first proposal, check whether the follower is ahead of the Leader
                        firstPacket = false;
                        // Does the peer have some proposals that the leader hasn't seen yet
                        if (prevProposalZxid < peerLastZxid) {                             // 27. The follower holds transactions the leader has not seen; send a TRUNC back to prevProposalZxid before the diff
                            // send a trunc message before sending the diff
                            packetToSend = Leader.TRUNC;                                        
                            zxidToSend = prevProposalZxid;
                            updates = zxidToSend;
                        }
                    }
                    queuePacket(propose.packet);                                           // 28. Queue the proposal packet for sending
                    QuorumPacket qcommit = new QuorumPacket(Leader.COMMIT, propose.packet.getZxid(),
                            null, null);
                    queuePacket(qcommit);                                                  // 29. Immediately follow with a COMMIT so the Follower applies the request
                }
            }
        } else if (peerLastZxid > maxCommittedLog) {                                       // 30. The follower has processed more transactions than the leader; send a TRUNC to bring it back in sync
            LOG.debug("Sending TRUNC to follower zxidToSend=0x{} updates=0x{}",
                    Long.toHexString(maxCommittedLog),
                    Long.toHexString(updates));

            LOG.info("sid:" + sid + ", maxCommittedLog:" + Long.toHexString(maxCommittedLog)
                    + ", minCommittedLog:" +Long.toHexString(minCommittedLog)
                    + " peerLastZxid=0x"+Long.toHexString(peerLastZxid)
                    + ", updates : " + Long.toHexString(updates)
            );

            packetToSend = Leader.TRUNC;                                                   // 31. Send TRUNC; the Follower will drop the Requests that the Leader does not have
            zxidToSend = maxCommittedLog;                                                  // 32. maxCommittedLog is the zxid of the largest Request the Leader has committed
            updates = zxidToSend;
        } else {
            LOG.warn("Unhandled proposal scenario");
        }                                                                                  // 33. If the Follower's and the Leader's lastZxid are equal, send an (empty) DIFF
    } else if (peerLastZxid == leader.zk.getZKDatabase().getDataTreeLastProcessedZxid()) {
        // The leader may recently take a snapshot, so the committedLog
        // is empty. We don't need to send snapshot if the follow
        // is already sync with in-memory db.
        LOG.info("committedLog is empty but leader and follower "
                        + "are in sync, zxid=0x{}",
                Long.toHexString(peerLastZxid));

        LOG.info("sid:" + sid + ", maxCommittedLog:" + Long.toHexString(maxCommittedLog)
                + ", minCommittedLog:" +Long.toHexString(minCommittedLog)
                + " peerLastZxid=0x"+Long.toHexString(peerLastZxid)
        );

        packetToSend = Leader.DIFF;
        zxidToSend = peerLastZxid;
    } else {
        // just let the state transfer happen
        LOG.debug("proposals is empty");
    }               


    LOG.info("Sending " + Leader.getPacketType(packetToSend));
    leaderLastZxid = leader.startForwarding(this, updates);                               // 34. The leader also forwards Proposals that are quorum-ACKed but not yet in the committedLog to the Learner (this is where the subtle details live)
    LOG.info("leaderLastZxid : " + leaderLastZxid);

} finally {
    rl.unlock();
} 

To finish the message exchange, the Leader sends one last packet, Leader.NEWLEADER:

 QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER, ZxidUtils.makeZxid(newEpoch, 0), null, null);
 LOG.info("newLeaderQP:" + newLeaderQP);
 if (getVersion() < 0x10000) {
    oa.writeRecord(newLeaderQP, "packet");
} else {
    queuedPackets.add(newLeaderQP);                                                       // 36. Add the Leader.NEWLEADER packet to the send queue (note: the queue's sender thread has not been started yet)
}
bufferedOutput.flush();
15. The Follower Handles Data Synchronization with the Leader
synchronized (zk) {
    if (qp.getType() == Leader.DIFF) {                              // A DIFF packet: the Follower will catch up via the proposals that follow (sent when peerLastZxid is within the committedLog window)
        LOG.info("Getting a diff from the leader 0x" + Long.toHexString(qp.getZxid()));                
    }
    else if (qp.getType() == Leader.SNAP) {                         // A SNAP packet: copy a full image of the database from the leader (the Follower is too far behind the Leader's committedLog window)
        LOG.info("Getting a snapshot from leader");
        // The leader is going to dump the database
        // clear our own database and read
        zk.getZKDatabase().clear();
        zk.getZKDatabase().deserializeSnapshot(leaderIs);           // Deserialize the DataTree from the InputStream
        String signature = leaderIs.readString("signature");        // Read the String tagged "signature"
        if (!signature.equals("BenWasHere")) {
            LOG.error("Missing signature. Got " + signature);
            throw new IOException("Missing signature");                   
        }
    } else if (qp.getType() == Leader.TRUNC) {                     // TRUNC: roll the log back to qp.getZxid() (the Follower has processed more transactions than the Leader)
        //we need to truncate the log to the lastzxid of the leader
        LOG.warn("Truncating log to get in sync with the leader 0x" + Long.toHexString(qp.getZxid()));
        boolean truncated=zk.getZKDatabase().truncateLog(qp.getZxid());
        LOG.info("truncated:" + truncated + ", qp.getZxid():" + qp.getZxid());
        if (!truncated) {
            // not able to truncate the log
            LOG.error("Not able to truncate the log "
                    + Long.toHexString(qp.getZxid()));
            System.exit(13);
        }

    }
    else {
        LOG.error("Got unexpected packet from leader "
                + qp.getType() + " exiting ... " );
        System.exit(13);

    }
    zk.getZKDatabase().setlastProcessedZxid(qp.getZxid());          // The DataTree may have just been read from the Leader's snapshot stream, so set lastProcessedZxid explicitly here
    zk.createSessionTracker();                                      // The Learner (Follower/Observer) creates its SessionTracker (LearnerSessionTracker)
    
    long lastQueued = 0;

    // in V1.0 we take a snapshot when we get the NEWLEADER message, but in pre V1.0
    // we take the snapshot at the UPDATE, since V1.0 also gets the UPDATE (after the NEWLEADER)
    // we need to make sure that we don't take the snapshot twice.
    boolean snapshotTaken = false;
    // we are now going to start getting transactions to apply followed by an UPTODATE
    outerLoop:
    while (self.isRunning()) {                                     // self.isRunning() is true by default
        readPacket(qp);

        LOG.info("qp:" + qp);

        switch(qp.getType()) {
        case Leader.PROPOSAL:                                     // PROPOSAL: add the txn to the pending list
            PacketInFlight pif = new PacketInFlight();
            pif.hdr = new TxnHeader();
            pif.rec = SerializeUtils.deserializeTxn(qp.getData(), pif.hdr);         // Deserialize the transaction body
            LOG.info("pif:" + pif);
            if (pif.hdr.getZxid() != lastQueued + 1) {
                LOG.warn("Got zxid 0x"
                        + Long.toHexString(pif.hdr.getZxid())
                        + " expected 0x"
                        + Long.toHexString(lastQueued + 1));
            }
            lastQueued = pif.hdr.getZxid();
            packetsNotCommitted.add(pif);
            break;
        case Leader.COMMIT:                                        // COMMIT: hand the transaction to the server for processing
            LOG.info("snapshotTaken :" + snapshotTaken);
            if (!snapshotTaken) {
                pif = packetsNotCommitted.peekFirst();
                if (pif.hdr.getZxid() != qp.getZxid()) {
                    LOG.warn("Committing " + qp.getZxid() + ", but next proposal is " + pif.hdr.getZxid());
                } else {
                    zk.processTxn(pif.hdr, pif.rec);               // Apply the transaction
                    packetsNotCommitted.remove();
                }
            } else {
                packetsCommitted.add(qp.getZxid());
            }
            break;
        case Leader.INFORM:                                                         // Only Observers receive INFORM packets
            /*
             * Only observer get this type of packet. We treat this
             * as receiving PROPOSAL and COMMMIT.
             */
            PacketInFlight packet = new PacketInFlight();
            packet.hdr = new TxnHeader();
            packet.rec = SerializeUtils.deserializeTxn(qp.getData(), packet.hdr);
            LOG.info("packet:" + packet);
            // Log warning message if txn comes out-of-order
            if (packet.hdr.getZxid() != lastQueued + 1) {
                LOG.warn("Got zxid 0x"
                        + Long.toHexString(packet.hdr.getZxid())
                        + " expected 0x"
                        + Long.toHexString(lastQueued + 1));
            }
            lastQueued = packet.hdr.getZxid();
            LOG.info("snapshotTaken : " + snapshotTaken);
            if (!snapshotTaken) {
                // Apply to db directly if we haven't taken the snapshot
                zk.processTxn(packet.hdr, packet.rec);
            } else {
                packetsNotCommitted.add(packet);
                packetsCommitted.add(qp.getZxid());
            }
            break;
        case Leader.UPTODATE:                                               // UPTODATE: synchronization succeeded; exit the loop
            LOG.info("snapshotTaken : " + snapshotTaken + ", newEpoch:" + newEpoch);
            if (!snapshotTaken) { // true for the pre v1.0 case
                zk.takeSnapshot();
                self.setCurrentEpoch(newEpoch);
            }
            self.cnxnFactory.setZooKeeperServer(zk);                
            break outerLoop;                                                // Break out of the while loop on UPTODATE
        case Leader.NEWLEADER: // it will be NEWLEADER in v1.0              // NEWLEADER: the leftover proposals have been handled; snapshot the in-memory data to disk and reply with an ACK
            LOG.info("newEpoch:" + newEpoch);
            // Create updatingEpoch file and remove it after current
            // epoch is set. QuorumPeer.loadDataBase() uses this file to
            // detect the case where the server was terminated after
            // taking a snapshot but before setting the current epoch.
            File updating = new File(self.getTxnFactory().getSnapDir(),
                                QuorumPeer.UPDATING_EPOCH_FILENAME);
            if (!updating.exists() && !updating.createNewFile()) {
                throw new IOException("Failed to create " +
                                      updating.toString());
            }
            zk.takeSnapshot();
            self.setCurrentEpoch(newEpoch);
            if (!updating.delete()) {
                throw new IOException("Failed to delete " +
                                      updating.toString());
            }
            snapshotTaken = true;
            writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true);
            break;
        }
    }
}

The Follower then sends the ACK corresponding to NEWLEADER and processes the Proposal messages the Leader sent during synchronization; after that, the Follower sits in a while loop, continuously reading and handling packets. Its counterpart, the LearnerHandler, likewise ends up in a while loop, handling the ongoing message exchange with the Follower.

16. Summary

The whole process of Leader/Follower startup, leader election, and Leader-Follower data synchronization involves quite a few steps and many fine details, but fortunately the overall path is fairly clear. If you want to explore it, start with the flow diagram at the top of this article!
