Tracking Down a ZooKeeper Data Inconsistency (3.4.11)

October 7, 2023. Source: Jiang阿涵

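Symptoms

While ZooKeeper was serving reads and writes, a leader re-election occurred; after the nodes were then restarted, their data was inconsistent. Take a cluster that originally has three nodes: A, B, and C.

An ephemeral node znode1 is created and is visible on A, B, and C. At this point B is the leader.

After restarting all three nodes, znode1 is still visible on A and C but not on B, and restarting the ZooKeeper processes again does not fix it.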

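Analysis

Examining ZooKeeper's transaction logs shows that B's log contains a few more entries than A's and C's, and that these entries are transactions of type CloseSession. In other words, on B the ephemeral nodes in question have already been deleted, while on A and C they have not.

At this point the logs on A, B, and C look like this: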

A's log: txn1, txn2, txn3, txn8, txn9, txn10

B's log: txn1, txn2, txn3, txn4, txn5, txn6, txn7, txn8, txn9, txn10

C's log: txn1, txn2, txn3, txn8, txn9, txn10

That is, neither A nor C has B's entries txn4, txn5, txn6, and txn7.
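The comparison above can be reproduced mechanically once each server's log has been dumped into a list of transaction ids. The following is a minimal sketch using the simplified ids from this post; `TxnLogDiff` and `missingFrom` are hypothetical names made up for illustration, not part of ZooKeeper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TxnLogDiff {
    /** Returns the entries of reference that do not appear in other, in log order. */
    static List<String> missingFrom(List<String> reference, List<String> other) {
        Set<String> present = new HashSet<>(other);
        List<String> missing = new ArrayList<>();
        for (String txn : reference) {
            if (!present.contains(txn)) {
                missing.add(txn);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Simplified transaction ids taken from the logs listed in the post.
        List<String> a = Arrays.asList("txn1", "txn2", "txn3", "txn8", "txn9", "txn10");
        List<String> b = Arrays.asList("txn1", "txn2", "txn3", "txn4", "txn5",
                "txn6", "txn7", "txn8", "txn9", "txn10");

        // Entries that B has but A lacks.
        System.out.println(missingFrom(b, a)); // prints [txn4, txn5, txn6, txn7]
    }
}
```

Running the same check in the other direction (entries of A missing from B) returns an empty list, which confirms that B's log is a strict superset here.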

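So much for consistency!!!

Digging further: what happened on B around the time it wrote the CloseSession transactions? B's logs show that shortly after B wrote the transaction (call it txn4), a leader re-election took place because of a network outage.

The re-election made B the leader again, so B started syncing its log with A and C. The sync prints the range of transactions it covers; I checked and found that A had only been sent the entries before txn4.

That makes no sense!

Next I went to the source code and found that the sync range is determined by the highest transaction id held in memory. Note: in memory, not the real highest id on disk. After a re-election, since the old ZooKeeper Server has been shut down, the new ZooKeeper Server should in theory reload the log and bring the in-memory maximum up to date as well.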
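To make the consequence concrete, here is a toy model, not ZooKeeper's actual sync code. It assumes, purely for illustration, that the leader sends a follower every logged transaction whose id is above the follower's last id and at most the leader's in-memory maximum. `SyncRange` and `txnsToSend` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SyncRange {
    /**
     * Toy rule (an assumption of this sketch): send every logged id in
     * (followerLastZxid, leaderMaxZxid], where leaderMaxZxid is whatever
     * maximum the leader believes it has.
     */
    static List<Long> txnsToSend(List<Long> leaderLog, long followerLastZxid, long leaderMaxZxid) {
        List<Long> toSend = new ArrayList<>();
        for (long zxid : leaderLog) {
            if (zxid > followerLastZxid && zxid <= leaderMaxZxid) {
                toSend.add(zxid);
            }
        }
        return toSend;
    }

    public static void main(String[] args) {
        List<Long> leaderLogOnDisk = Arrays.asList(1L, 2L, 3L, 4L);

        // Stale in-memory maximum (3): txn4 is on disk but never reaches the follower.
        System.out.println(txnsToSend(leaderLogOnDisk, 1L, 3L)); // prints [2, 3]

        // Maximum taken from disk (4): the follower receives txn4 as well.
        System.out.println(txnsToSend(leaderLogOnDisk, 1L, 4L)); // prints [2, 3, 4]
    }
}
```

With the stale upper bound, the follower ends the sync believing it is up to date even though the leader's disk holds a later entry, which matches the gap seen in the logs of A and C.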

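So why didn't that happen?

Reading on, it turned out that when the ZooKeeper Server shuts down, some developer, to improve performance (my guess), chose not to close the db that belongs to the server (the db covers the data both on disk and in memory). That way a newly constructed ZooKeeper Server can use this db directly. And precisely because of that, the in-memory part of the db no longer matched the data on disk. I checked the date of the change: February 2017. Buddy, the ZooKeeper source really isn't something to change casually.

At this point I was basically sure that the bug was caused by this change. Before filing a bug report, I took a look at the ZooKeeper 3.4.12 code on GitHub, and sure enough, someone had already fixed it.

The original code looked like this: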

    /**
     * Shut down the server instance
     * @param fullyShutDown true if another server using the same database will not replace this one in the same process
     */
    public synchronized void shutdown(boolean fullyShutDown) {
        if (!canShutdown()) {
            LOG.debug("ZooKeeper server is not running, so not proceeding to shutdown!");
            return;
        }
        LOG.info("shutting down");

        // new RuntimeException("Calling shutdown").printStackTrace();
        setState(State.SHUTDOWN);
        // Since sessionTracker and syncThreads poll we just have to
        // set running to false and they will detect it during the poll
        // interval.
        if (sessionTracker != null) {
            sessionTracker.shutdown();
        }
        if (firstProcessor != null) {
            firstProcessor.shutdown();
        }

        if (fullyShutDown && zkDb != null) {
            zkDb.clear();
        }
        // else there is no need to clear the database
        //  * When a new quorum is established we can still apply the diff
        //    on top of the same zkDb data
        //  * If we fetch a new snapshot from leader, the zkDb will be
        //    cleared anyway before loading the snapshot

        unregisterJMX();
    }

The new code has been changed to this:

    /**
     * Shut down the server instance
     * @param fullyShutDown true if another server using the same database will not replace this one in the same process
     */
    public synchronized void shutdown(boolean fullyShutDown) {
        if (!canShutdown()) {
            LOG.debug("ZooKeeper server is not running, so not proceeding to shutdown!");
            return;
        }
        LOG.info("shutting down");

        // new RuntimeException("Calling shutdown").printStackTrace();
        setState(State.SHUTDOWN);
        // Since sessionTracker and syncThreads poll we just have to
        // set running to false and they will detect it during the poll
        // interval.
        if (sessionTracker != null) {
            sessionTracker.shutdown();
        }
        if (firstProcessor != null) {
            firstProcessor.shutdown();
        }

        if (zkDb != null) {
            if (fullyShutDown) {
                zkDb.clear();
            } else {
                // else there is no need to clear the database
                //  * When a new quorum is established we can still apply the diff
                //    on top of the same zkDb data
                //  * If we fetch a new snapshot from leader, the zkDb will be
                //    cleared anyway before loading the snapshot
                try {
                    //This will fast forward the database to the latest recorded transactions
                    zkDb.fastForwardDataBase();
                } catch (IOException e) {
                    LOG.error("Error updating DB", e);
                    zkDb.clear();
                }
            }
        }

        unregisterJMX();
    }
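The key difference is the call to `zkDb.fastForwardDataBase()` when the db is kept for reuse. The toy model below shows why catching up matters. It is not ZooKeeper's implementation: `ToyDatabase` and all of its methods are hypothetical, and it assumes that a transaction can be appended to the on-disk log before it has been applied in memory.

```java
import java.util.ArrayList;
import java.util.List;

final class ToyDatabase {
    // Stand-in for the on-disk transaction log.
    private final List<Long> diskLog = new ArrayList<>();
    // Stand-in for the in-memory state: the highest applied transaction id.
    private long lastProcessedZxid = 0;

    void appendToLog(long zxid) {
        diskLog.add(zxid);
    }

    void applyInMemory(long zxid) {
        lastProcessedZxid = Math.max(lastProcessedZxid, zxid);
    }

    long getLastProcessedZxid() {
        return lastProcessedZxid;
    }

    /** Replays every logged transaction that memory has not seen yet. */
    long fastForward() {
        for (long zxid : diskLog) {
            if (zxid > lastProcessedZxid) {
                applyInMemory(zxid);
            }
        }
        return lastProcessedZxid;
    }

    public static void main(String[] args) {
        ToyDatabase db = new ToyDatabase();
        for (long zxid = 1; zxid <= 3; zxid++) {
            db.appendToLog(zxid);
            db.applyInMemory(zxid);
        }
        // txn4 reaches the log, but the server stops before applying it in memory.
        db.appendToLog(4);

        // Reusing the db as-is: memory still ends at 3.
        System.out.println(db.getLastProcessedZxid()); // prints 3

        // Reusing the db after catching up with the log: memory ends at 4.
        System.out.println(db.fastForward()); // prints 4
    }
}
```

In this model, skipping `fastForward()` before handing the db to a new server reproduces the mismatch described above: the log holds an entry that the in-memory maximum does not cover.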