The System_Server ANR You Didn't Know About

May 6, 2023 · Source: Kelvin wu

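When people talk about ANRs, the case everyone knows is the app ANR. Android has another kind of ANR, though, which also writes to /data/anr/traces.txt: the system_server ANR. Anyone familiar with the framework knows that system_server hosts many critical services, such as AMS and WMS. If one of these services gets stuck, system stability suffers, so Android introduced a Watchdog. A watchdog is a common feature in embedded systems: when the system runs away, the watchdog restarts it.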

For an app ANR, the backtraces are produced by AMS; for system_server, the dump is triggered by the Watchdog (WD).

WD monitors every important service, for example ActivityManagerService:

public final class ActivityManagerService extends ActivityManagerNative
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {

This shows that AMS implements Watchdog's Monitor interface:

public void monitor() {
        synchronized (this) { }
    }

All it does is acquire the lock on the AMS object itself. Why is the interface designed this way?

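Because many of ActivityManagerService's real worker methods begin by taking this same lock on the AMS object. The practical purpose of monitor() is therefore to detect how long ActivityManagerService has been stuck inside some operation: while another thread holds the lock, monitor() cannot return, and if that lasts beyond the timeout the watchdog triggers its reset.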
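The lock-probe idea can be tried outside Android. The following is a minimal sketch, not framework code: FakeService, LockProbeDemo, probe and probeWhileBusy are hypothetical names invented for this illustration, and the timeouts are arbitrary. It only shows that an empty synchronized block returns promptly when the service is idle and stalls while a worker holds the same lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a service such as AMS; not framework code. */
class FakeService {
    /** A worker entry point: takes the service lock first, then "works" for holdMillis. */
    void doWork(long holdMillis, CountDownLatch lockTaken) {
        synchronized (this) {
            lockTaken.countDown();
            try {
                Thread.sleep(holdMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Same shape as the monitor() above: returns only once the lock can be acquired. */
    void monitor() {
        synchronized (this) { }
    }
}

class LockProbeDemo {
    /** Returns true if service.monitor() came back within timeoutMillis. */
    static boolean probe(FakeService service, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread checker = new Thread(() -> {
            service.monitor();
            done.countDown();
        });
        checker.setDaemon(true);
        checker.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Starts a worker that holds the lock for holdMillis, then probes the service. */
    static boolean probeWhileBusy(long holdMillis, long timeoutMillis) {
        FakeService service = new FakeService();
        CountDownLatch lockTaken = new CountDownLatch(1);
        Thread worker = new Thread(() -> service.doWork(holdMillis, lockTaken));
        worker.setDaemon(true);
        worker.start();
        try {
            lockTaken.await(); // make sure the worker really holds the lock
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return probe(service, timeoutMillis);
    }
}
```

Calling LockProbeDemo.probe(new FakeService(), 1000) on an idle service succeeds, while LockProbeDemo.probeWhileBusy(2000, 200) times out because the worker still holds the lock. The probe itself does no work; it only measures whether the lock is reachable.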

public class Watchdog extends Thread {
......
}

WD is essentially a thread, and it is a singleton within system_server.

In system_server:

Watchdog.getInstance().start();

That is how WD gets started. Next, let's look at what the thread's main loop does.

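The code is fairly long, so here is an outline of the phases first.

The thread stays inside one big while loop, and each iteration does the following:

1. Iterate over all HandlerChecker objects held by WD and have each checker schedule its own check.

2. Wait for one check interval.

3. Compute the current state of the monitored services.

4. Decide based on the state from step 3: either keep checking, or conclude that the ANR condition is met and start killing the process.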

// main body of the thread
    @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
/* Step 1: iterate over every checker and have it schedule its own check */
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }

......
                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
/* Step 2: wait for one check interval */
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) { // timeout is half a minute
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        wait(timeout); // wait for half a minute
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);//update timeout
                }

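/* Step 3: compute the state of all checkers. There are essentially three outcomes: 1. not blocked (COMPLETED or WAITING), 2. blocked for half the timeout (WAITED_HALF), 3. blocked for the full timeout (overdue). Half and full are distinguished because the watchdog gives a service that has been stuck for half the timeout one more half as a grace period. */
                final int waitState = evaluateCheckerCompletionLocked();
/* Step 4: decide based on the state from step 3: go around the loop again, or conclude that this is an ANR and get ready to kill. */
                if (waitState == COMPLETED) { // every checker finished within the last half minute; loop back and keep checking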
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) { // some checker has not finished yet, but is still within the first half of its timeout
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
/* Call AMS's dumpStackTraces() to generate the traces.txt file */
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);
                        waitedHalf = true;
                    }
                    continue;
                }

                // something is overdue!
/* Find out which checkers are blocked, and build a subject string for the log output */
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);

......
/* The rest does cleanup and prints the WD logs. Much of it is hard to follow and it is not part of the core idea, so it is skipped here. */
    }
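The listing calls evaluateCheckerCompletionLocked() without showing its body. As a rough sketch of how the four states could be derived from how long a checker has been waiting, the following standalone class may help. The class name, the numeric constant values, the helpers stateOf and worstOf, and the 60-second full timeout used in the usage note are assumptions for illustration (the 60 seconds is inferred from the half-minute check interval); this is not the framework's implementation.

```java
/** Hypothetical sketch of the checker states; not the framework's implementation. */
class CheckerStateSketch {
    // Ordered so that a larger value means a worse state (values are assumed).
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** State of a single checker, given how long its current check has been pending. */
    static int stateOf(boolean completed, long waitedMillis, long maxWaitMillis) {
        if (completed) {
            return COMPLETED;              // monitor() calls all returned
        }
        if (waitedMillis < maxWaitMillis / 2) {
            return WAITING;                // still within the first half
        }
        if (waitedMillis < maxWaitMillis) {
            return WAITED_HALF;            // grace period: dump stacks, wait another half
        }
        return OVERDUE;                    // full timeout exceeded
    }

    /** The overall state is the worst state among all checkers. */
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int state : states) {
            worst = Math.max(worst, state);
        }
        return worst;
    }
}
```

With a 60-second timeout, stateOf(false, 10_000, 60_000) is WAITING and stateOf(false, 30_000, 60_000) is WAITED_HALF. Taking the maximum means a single stuck checker is enough to move the whole watchdog toward the kill path.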

The focus now shifts to HandlerChecker:

public final class HandlerChecker implements Runnable {
}

HandlerChecker implements Runnable, which means it can be posted to a Looper and then run there on its own:

        @Override
        public void run() { // body of this checker
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }

            // none of the monitor() calls above got stuck, so we reach here and set mCompleted to true
            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }

This Runnable iterates over all monitors and calls each one's monitor() method.

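In fact, these monitors are simply AMS, WMS and the other services. Each of them implements WD's Monitor interface; when a service starts up, it fetches the WD singleton and registers itself with addMonitor.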
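The registration and scheduling flow can be sketched in plain Java. This is a simplified illustration under stated assumptions: MiniWatchdog, scheduleCheck, awaitCompleted and newDaemonThread are hypothetical names, and a single-thread ExecutorService stands in for the Looper thread that the real HandlerChecker is posted to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical, simplified watchdog; an executor stands in for a Looper thread. */
class MiniWatchdog {
    /** Hypothetical counterpart of the Monitor interface that services implement. */
    interface Monitor { void monitor(); }

    private final List<Monitor> monitors = new ArrayList<>();
    private final ExecutorService serviceThread;
    private boolean completed = true;

    MiniWatchdog(ExecutorService serviceThread) {
        this.serviceThread = serviceThread;
    }

    /** Creates a single daemon thread to play the role of the monitored thread. */
    static ExecutorService newDaemonThread() {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "fake-service-thread");
            t.setDaemon(true);
            return t;
        });
    }

    /** Services call this at startup to register themselves. */
    synchronized void addMonitor(Monitor m) {
        monitors.add(m);
    }

    /** Posts one check round unless the previous round is still in flight. */
    synchronized void scheduleCheck() {
        if (!completed) {
            return;
        }
        completed = false;
        List<Monitor> snapshot = new ArrayList<>(monitors);
        serviceThread.execute(() -> {
            for (Monitor m : snapshot) {
                m.monitor(); // blocks here if a service lock is held elsewhere
            }
            synchronized (this) {
                completed = true;
                notifyAll();
            }
        });
    }

    /** Waits up to timeoutMillis for the current round; returns whether it finished. */
    synchronized boolean awaitCompleted(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!completed) {
            long remaining = (deadline - System.nanoTime()) / 1_000_000L;
            if (remaining <= 0) {
                return false;
            }
            try {
                wait(remaining);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

A service would register with something like wd.addMonitor(() -> { synchronized (lock) { } }), where lock is whatever object its worker methods synchronize on. If some thread holds that lock when scheduleCheck() runs, awaitCompleted() times out, which corresponds to the point where the real watchdog starts counting toward a kill.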

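This is how the watchdog guards the stable operation of system_server. A system_server ANR does not pop up a dialog, though; the process is simply killed. Once apps detect that the server end of their binder connections has died, they die one after another as well, and the system then restarts automatically.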

    Original author: Kelvin wu
    Original URL: https://zhuanlan.zhihu.com/p/20488872