The System_Server ANR You Didn't Know About

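When people talk about ANR, what everyone knows is an app ANR. But Android has another kind of ANR that also writes to /data/anr/traces.txt: the system_server ANR. Anyone familiar with the framework knows that system_server runs many important services, such as AMS and WMS. If one of these services gets stuck, system stability suffers, so Android introduced a Watchdog. A watchdog is a common feature in embedded systems: when the system runs away, it can restart it.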

For an app ANR the backtraces are generated by AMS; for system_server they are generated by the Watchdog (WD).

WD monitors every important service, for example ActivityManagerService:

public final class ActivityManagerService extends ActivityManagerNative
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {

This shows that AMS implements Watchdog's Monitor interface:

public void monitor() {
        synchronized (this) { }
    }

All it does is acquire the lock on the AMS object itself. Why is the interface designed this way?

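Because many of the functions that do the real work in ActivityManagerService begin by acquiring the lock on the AMS object. So what monitor() really measures is how long ActivityManagerService has been stuck inside some operation. If an operation holds the lock past the timeout, monitor() cannot return, and that lets the watchdog trigger its reset.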
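To make the lock-probing idea concrete, here is a minimal plain-Java sketch. It is not AOSP code: LockMonitorSketch, FakeService, holdLock and demo are hypothetical names, and the 200 ms probe window is an arbitrary assumption. It only illustrates that an empty synchronized block cannot finish while another thread is stuck inside the same object's lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a service such as AMS; not the real class.
class LockMonitorSketch {
    interface Monitor { void monitor(); }

    static class FakeService implements Monitor {
        // Real work enters through methods that take the object's own lock.
        void holdLock(CountDownLatch entered, CountDownLatch release) {
            synchronized (this) {
                entered.countDown();          // the lock is held from here on
                try {
                    release.await();          // simulate being stuck in an operation
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        @Override
        public void monitor() {
            synchronized (this) { }           // returns only once the lock is free
        }
    }

    /** Returns {monitorReturnedWhileStuck, monitorReturnedAfterRelease}. */
    static boolean[] demo() {
        final FakeService service = new FakeService();
        final CountDownLatch entered = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);
        final CountDownLatch monitorDone = new CountDownLatch(1);
        try {
            Thread worker = new Thread(() -> service.holdLock(entered, release));
            worker.start();
            entered.await();                  // the service lock is definitely held now

            Thread checker = new Thread(() -> {
                service.monitor();
                monitorDone.countDown();
            });
            checker.start();

            // Assumed probe window of 200 ms; the real watchdog waits far longer.
            boolean whileStuck = monitorDone.await(200, TimeUnit.MILLISECONDS);
            release.countDown();              // the stuck operation finishes
            boolean afterRelease = monitorDone.await(5, TimeUnit.SECONDS);
            worker.join();
            checker.join();
            return new boolean[] { whileStuck, afterRelease };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("monitor returned while stuck: " + r[0]);
        System.out.println("monitor returned after release: " + r[1]);
    }
}
```

The probe does no work of its own; it only inherits the delay of whoever currently holds the lock, which is why such a cheap method is enough to detect a stuck service.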

public class Watchdog extends Thread {
......
}

WD is essentially a thread, and it is a singleton within system_server.

In system_server:

Watchdog.getInstance().start();

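This is how WD is started. Next, let's look at what the thread's main loop does.

The code is fairly long, so here is an outline of each stage first.

The thread stays inside one big while loop:

1. Iterate over all HandlerChecker objects held by WD and let each checker schedule its check.

2. Wait for one check interval.

3. Evaluate the current state of the services.

4. Based on the state from step 3, decide whether to keep checking or to conclude that the ANR condition is met and start killing the process.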

// main body of the thread
    @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
/* step 1: iterate over every checker and have it schedule its own check */
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }

......
                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
/* step 2: wait for one check interval */
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {// timeout is half a minute
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        wait(timeout);// wait up to half a minute
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);//update timeout
                }

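/* step 3: evaluate the state of all checkers. The code below distinguishes four outcomes: completed, still waiting (less than half the timeout), waited half the timeout, and overdue. The half/full distinction exists because the watchdog cuts a service that has been stuck for half the timeout some slack and grants it another half. */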
                final int waitState = evaluateCheckerCompletionLocked();
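/* step 4: based on the state from step 3, decide whether to go around the loop again or to conclude that an ANR has occurred and prepare to kill */
                if (waitState == COMPLETED) {// every checker finished within this half minute: loop back and keep checking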
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {// some checker has not finished yet, but has waited less than half its timeout
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
/* call AMS's dumpStackTraces to generate the traces.txt file */
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);
                        waitedHalf = true;
                    }
                    continue;
                }

                // something is overdue!
/* find out which checkers are blocked and build a subject string for the log output */
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);

......
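/* The remaining code does wrap-up work such as writing the WD log. I don't fully understand much of it, and it is not part of the core idea, so it is skipped here. */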
    }
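The decision in steps 3 and 4 can be pictured with a small plain-Java sketch. This is not the body of evaluateCheckerCompletionLocked(), which the excerpt above does not show. The constant values, the stateOf and evaluate helpers, and the 60-second timeout are assumptions chosen to match the half-minute check interval mentioned in the comments.

```java
// A sketch only: state names mirror those used in run(), everything else is hypothetical.
class CheckerStateSketch {
    static final int COMPLETED = 0;    // numeric values are assumed for this sketch
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /**
     * Classifies a single checker.
     *
     * @param completed whether the checker's run() has finished
     * @param startMs   uptime at which the check was scheduled
     * @param nowMs     current uptime
     * @param timeoutMs full timeout, assumed to be two check intervals
     */
    static int stateOf(boolean completed, long startMs, long nowMs, long timeoutMs) {
        if (completed) {
            return COMPLETED;
        }
        long waited = nowMs - startMs;
        if (waited < timeoutMs / 2) {
            return WAITING;            // blocked, but for less than half the timeout
        }
        if (waited < timeoutMs) {
            return WAITED_HALF;        // blocked for at least half: dump stacks, grant more time
        }
        return OVERDUE;                // blocked for the whole timeout
    }

    /** The overall state is taken to be the worst state among all checkers. */
    static int evaluate(int[] checkerStates) {
        int worst = COMPLETED;
        for (int s : checkerStates) {
            worst = Math.max(worst, s);
        }
        return worst;
    }

    public static void main(String[] args) {
        long timeout = 60_000L;        // assumed full timeout of one minute
        int[] states = {
            stateOf(true, 0, 30_000, timeout),
            stateOf(false, 0, 30_000, timeout),
        };
        System.out.println("overall state = " + evaluate(states));
    }
}
```

Taking the worst state means one stuck checker is enough to move the whole watchdog toward the kill path, even when every other checker has completed.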

The focus now shifts to HandlerChecker:

public final class HandlerChecker implements Runnable {
}

HandlerChecker implements the Runnable interface, which means it can be posted to a Looper and then run on that Looper's thread.

        @Override
        public void run() {// the body of this checker
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }

			// if nothing above got stuck, we get here and set mCompleted to true
            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }

This Runnable iterates over all the monitors and calls each one's monitor() method.

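In fact, these monitors are simply services such as AMS and WMS. Each of them implements WD's Monitor interface; when a service starts up, it gets the WD singleton and calls addMonitor to register itself.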
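The registration pattern can be sketched in plain Java as follows. MiniWatchdog, FakeService, start and runCheck are hypothetical names for illustration and do not reproduce the AOSP classes. Unlike the real monitor(), which only takes and releases the lock, the fake one also counts its calls so that the effect is observable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the registration pattern; not the AOSP classes.
class MonitorRegistrySketch {
    interface Monitor { void monitor(); }

    static class MiniWatchdog {
        private static final MiniWatchdog INSTANCE = new MiniWatchdog();
        private final List<Monitor> monitors = new ArrayList<>();
        private boolean completed = true;

        static MiniWatchdog getInstance() { return INSTANCE; }

        synchronized void addMonitor(Monitor m) { monitors.add(m); }

        synchronized boolean isCompleted() { return completed; }

        // Plays the role of HandlerChecker.run(): poll each monitor in turn.
        void runCheck() {
            List<Monitor> snapshot;
            synchronized (this) {
                completed = false;
                snapshot = new ArrayList<>(monitors);
            }
            for (Monitor m : snapshot) {
                m.monitor();           // would block here if a service were stuck
            }
            synchronized (this) {
                completed = true;      // reached only if no monitor blocked
            }
        }
    }

    static class FakeService implements Monitor {
        private int polls;

        // Called when the service starts: register with the watchdog.
        void start(MiniWatchdog wd) { wd.addMonitor(this); }

        @Override
        public void monitor() {
            synchronized (this) { polls++; }   // counting is for demonstration only
        }

        synchronized int polls() { return polls; }
    }

    public static void main(String[] args) {
        MiniWatchdog wd = MiniWatchdog.getInstance();
        FakeService ams = new FakeService();
        FakeService wms = new FakeService();
        ams.start(wd);
        wms.start(wd);
        wd.runCheck();
        System.out.println("completed=" + wd.isCompleted()
                + " amsPolls=" + ams.polls() + " wmsPolls=" + wms.polls());
    }
}
```

Because each service registers itself, the watchdog needs no knowledge of concrete service classes; it only iterates over whatever implements the interface.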

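That is how the watchdog helps keep system_server running stably. Unlike an app ANR, a system_server ANR does not pop up a dialog; the system_server process simply goes down. Apps then detect that the server side of their Binder connections has died and die as well, after which the system restarts automatically.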

    Original author: Kelvin wu
    Original URL: https://zhuanlan.zhihu.com/p/20488872