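When people talk about ANR, everyone knows the case where an app hits an ANR. But Android has another kind of ANR that also writes to /data/anr/traces.txt: the system_server ANR. Anyone familiar with the framework knows that system_server runs many critical services, such as AMS and WMS. If one of these services hangs, system stability suffers, so Android introduced a Watchdog. A watchdog is a very common feature in embedded systems: when the system runs away, it can restart it.
For an app ANR, the backtraces are produced by AMS; for system_server, they are triggered by the Watchdog (WD).
WD monitors every important service, for example ActivityManagerService: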
public final class ActivityManagerService extends ActivityManagerNative
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
As this shows, AMS implements Watchdog's Monitor interface:
public void monitor() {
    synchronized (this) { }
}
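All it does is acquire the lock on the AMS object itself. Why is the interface designed this way?
Because many of ActivityManagerService's worker methods begin by taking this same lock on the AMS object, the real purpose of monitor() is to detect how long ActivityManagerService has been stuck inside some operation. If an operation holds the lock past the timeout, monitor() will not have returned, and the watchdog can trigger its reset.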
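To make the lock-probe idea concrete, here is a minimal plain-Java sketch. It is not framework code: FakeService, doWork, probe and probeWhileBusy are hypothetical names made up for illustration, and a simple timed latch stands in for the watchdog's real timing logic.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo; FakeService stands in for a service such as AMS.
class LockProbeDemo {
    interface Monitor {
        void monitor();
    }

    static class FakeService implements Monitor {
        // Like many AMS entry points, the work starts by taking the service lock.
        void doWork(long millis, CountDownLatch lockHeld) {
            synchronized (this) {
                lockHeld.countDown();   // signal that the lock is now held
                sleepQuietly(millis);   // sleeping does not release the lock
            }
        }

        // Returns only once the service lock can be acquired.
        @Override
        public void monitor() {
            synchronized (this) { }
        }
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // True if m.monitor() returns within timeoutMs, false if it is still blocked.
    static boolean probe(Monitor m, long timeoutMs) {
        CountDownLatch returned = new CountDownLatch(1);
        Thread checker = new Thread(() -> {
            m.monitor();
            returned.countDown();
        });
        checker.setDaemon(true);
        checker.start();
        try {
            return returned.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Holds the service lock for busyMs on a worker thread, then probes the service.
    static boolean probeWhileBusy(long busyMs, long timeoutMs) {
        FakeService service = new FakeService();
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread worker = new Thread(() -> service.doWork(busyMs, lockHeld));
        worker.setDaemon(true);
        worker.start();
        try {
            lockHeld.await();           // make sure the worker really owns the lock
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return probe(service, timeoutMs);
    }

    public static void main(String[] args) {
        System.out.println("idle service responds: " + probe(new FakeService(), 1000));
        System.out.println("busy service responds: " + probeWhileBusy(1000, 100));
    }
}
```

An idle service answers the probe at once, while a service whose lock is held for longer than the probe timeout does not. That is the whole trick: monitor() carries no logic of its own and only measures whether the lock is obtainable.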
public class Watchdog extends Thread {
......
}
WD is essentially a thread, and it is a singleton inside system_server.
In system_server:
Watchdog.getInstance().start();
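That is how WD gets started. Next, let's look at what this thread's main loop does.
The code is fairly long, so here is an outline of the stages first.
The thread spends its whole life inside one big while loop:
1. Iterate over all the HandlerChecker objects held by WD and have each checker run its check.
2. Wait for one check interval.
3. Compute the current state of each service.
4. Decide based on the state from step 3: either keep checking, or conclude that the ANR condition holds and start killing the process.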
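As a rough preview of step 4, the sketch below replays a sequence of states (what step 3 would report after each interval) and records what the loop decides. It is a simplification, not the framework's code: the class name, the replay helper, the action strings, the numeric values of the constants and the name OVERDUE are all assumptions made for illustration. Steps 1 and 2 are left out on purpose.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the decision step only.
class WatchdogLoopSketch {
    // Numeric values and the name OVERDUE are assumptions.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // Feeds one state per check interval into the decision logic
    // and returns the action taken in each iteration.
    static List<String> replay(int[] statePerInterval) {
        List<String> actions = new ArrayList<>();
        boolean waitedHalf = false;
        for (int waitState : statePerInterval) {
            if (waitState == COMPLETED) {
                waitedHalf = false;          // everything returned; reset
                actions.add("continue");
            } else if (waitState == WAITING) {
                actions.add("continue");     // still within the first half period
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    actions.add("dump");     // first time past half: dump stacks once
                    waitedHalf = true;
                } else {
                    actions.add("continue");
                }
            } else {
                actions.add("kill");         // a full period has passed: give up
                break;
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        int[] states = {COMPLETED, WAITED_HALF, WAITED_HALF, OVERDUE};
        System.out.println(replay(states));
    }
}
```

The waitedHalf flag is what keeps the halfway stack dump from being repeated on every pass, and a COMPLETED state clears it so that a later stall is dumped again.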
// main body of the thread
@Override
public void run() {
    boolean waitedHalf = false;
    while (true) {
        final ArrayList<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        int debuggerWasConnected = 0;
        synchronized (this) {
            long timeout = CHECK_INTERVAL;
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            /* step 1: iterate over the checkers and have each one run its own check */
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }
            ......
            // NOTE: We use uptimeMillis() here because we do not want to increment the time we
            // wait while asleep. If the device is asleep then the thing that we are waiting
            // to timeout on is asleep as well and won't have a chance to run, causing a false
            // positive on when to kill things.
            /* step 2: the thread waits for one check interval */
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) { // timeout is half a minute
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout); // wait half a minute
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start); // update timeout
            }
            /* step 3: compute the state of all checkers. In essence there are three states:
               1. not blocked, 2. has waited half a period, 3. has waited a full period.
               The half/full distinction exists because the watchdog cuts a service that has
               been stuck for half a period some slack and grants it another half period. */
            final int waitState = evaluateCheckerCompletionLocked();
            /* step 4: decide based on the state from step 3: go around the loop again,
               or conclude that this is an ANR and get ready to kill */
            if (waitState == COMPLETED) { // every checker was confirmed OK in the last half minute, so loop back and keep checking
                // The monitors have returned; reset
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) { // some checker is still in the waiting state within this half minute
                // still waiting but within their configured intervals; back off and recheck
                continue;
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    // We've waited half the deadlock-detection interval. Pull a stack
                    // trace and wait another half.
                    ArrayList<Integer> pids = new ArrayList<Integer>();
                    pids.add(Process.myPid());
                    /* call AMS's dumpStackTraces function to generate the traces.txt file */
                    ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            NATIVE_STACKS_OF_INTEREST);
                    waitedHalf = true;
                }
                continue;
            }
            // something is overdue!
            /* find out which checkers are blocked and build a subject string for the log output */
            blockedCheckers = getBlockedCheckersLocked();
            subject = describeCheckersLocked(blockedCheckers);
            allowRestart = mAllowRestart;
        }
        // If we got here, that means that the system is most likely hung.
        // First collect stack traces from all threads of the system process.
        // Then kill this process so that the system will restart.
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
        ArrayList<Integer> pids = new ArrayList<Integer>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        // Pass !waitedHalf so that just in case we somehow wind up here without having
        // dumped the halfway stacks, we properly re-initialize the trace file.
        final File stack = ActivityManagerService.dumpStackTraces(
                !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
        ......
        /* the rest is cleanup work and printing the WD log; much of it is hard to follow
           and it is not the core idea, so it is skipped here */
}
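The listing calls evaluateCheckerCompletionLocked() without showing its body. One plausible shape for it, given the three states described in the step 3 comment, is sketched below. This is an assumption for illustration, not code taken from the framework: classify and worstOf are hypothetical helper names, the numeric values of the constants and the name OVERDUE are made up, and the 60-second full period is inferred only from the statement that one check interval is half a minute.

```java
// Hypothetical sketch of how checker states could be derived and combined.
class CheckerStateSketch {
    // Ordered so that a larger value means a worse state (assumed values).
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // Classifies one checker from whether it finished and how long it has been running.
    static int classify(boolean completed, long elapsedMs, long maxWaitMs) {
        if (completed) {
            return COMPLETED;
        }
        if (elapsedMs < maxWaitMs / 2) {
            return WAITING;       // less than half a period so far
        }
        if (elapsedMs < maxWaitMs) {
            return WAITED_HALF;   // past half, not yet a full period
        }
        return OVERDUE;           // a full period or more
    }

    // The overall state is the worst state among all checkers.
    static int worstOf(int[] states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }

    public static void main(String[] args) {
        long maxWaitMs = 60_000; // assumed full period: two half-minute intervals
        int[] states = {
            classify(true, 5_000, maxWaitMs),
            classify(false, 35_000, maxWaitMs)
        };
        System.out.println(worstOf(states) == WAITED_HALF);
    }
}
```

Taking the maximum means a single stuck checker is enough to move the whole loop into the halfway or overdue branch, which matches the way the listing treats waitState as one value for all checkers.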
So the core question now shifts to HandlerChecker:
public final class HandlerChecker implements Runnable {
}
HandlerChecker implements the Runnable interface, which essentially means it can be posted to a Looper and then run on its own:
@Override
public void run() { // what this checker runs
    final int size = mMonitors.size();
    for (int i = 0 ; i < size ; i++) {
        synchronized (Watchdog.this) {
            mCurrentMonitor = mMonitors.get(i);
        }
        mCurrentMonitor.monitor();
    }
    // if nothing above got stuck, execution reaches here and mCompleted is set to true
    synchronized (Watchdog.this) {
        mCompleted = true;
        mCurrentMonitor = null;
    }
}
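This runnable iterates over all the monitors and calls each monitor's monitor() method.
These monitors are simply AMS, WMS and the like. Each of these services implements WD's Monitor interface; when a service comes up, it obtains the WD singleton and calls addMonitor to add itself.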
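The registration and the checker pass can be put together in a small plain-Java sketch. It is only an approximation under stated assumptions: MiniWatchdog, scheduleCheck, isCompleted and checkCompletes are hypothetical names, and a single-thread ExecutorService stands in for the Looper that the real checker is posted to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical miniature of the register-then-check pattern.
class MiniWatchdog {
    interface Monitor {
        void monitor();
    }

    private static final MiniWatchdog INSTANCE = new MiniWatchdog();

    // Services would fetch this singleton and register themselves.
    static MiniWatchdog getInstance() {
        return INSTANCE;
    }

    private final List<Monitor> monitors = new ArrayList<>();
    private boolean completed = true;

    synchronized void addMonitor(Monitor m) {
        monitors.add(m);
    }

    synchronized boolean isCompleted() {
        return completed;
    }

    // Posts one check pass to the executor (the stand-in for a Looper).
    void scheduleCheck(ExecutorService looper) {
        final List<Monitor> snapshot;
        synchronized (this) {
            completed = false;
            snapshot = new ArrayList<>(monitors);
        }
        looper.execute(() -> {
            for (Monitor m : snapshot) {
                m.monitor();             // blocks here if a service is stuck
            }
            synchronized (this) {
                completed = true;        // reached only if no monitor got stuck
            }
        });
    }

    // Test helper: registers the monitors on a fresh instance, runs one pass,
    // and reports whether it finished within waitMs.
    static boolean checkCompletes(long waitMs, Monitor... ms) {
        MiniWatchdog wd = new MiniWatchdog();
        for (Monitor m : ms) {
            wd.addMonitor(m);
        }
        ExecutorService looper = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
        wd.scheduleCheck(looper);
        try {
            Thread.sleep(waitMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        boolean done = wd.isCompleted(); // read before tearing the executor down
        looper.shutdownNow();
        return done;
    }

    public static void main(String[] args) {
        System.out.println(checkCompletes(500, () -> { }, () -> { }));
    }
}
```

Because the monitors run one after another on the same thread, a single monitor that never returns keeps the completed flag false for the whole pass, which is exactly the signal the main loop looks for.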
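That is how the watchdog keeps system_server running reliably. A system_server ANR does not pop up a dialog, though; the process simply dies. Once the apps detect that the Binder server side has gone away, they die one after another as well, and the system then restarts automatically.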