Android ANR源码原理分析

 

ANR即Application Not Responding,顾名思义就是应用程序无响应。在Android中,一般情况下,四大组件均是工作在主线程中的,Android中的Activity Manager和Window Manager会随时监控应用程序的响应情况,如果因为一些耗时操作(网络请求或者IO操作)造成主线程阻塞一定时间(例如造成5s内不能响应用户事件或者BroadcastReceiver的onReceive方法执行时间超过10s),那么系统就会显示ANR对话框提示用户对应的应用处于无响应状态。

虽然每个程序员都不想ANR发生在自己的头上,因此,你需要严格遵守Google提供的一系列建议(看这里),简单总结就是以下两点:

1. 不要让主线程干耗时的工作
2. 不要让其他线程阻塞主线程的执行

因此,要尽量保证主线程执行工作干净利落,一个消息循环执行时间最好不超过100ms到200ms,对于一些脏活累活可以交给AsyncTask、HandlerThread、IntentService或者另外起的新线程来完成,这样应用程序就能够及时响应用户的操作而不会给用户带来卡顿的感觉。

 

一、概述

ANR(Application Not responding),是指应用程序未响应,Android系统对于一些事件需要在一定的时间范围内完成,如果超过预定时间能未能得到有效响应或者响应时间过长,都会造成ANR。一般地,这时往往会弹出一个提示框,告知用户当前xxx未响应,用户可选择继续等待或者Force Close。

那么哪些场景会造成ANR呢?

  • Service Timeout:服务在20s内未执行完成;
  • BroadcastQueue Timeout:比如前台广播在10s内执行完成
  • ContentProvider Timeout:内容提供者执行超时
  • inputDispatching Timeout: 输入事件分发超时5s,包括按键分发事件的超时。

二、ANR触发时机

2.1 Service Timeout

Service Timeout触发时机,简单说就是AMS中的mHandler收到SERVICE_TIMEOUT_MSG消息时触发。

在前面文章startService流程分析详细介绍Service启动流程,在Service所在进程attach到system_server进程的过程中会调用realStartServiceLocked()方法

2.1.1 realStartServiceLocked

[-> ActiveServices.java]

private final void realStartServiceLocked(ServiceRecord r, ProcessRecord app, boolean execInFg) throws RemoteException { ... //发送delay消息(SERVICE_TIMEOUT_MSG),【见小节2.1.2】 bumpServiceExecutingLocked(r, execInFg, "create"); try { ... //最终执行服务的onCreate()方法 app.thread.scheduleCreateService(r, r.serviceInfo, mAm.compatibilityInfoForPackageLocked(r.serviceInfo.applicationInfo), app.repProcState); } catch (DeadObjectException e) { ... } finally { if (!created) { //当service启动完毕,则remove SERVICE_TIMEOUT_MSG消息【见小节2.1.3】 serviceDoneExecutingLocked(r, inDestroying, inDestroying); ... } } } 

2.1.2 bumpServiceExecutingLocked

该方法的主要工作发送delay消息(SERVICE_TIMEOUT_MSG)

private final void bumpServiceExecutingLocked(ServiceRecord r, boolean fg, String why) { ... scheduleServiceTimeoutLocked(r.app); } void scheduleServiceTimeoutLocked(ProcessRecord proc) { if (proc.executingServices.size() == 0 || proc.thread == null) { return; } long now = SystemClock.uptimeMillis(); Message msg = mAm.mHandler.obtainMessage( ActivityManagerService.SERVICE_TIMEOUT_MSG); msg.obj = proc; //当超时后仍没有remove该SERVICE_TIMEOUT_MSG消息,则执行service Timeout流程【见2.1.4】 mAm.mHandler.sendMessageAtTime(msg, proc.execServicesFg ? (now+SERVICE_TIMEOUT) : (now+ SERVICE_BACKGROUND_TIMEOUT)); } 
  • 对于前台服务,则超时为SERVICE_TIMEOUT,即timeout=20s;
  • 对于后台服务,则超时为SERVICE_BACKGROUND_TIMEOUT,即timeout=200s;

2.1.3 serviceDoneExecutingLocked

该方法的主要工作是当service启动完成,则移除service Timeout消息。

private void serviceDoneExecutingLocked(ServiceRecord r, boolean inDestroying, boolean finishing) { ... if (r.executeNesting <= 0) { if (r.app != null) { r.app.execServicesFg = false; r.app.executingServices.remove(r); if (r.app.executingServices.size() == 0) { //当前服务所在进程中没有正在执行的service mAm.mHandler.removeMessages(ActivityManagerService.SERVICE_TIMEOUT_MSG, r.app); ... } ... } 

2.1.4 SERVICE_TIMEOUT_MSG

到此不难理解,当SERVICE_TIMEOUT_MSG消息成功发送时,则AMS中的mHandler收到该消息则触发调用serviceTimeout

final class MainHandler extends Handler { public void handleMessage(Message msg) { switch (msg.what) { case SERVICE_TIMEOUT_MSG: { ... //【见小节2.1.5】 mServices.serviceTimeout((ProcessRecord)msg.obj); } break; ... } ... } } 

2.1.5 serviceTimeout

[-> ActiveServices.java]

void serviceTimeout(ProcessRecord proc) { String anrMessage = null; synchronized(mAm) { if (proc.executingServices.size() == 0 || proc.thread == null) { return; } final long now = SystemClock.uptimeMillis(); final long maxTime = now - (proc.execServicesFg ? SERVICE_TIMEOUT : SERVICE_BACKGROUND_TIMEOUT); ServiceRecord timeout = null; long nextTime = 0; for (int i=proc.executingServices.size()-1; i>=0; i--) { ServiceRecord sr = proc.executingServices.valueAt(i); if (sr.executingStart < maxTime) { timeout = sr; break; } if (sr.executingStart > nextTime) { nextTime = sr.executingStart; } } if (timeout != null && mAm.mLruProcesses.contains(proc)) { Slog.w(TAG, "Timeout executing service: " + timeout); StringWriter sw = new StringWriter(); PrintWriter pw = new FastPrintWriter(sw, false, 1024); pw.println(timeout); timeout.dump(pw, " "); pw.close(); mLastAnrDump = sw.toString(); mAm.mHandler.removeCallbacks(mLastAnrDumpClearer); mAm.mHandler.postDelayed(mLastAnrDumpClearer, LAST_ANR_LIFETIME_DURATION_MSECS); anrMessage = "executing service " + timeout.shortName; } } if (anrMessage != null) { //当存在timeout的service,则执行appNotResponding【见小节3.1】 mAm.appNotResponding(proc, null, null, false, anrMessage); } } 

其中anrMessage的内容为”executing service [发送超时serviceRecord信息]”;

2.2 BroadcastQueue Timeout

BroadcastQueue Timeout触发时机,简单说就是BroadcastQueue中的mHandler收到BROADCAST_TIMEOUT_MSG消息时触发。

在前面文章Android Broadcast广播机制分析详细介绍广播启动流程,在发送广播过程中会执行scheduleBroadcastsLocked方法来处理相关的广播,然后会调用到processNextBroadcast方法来处理下一条广播。

processNextBroadcast执行过程分4步骤:

  • step1. 处理并行广播
  • step2. 处理当前有序广播
  • step3. 获取下条有序广播
  • step4. 处理下条有序广播

2.2.1 processNextBroadcast

[-> BroadcastQueue.java]

final void processNextBroadcast(boolean fromMsg) { synchronized(mService) { ... //step 2: 处理当前有序广播 do { r = mOrderedBroadcasts.get(0); //获取所有该广播所有的接收者 int numReceivers = (r.receivers != null) ? r.receivers.size() : 0; if (mService.mProcessesReady && r.dispatchTime > 0) { long now = SystemClock.uptimeMillis(); if ((numReceivers > 0) && (now > r.dispatchTime + (2*mTimeoutPeriod*numReceivers))) { //当广播处理时间超时,则强制结束这条广播【见小节2.2.5】 broadcastTimeoutLocked(false); ... } } if (r.receivers == null || r.nextReceiver >= numReceivers || r.resultAbort || forceReceive) { if (r.resultTo != null) { //处理广播消息消息 performReceiveLocked(r.callerApp, r.resultTo, new Intent(r.intent), r.resultCode, r.resultData, r.resultExtras, false, false, r.userId); r.resultTo = null; } //取消BROADCAST_TIMEOUT_MSG消息【见小节2.2.3】 cancelBroadcastTimeoutLocked(); } } while (r == null); ... //step 3: 获取下条有序广播 r.receiverTime = SystemClock.uptimeMillis(); if (!mPendingBroadcastTimeoutMessage) { long timeoutTime = r.receiverTime + mTimeoutPeriod; //设置广播超时时间,发送BROADCAST_TIMEOUT_MSG【见小节2.2.2】 setBroadcastTimeoutLocked(timeoutTime); } ... } } 

对于广播超时处理时机:

  1. 首先在step3的过程中setBroadcastTimeoutLocked(timeoutTime) 设置超时广播消息;
  2. 然后在step2根据广播处理情况来处理:
    • 当广播接收者等待时间过长,则调用broadcastTimeoutLocked(false);
    • 当,cancelBroadcastTimeoutLocked

2.2.2 setBroadcastTimeoutLocked

final void setBroadcastTimeoutLocked(long timeoutTime) { if (! mPendingBroadcastTimeoutMessage) { Message msg = mHandler.obtainMessage(BROADCAST_TIMEOUT_MSG, this); mHandler.sendMessageAtTime(msg, timeoutTime); mPendingBroadcastTimeoutMessage = true; } } 

设置定时广播BROADCAST_TIMEOUT_MSG,即当前往后推mTimeoutPeriod时间广播还没处理完毕,则进入广播超时流程。

  • 对于前台广播,则超时为BROADCAST_FG_TIMEOUT,即timeout=10s;
  • 对于后台广播,则超时为BROADCAST_BG_TIMEOUT,即timeout=60s。

2.2.3 cancelBroadcastTimeoutLocked

final void cancelBroadcastTimeoutLocked() { if (mPendingBroadcastTimeoutMessage) { mHandler.removeMessages(BROADCAST_TIMEOUT_MSG, this); mPendingBroadcastTimeoutMessage = false; } } 

移除广播超时消息BROADCAST_TIMEOUT_MSG

2.2.4 BROADCAST_TIMEOUT_MSG

到此不难理解,当BROADCAST_TIMEOUT_MSG消息成功发送时,则AMS中的mHandler收到该消息则触发调用serviceTimeout

private final class BroadcastHandler extends Handler { public void handleMessage(Message msg) { switch (msg.what) { case BROADCAST_TIMEOUT_MSG: { synchronized (mService) { //【见小节2.2.5】 broadcastTimeoutLocked(true); } } break; ... } ... } } 

2.2.5 broadcastTimeoutLocked

[-> BroadcastRecord.java]

final void broadcastTimeoutLocked(boolean fromMsg) { if (fromMsg) { mPendingBroadcastTimeoutMessage = false; } if (mOrderedBroadcasts.size() == 0) { return; } long now = SystemClock.uptimeMillis(); BroadcastRecord r = mOrderedBroadcasts.get(0); if (fromMsg) { if (mService.mDidDexOpt) { //延迟timeouts直到dexopt结束 mService.mDidDexOpt = false; long timeoutTime = SystemClock.uptimeMillis() + mTimeoutPeriod; setBroadcastTimeoutLocked(timeoutTime); return; } if (!mService.mProcessesReady) { //当系统还没有准备就绪时,广播处理流程中不存在广播超时 return; } long timeoutTime = r.receiverTime + mTimeoutPeriod; if (timeoutTime > now) { //过早的timeout,重新设置广播超时 setBroadcastTimeoutLocked(timeoutTime); return; } } BroadcastRecord br = mOrderedBroadcasts.get(0); if (br.state == BroadcastRecord.WAITING_SERVICES) { //广播已经处理完成,但需要等待已启动service执行完成。当等待足够时间,则处理下一条广播。 br.curComponent = null; br.state = BroadcastRecord.IDLE; processNextBroadcast(false); return; } r.receiverTime = now; //当前BroadcastRecord的anr次数执行加1操作 r.anrCount++; if (r.nextReceiver <= 0) { return; } ProcessRecord app = null; String anrMessage = null; Object curReceiver = r.receivers.get(r.nextReceiver-1); //根据情况记录广播接收者丢弃的EventLog logBroadcastReceiverDiscardLocked(r); if (curReceiver instanceof BroadcastFilter) { BroadcastFilter bf = (BroadcastFilter)curReceiver; if (bf.receiverList.pid != 0 && bf.receiverList.pid != ActivityManagerService.MY_PID) { synchronized (mService.mPidsSelfLocked) { app = mService.mPidsSelfLocked.get( bf.receiverList.pid); } } } else { app = r.curApp; } if (app != null) { anrMessage = "Broadcast of " + r.intent.toString(); } if (mPendingBroadcast == r) { mPendingBroadcast = null; } //继续移动到下一个广播接收者 finishReceiverLocked(r, r.resultCode, r.resultData, r.resultExtras, r.resultAbort, false); scheduleBroadcastsLocked(); if (anrMessage != null) { //【见小节2.2.6】 mHandler.post(new AppNotResponding(app, anrMessage)); } } 

2.2.6 AppNotResponding

[-> BroadcastQueue.java]

private final class AppNotResponding implements Runnable { ... public void run() { //【见小节3.1】 mService.appNotResponding(mApp, null, null, false, mAnnotation); } } 

2.3 ContentProvider Timeout

2.3.1 AMS.appNotRespondingViaProvider

public void appNotRespondingViaProvider(IBinder connection) { enforceCallingPermission( android.Manifest.permission.REMOVE_TASKS, "appNotRespondingViaProvider()"); final ContentProviderConnection conn = (ContentProviderConnection) connection; if (conn == null) { return; } final ProcessRecord host = conn.provider.proc; //无法找到provider所处的进程 if (host == null) { return; } final long token = Binder.clearCallingIdentity(); try { //【见小节3.1】 appNotResponding(host, null, null, false, "ContentProvider not responding"); } finally { Binder.restoreCallingIdentity(token); } } 

Timeout时间20s

调用链:

ContentProviderClient.NotRespondingRunnable.run ContextImpl.ApplicationContentResolver.appNotRespondingViaProvider ActivityThread.appNotRespondingViaProvider AMP.appNotRespondingViaProvider AMS.appNotRespondingViaProvider 

2.4 inputDispatching Timeout

在native层InputDispatcher.cpp中经过层层调用,此处先省略过程,后续再展开,从native层com_android_server_input_InputManagerService调用到java层InputManagerService。

2.4.1 IMS.notifyANR

[-> InputManagerService.java]

private long notifyANR(InputApplicationHandle inputApplicationHandle, InputWindowHandle inputWindowHandle, String reason) { //【见小节2.4.2】 return mWindowManagerCallbacks.notifyANR( inputApplicationHandle, inputWindowHandle, reason); } 

mWindowManagerCallbacks为InputMonitor对象

2.4.2 notifyANR

[-> InputMonitor.java]

public long notifyANR(InputApplicationHandle inputApplicationHandle, InputWindowHandle inputWindowHandle, String reason) { AppWindowToken appWindowToken = null; WindowState windowState = null; boolean aboveSystem = false; synchronized (mService.mWindowMap) { if (inputWindowHandle != null) { windowState = (WindowState) inputWindowHandle.windowState; if (windowState != null) { appWindowToken = windowState.mAppToken; } } if (appWindowToken == null && inputApplicationHandle != null) { appWindowToken = (AppWindowToken)inputApplicationHandle.appWindowToken; } //输出input事件分发超时log if (windowState != null) { Slog.i(WindowManagerService.TAG, "Input event dispatching timed out " + "sending to " + windowState.mAttrs.getTitle() + ". Reason: " + reason); int systemAlertLayer = mService.mPolicy.windowTypeToLayerLw( WindowManager.LayoutParams.TYPE_SYSTEM_ALERT); aboveSystem = windowState.mBaseLayer > systemAlertLayer; } else if (appWindowToken != null) { Slog.i(WindowManagerService.TAG, "Input event dispatching timed out " + "sending to application " + appWindowToken.stringName + ". Reason: " + reason); } else { Slog.i(WindowManagerService.TAG, "Input event dispatching timed out " + ". Reason: " + reason); } mService.saveANRStateLocked(appWindowToken, windowState, reason); } if (appWindowToken != null && appWindowToken.appToken != null) { //【见小节2.5.1】 boolean abort = appWindowToken.appToken.keyDispatchingTimedOut(reason); if (! abort) { return appWindowToken.inputDispatchingTimeoutNanos; } } else if (windowState != null) { //AMP经过binder,最终调用到AMS【见小节2.4.3】 long timeout = ActivityManagerNative.getDefault().inputDispatchingTimedOut( windowState.mSession.mPid, aboveSystem, reason); if (timeout >= 0) { return timeout * 1000000L; //转化为纳秒 } } return 0; } 

2.4.3 AMS.inputDispatchingTimedOut

public long inputDispatchingTimedOut(int pid, final boolean aboveSystem, String reason) { ... ProcessRecord proc; long timeout; synchronized (this) { synchronized (mPidsSelfLocked) { proc = mPidsSelfLocked.get(pid); //根据pid查看进程record } timeout = getInputDispatchingTimeoutLocked(proc); } //【见小节2.4.4】 if (!inputDispatchingTimedOut(proc, null, null, aboveSystem, reason)) { return -1; } return timeout; } 

inputDispatching的超时为KEY_DISPATCHING_TIMEOUT,即timeout = 5s

2.4.4 AMS.inputDispatchingTimedOut

public boolean inputDispatchingTimedOut(final ProcessRecord proc, final ActivityRecord activity, final ActivityRecord parent, final boolean aboveSystem, String reason) { ... final String annotation; if (reason == null) { annotation = "Input dispatching timed out"; } else { annotation = "Input dispatching timed out (" + reason + ")"; } if (proc != null) { ... mHandler.post(new Runnable() { public void run() { //【见小节3.1】 appNotResponding(proc, activity, parent, aboveSystem, annotation); } }); } return true; } 

调用链:

InputManagerService.notifyANR InputMonitor.notifyANR AMP.inputDispatchingTimedOut AMS.inputDispatchingTimedOut 

2.5 keyDispatching Timeout

keyDispatching timout与inputDispatching Timeout流畅基本一致。

调用链:

InputManagerService.notifyANR InputMonitor.notifyANR ActivityRecord.Token.keyDispatchingTimedOut AMS.inputDispatchingTimedOut 

Token.keyDispatchingTimedOut

[-> ActivityRecord.java]

final class ActivityRecord { static class Token extends IApplicationToken.Stub { public boolean keyDispatchingTimedOut(String reason) { ActivityRecord r; ActivityRecord anrActivity; ProcessRecord anrApp; synchronized (mService) { r = tokenToActivityRecordLocked(this); if (r == null) { return false; } anrActivity = r.getWaitingHistoryRecordLocked(); anrApp = r != null ? r.app : null; } return mService.inputDispatchingTimedOut(anrApp, anrActivity, r, false, reason); } ... } } 

对于keyDispatching Timeout的ANR,当触发该类型ANR时,如果不再有输入事件,则不会弹出ANR对话框;只有在下一次input事件产生后5s才弹出ANR提示框。

三、ANR工作

3.1 appNotResponding

[-> ActivityManagerService.java]

final void appNotResponding(ProcessRecord app, ActivityRecord activity, ActivityRecord parent, boolean aboveSystem, final String annotation) { ArrayList<Integer> firstPids = new ArrayList<Integer>(5); SparseArray<Boolean> lastPids = new SparseArray<Boolean>(20); if (mController != null) { try { // 0 == continue, -1 = kill process immediately int res = mController.appEarlyNotResponding(app.processName, app.pid, annotation); if (res < 0 && app.pid != MY_PID) { app.kill("anr", true); } } catch (RemoteException e) { mController = null; Watchdog.getInstance().setActivityController(null); } } long anrTime = SystemClock.uptimeMillis(); if (MONITOR_CPU_USAGE) { updateCpuStatsNow(); } synchronized (this) { // PowerManager.reboot() 会阻塞很长时间,因此忽略关机时的ANR if (mShuttingDown) { Slog.i(TAG, "During shutdown skipping ANR: " + app + " " + annotation); return; } else if (app.notResponding) { Slog.i(TAG, "Skipping duplicate ANR: " + app + " " + annotation); return; } else if (app.crashing) { Slog.i(TAG, "Crashing app skipping ANR: " + app + " " + annotation); return; } app.notResponding = true; //记录ANR EventLog.writeEvent(EventLogTags.AM_ANR, app.userId, app.pid, app.processName, app.info.flags, annotation); // Dump thread traces as quickly as we can, starting with "interesting" processes. firstPids.add(app.pid); int parentPid = app.pid; if (parent != null && parent.app != null && parent.app.pid > 0) parentPid = parent.app.pid; if (parentPid != app.pid) firstPids.add(parentPid); if (MY_PID != app.pid && MY_PID != parentPid) firstPids.add(MY_PID); for (int i = mLruProcesses.size() - 1; i >= 0; i--) { ProcessRecord r = mLruProcesses.get(i); if (r != null && r.thread != null) { int pid = r.pid; if (pid > 0 && pid != app.pid && pid != parentPid && pid != MY_PID) { if (r.persistent) { firstPids.add(pid); } else { lastPids.put(pid, Boolean.TRUE); } } } } } //输出ANR到main log. StringBuilder info = new StringBuilder(); info.setLength(0); info.append("ANR in ").append(app.processName); if (activity != null && activity.shortComponentName != null) { info.append(" (").append(activity.shortComponentName).append(")"); } info.append("\n"); info.append("PID: ").append(app.pid).append("\n"); if (annotation != null) { info.append("Reason: ").append(annotation).append("\n"); } if (parent != null && parent != activity) { info.append("Parent: ").append(parent.shortComponentName).append("\n"); } final ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true); //dump栈信息 File tracesFile = dumpStackTraces(true, firstPids, processCpuTracker, lastPids, NATIVE_STACKS_OF_INTEREST); String cpuInfo = null; if (MONITOR_CPU_USAGE) { updateCpuStatsNow(); synchronized (mProcessCpuTracker) { //输出各个进程的CPU使用情况 cpuInfo = mProcessCpuTracker.printCurrentState(anrTime); } //输出CPU负载 info.append(processCpuTracker.printCurrentLoad()); info.append(cpuInfo); } info.append(processCpuTracker.printCurrentState(anrTime)); Slog.e(TAG, info.toString()); if (tracesFile == null) { //发送signal 3来dump栈信息 Process.sendSignal(app.pid, Process.SIGNAL_QUIT); } //将anr信息添加到dropbox addErrorToDropBox("anr", app, app.processName, activity, parent, annotation, cpuInfo, tracesFile, null); if (mController != null) { try { // 0 == show dialog, 1 = keep waiting, -1 = kill process immediately int res = mController.appNotResponding(app.processName, app.pid, info.toString()); if (res != 0) { if (res < 0 && app.pid != MY_PID) { app.kill("anr", true); } else { synchronized (this) { mServices.scheduleServiceTimeoutLocked(app); } } return; } } catch (RemoteException e) { mController = null; Watchdog.getInstance().setActivityController(null); } } boolean showBackground = Settings.Secure.getInt(mContext.getContentResolver(), Settings.Secure.ANR_SHOW_BACKGROUND, 0) != 0; synchronized (this) { mBatteryStatsService.noteProcessAnr(app.processName, app.uid); if (!showBackground && !app.isInterestingToUserLocked() && app.pid != MY_PID) { app.kill("bg anr", true); return; } // Set the app's notResponding state, and look up the errorReportReceiver makeAppNotRespondingLocked(app, activity != null ? activity.shortComponentName : null, annotation != null ? "ANR " + annotation : "ANR", info.toString()); //弹出ANR对话框 Message msg = Message.obtain(); HashMap<String, Object> map = new HashMap<String, Object>(); msg.what = SHOW_NOT_RESPONDING_MSG; msg.obj = map; msg.arg1 = aboveSystem ? 1 : 0; map.put("app", app); if (activity != null) { map.put("activity", activity); } mUiHandler.sendMessage(msg); } } 

主要发送ANR, 则会输出

  • 各个进程的CPU使用情况;
  • CPU负载;
  • IOWait;
  • traces文件

四、其他

导致ANR常见情形:

  • I/O阻塞
  • 网络阻塞;
  • onReceiver执行时间超过10s;
  • 多线程死锁

避免ANR:

  • UI线程尽量只做跟UI相关的工作
  • 耗时的工作()比如数据库操作,I/O,网络操作),采用单独的工作线程处理
  • 用Handler来处理UIthread和工作thread的交互

UI线程,例如:

  • Activity:onCreate(), onResume(), onDestroy(), onKeyDown(), onClick(),etc
  • AsyncTask: onPreExecute(), onProgressUpdate(), onPostExecute(), onCancel,etc
  • Mainthread handler: handleMessage(), post*(runnable r), etc

ANR分析:需要关注CPU/IO,trace死锁等数据。

 

 

 

五、遇到ANR怎么分析

导出trace文件

 

如果ANR发生,对应的应用会收到SIGQUIT异常终止信号,dalvik虚拟机就会自动在/data/anr/目录下生成trace.txt文件,这个文件记录了在发生ANR时刻系统各个线程的执行状态,获取这个文件是不需要root权限的,因此首先需要做的就是通过adb pull命令将这个文件导出并等待分析。

trace文件格式解析

导出trace文件后,可以看到类似于如下的文件内容:

----- pid 901 at 2015-11-28 14:38:34 -----
Cmd line: system_server

JNI: CheckJNI is off; workarounds are off; pins=6; globals=2154 (plus 409 weak)

DALVIK THREADS:
(mutexes: tll=0 tsl=0 tscl=0 ghl=0)

"main" prio=5 tid=1 NATIVE
  | group="main" sCount=1 dsCount=0 obj=0x415a4e88 self=0x414c48d8
  | sysTid=901 nice=-2 sched=0/0 cgrp=apps handle=1073926484
  | state=S schedstat=( 303590361913 618664734427 651535 ) utm=19466 stm=10893 core=0
  #00  pc 00021914  /system/lib/libc.so (epoll_wait+12)
  #01  pc 0001065f  /system/lib/libutils.so (android::Looper::pollInner(int)+98)
  #02  pc 00010889  /system/lib/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+92)
  #03  pc 0006b771  /system/lib/libandroid_runtime.so (android::NativeMessageQueue::pollOnce(_JNIEnv*, int)+22)
  #04  pc 0002034c  /system/lib/libdvm.so (dvmPlatformInvoke+112)
  #05  pc 00050fcf  /system/lib/libdvm.so (dvmCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*)+398)
  #06  pc 00000214  /dev/ashmem/dalvik-jit-code-cache (deleted)
  at android.os.MessageQueue.nativePollOnce(Native Method)
  at android.os.MessageQueue.next(MessageQueue.java:138)
  at android.os.Looper.loop(Looper.java:196)
  at com.android.server.ServerThread.initAndLoop(SystemServer.java:1174)
  at com.android.server.SystemServer.main(SystemServer.java:1271)
  at java.lang.reflect.Method.invokeNative(Native Method)
  at java.lang.reflect.Method.invoke(Method.java:515)
  at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:878)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:694)
  at dalvik.system.NativeStart.main(Native Method)

首先需要了解这些参数表示的意义,我们挑其中关键的几个说明:

  • 第一二行

    ----- pid 901 at 2015-11-28 14:38:34 -----
    Cmd line: system_server
    

    说明了发生ANR的进程id、时间和进程名称。

  • 后面三行是线程的基本信息

    JNI: CheckJNI is off; workarounds are off; pins=6; globals=2154 (plus 409 weak)
    
    DALVIK THREADS:
    (mutexes: tll=0 tsl=0 tscl=0 ghl=0)
    

    其中tll、tsl、tscl、ghl、hwl、hwll分别对应:thread list lock, thread suspend lock, thread suspend count lock, gc heap lock, heap worker lock和heap worker list lock。

  • main prio=5 tid=1 NATIVE说明了线程名称、线程的优先级、线程锁id和线程状态。线程名称是启动线程的时候手动指明的,这里的main标识是主线程,是Android自动设定的一个线程名称,如果是自己手动创建的线程,一般会被命名成“Thread-xx”的格式,其中xx是线程id,它只增不减不会被复用;注意这其中的tid不是线程的id,它是一个在Java虚拟机中用来实现线程锁的变量,随着线程的增减,这个变量的值是可能被复用的;最后线程的状态还分为如下几种

    状态说明
    THREAD_ZOMBIE0TERMINATED
    THREAD_RUNNING1RUNNABLE or running now
    THREAD_TIMED_WAIT2TIMED_WAITING in Object.wait()
    THREAD_MONITOR3BLOCKED on a monitor
    THREAD_INITIALIZING5allocated not yet running
    THREAD_STARTING6started not yet on thread list
    THREAD_NATIVE7off in a JNI native method
    THREAD_VMWAIT8waiting on a VM resource
    THREAD_SUSPENDED9suspended usually by GC or debugger

    特别说明一下MONITOR状态和SUSPEND状态,MONITOR状态一般是类的同步块或者同步方法造成的,SUSPENDED状态在debugger的时候会出现,可以用来区别是不是真的是用户正常操作跑出了ANR。

  • 后面一行

    | group="main" sCount=1 dsCount=0 obj=0x415a4e88 self=0x414c48d8
    

    group是线程组名称。sCount是此线程被挂起的次数,dsCount是线程被调试器挂起的次数,当一个进程被调试后,sCount会重置为0,调试完毕后sCount会根据是否被正常挂起增长,但是dsCount不会被重置为0,所以dsCount也可以用来判断这个线程是否被调试过。obj表示这个线程的Java对象的地址,self表示这个线程本身的地址。

  • 此后是线程的调度信息

    sysTid=901 nice=-2 sched=0/0 cgrp=apps handle=1073926484
    

    sysTid是Linux下的内核线程id,nice是线程的调度优先级,sched分别标志了线程的调度策略和优先级,cgrp是调度属组,handle是线程的处理函数地址。

  • 然后是线程当前上下文信息

    state=S schedstat=( 303590361913 618664734427 651535 ) utm=19466 stm=10893 core=0
    

    state是调度状态;schedstat从 /proc/[pid]/task/[tid]/schedstat读出,三个值分别表示线程在cpu上执行的时间、线程的等待时间和线程执行的时间片长度,有的android内核版本不支持这项信息,得到的三个值都是0;utm是线程用户态下使用的时间值(单位是jiffies);stm是内核态下的调度时间值;core是最后执行这个线程的cpu核的序号。

  • 最后就是这个线程的调用栈信息。

通过分析trace文件得到ANR信息

通过上面分析,可以看到trace文件的头部就包含了很多与该线程相关的信息,但是并不是每个信息我们都必须弄懂,排查ANR的时候只需要找到其中关键的几个信息即可。一般可以通过以下几个简单的方法来判断。

trace文件顶部的线程一般是ANR的元凶

这是一个简单的方法,但是大部分情况下都适用,可以通过这个方法来快速判断是否是自己的应用造成了本次ANR。说明以下,并不是trace文件包含的应用就一定是造成ANR的帮凶,应用出现在trace文件中,只能说明出现ANR的时候这个应用进程还活着,trace文件的顶部则是触发ANR的应用信息。因此,如果你的应用出现在了trace文件的顶部,那么很有可能是因为你的应用造成了ANR,否则是你的应用造成ANR的可能性不大,但是具体是不是还需要进一步分析。例如:

    ----- pid 1182 at 2015-11-26 01:53:34 -----
    Cmd line: system_server

    JNI: CheckJNI is off; workarounds are off; pins=5; globals=2982 (plus 135 weak)

    DALVIK THREADS:
    (mutexes: tll=0 tsl=0 tscl=0 ghl=0)

    "main" prio=5 tid=1 NATIVE
      | group="main" sCount=1 dsCount=0 obj=0x420a0e58 self=0x4208f918
      | sysTid=1182 nice=-2 sched=0/0 cgrp=apps handle=1074594132
      | state=S schedstat=( 211672310629 149959255867 472114 ) utm=13047 stm=8120 core=1
      #00  pc 000218b8  /system/lib/libc.so (epoll_wait+12)
      ...
     at android.os.MessageQueue.nativePollOnce(Native Method)
      at android.os.MessageQueue.next(MessageQueue.java:138)
      at android.os.Looper.loop(Looper.java:123)
      at com.android.server.ServerThread.initAndLoop(SystemServer.java:1213)
      at com.android.server.SystemServer.main(SystemServer.java:1317)
      at java.lang.reflect.Method.invokeNative(Native Method)
      at java.lang.reflect.Method.invoke(Method.java:515)
      at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:793)
      at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:609)
      at dalvik.system.NativeStart.main(Native Method)

      ...

    ----- end 1182 -----

     ----- pid 18927 at 2015-11-26 01:53:34 -----
    Cmd line: com.android.example

    JNI: CheckJNI is off; workarounds are off; pins=0; globals=465 (plus 984 weak)

    DALVIK THREADS:
    (mutexes: tll=0 tsl=0 tscl=0 ghl=0)

    "main" prio=5 tid=1 NATIVE
      | group="main" sCount=1 dsCount=0 obj=0x420a0e58 self=0x4208f918
      | sysTid=18927 nice=-6 sched=0/0 cgrp=apps handle=1074594132
      | state=S schedstat=( 7748840431407 1615931922290 9994018 ) utm=712375 stm=62509 core=1
      #00  pc 00020704  /system/lib/libc.so (__ioctl+8)
      #01  pc 0002cfa3  /system/lib/libc.so (ioctl+14)
      #02  pc 0001d3ed  /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+140)
      #03  pc 0001d8d7  /system/lib/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+42)
      #04  pc 0001dadf  /system/lib/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+118)
      #05  pc 00019791  /system/lib/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+30)
      ...
      #23  pc 00000d80  /system/bin/app_process
      at android.os.BinderProxy.transact(Native Method)
      at android.app.IAlarmManager$Stub$Proxy.set(IAlarmManager.java:154)
      at android.app.AlarmManager.setImpl(AlarmManager.java:369)
      at android.app.AlarmManager.setRepeating(AlarmManager.java:258)
      at android.os.Handler.handleCallback(Handler.java:733)
      at android.os.Handler.dispatchMessage(Handler.java:95)
      at android.os.Looper.loop(Looper.java:136)
      at android.app.ActivityThread.main(ActivityThread.java:5072)
      at java.lang.reflect.Method.invokeNative(Native Method)
      at java.lang.reflect.Method.invoke(Method.java:515)
      at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:793)
      at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:609)
      at dalvik.system.NativeStart.main(Native Method)

虽然应用com.android.example出现在了trace文件中,但是在ANR的时候它在通过IPCThread在进行进程间通信,而此次ANR发生于system_server获取用户事件的native方法里面,并不是我们的应用造成了ANR。又例如下面的trace文件顶部内容为:

    ----- pid 13406 at 2015-11-27 11:46:14 -----
    Cmd line: com.android.example

    JNI: CheckJNI is off; workarounds are off; pins=0; globals=536 (plus 102 weak)

    DALVIK THREADS:
    (mutexes: tll=0 tsl=0 tscl=0 ghl=0)

    "main" prio=5 tid=1 SUSPENDED
      | group="main" sCount=1 dsCount=0 obj=0x41795e58 self=0x416b58b0
      | sysTid=13406 nice=-6 sched=0/0 cgrp=apps handle=1074557268
      | state=S schedstat=( 2352435524847 736727917292 2633566 ) utm=213075 stm=22168 core=1
      at java.lang.String.<init>(String.java:~261)
      at java.util.zip.ZipEntry.<init>(ZipEntry.java:392)
      at java.util.zip.ZipFile.readCentralDir(ZipFile.java:414)
      at java.util.zip.ZipFile.<init>(ZipFile.java:151)
      at java.util.zip.ZipFile.<init>(ZipFile.java:123)
      at com.android.example.Utility.isValideFile(Utility.java:2700)
      at com.android.example.Test.getPath(Test.java:243)
      at com.android.example.Test$1.run(Test.java:531)
      at android.os.Handler.handleCallback(Handler.java:733)
      at android.os.Handler.dispatchMessage(Handler.java:95)
      at android.os.Looper.loop(Looper.java:136)
      at android.app.ActivityThread.main(ActivityThread.java:5050)
      at java.lang.reflect.Method.invokeNative(Native Method)
      at java.lang.reflect.Method.invoke(Method.java:515)
      at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:807)
      at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:623)
      at dalvik.system.NativeStart.main(Native Method)
      ...

这种情况说明ANR发生于com.android.example应用中,而且指明了ANR发生时代码的执行位置,这种情况十有八九就是我们应用程序的问题,之后就需要通过这个trace文件指明的路径来对代码进行排查。

注意死锁和等待

虽然说ANR一般情况是由于让主线程做了很多耗时的操作,但是死锁或者主线程等待也是ANR高发的原因,例如如下的trace:

----- pid 9436 at 2015-11-28 21:30:41 -----
Cmd line: com.example.yxz.myapplication

JNI: CheckJNI is off; workarounds are off; pins=0; globals=277

DALVIK THREADS:
(mutexes: tll=0 tsl=0 tscl=0 ghl=0)

"main" prio=5 tid=1 MONITOR
  | group="main" sCount=1 dsCount=0 obj=0x415a4e88 self=0x414c48d8
  | sysTid=9436 nice=0 sched=0/0 cgrp=apps handle=1073926484
  | state=S schedstat=( 671264662 337280259 1005 ) utm=53 stm=14 core=0
  at com.example.yxz.myapplication.performancetest.WaitANR$InnerMonitorClass.TimeConsumeFunc(WaitANR.java:~48)
  - waiting to lock <0x447a5670>  held by tid=11 (Thread-14208)
  at com.example.yxz.myapplication.performancetest.WaitANR$2.run(WaitANR.java:32)
  at android.os.Handler.handleCallback(Handler.java:733)
  at android.os.Handler.dispatchMessage(Handler.java:95)
  at android.os.Looper.loop(Looper.java:212)
  at android.app.ActivityThread.main(ActivityThread.java:5135)
  at java.lang.reflect.Method.invokeNative(Native Method)
  at java.lang.reflect.Method.invoke(Method.java:515)
  at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:878)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:694)
  at dalvik.system.NativeStart.main(Native Method)

....

"Thread-14208" prio=10 tid=11 TIMED_WAIT
  | group="main" sCount=1 dsCount=0 obj=0x447a4b98 self=0x78296bb8
  | sysTid=9955 nice=-8 sched=0/0 cgrp=apps handle=2015978016
  | state=S schedstat=( 946045 640869 1 ) utm=0 stm=0 core=2
  at java.lang.VMThread.sleep(Native Method)
  at java.lang.Thread.sleep(Thread.java:1013)
  at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:331)
  at com.example.yxz.myapplication.performancetest.WaitANR$InnerMonitorClass.TimeConsumeFunc(WaitANR.java:48)
  at com.example.yxz.myapplication.performancetest.WaitANR$1.run(WaitANR.java:20)
  at java.lang.Thread.run(Thread.java:841)

从trace文件可以看出,发生ANR的主线程正处于monitor状态,也就是它在等待一个synchronized块或者方法,但是目前这个monitor正在被tid=11的线程持有,所以造成了主线程被阻塞,从而发生了ANR。死锁的分析也是类似,发生死锁的线程一般处于MONITOR状态或者WAIT状态,等待其他进程的锁或者monitor,而其他进程又在等待另外线程的锁或者monitor,一直这样依赖下去,直到形成一个环。

    原文作者:Android源码分析
    原文地址: https://blog.csdn.net/jackwang_dev/article/details/78287651
    本文转自网络文章,转载此文章仅为分享知识,如有侵权,请联系博主进行删除。
点赞