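I. A watchdog timeout happens in one of two ways:
1. A specific thread in system_server is so busy that the check task the watchdog posts to that thread's handler never gets to run, and the watchdog times out. This only happens when system_server is extremely busy (for example stuck in an infinite loop or a deadlock) and the handler cannot process one newly posted task for a full minute. It is relatively rare.
2. Deadlock. Most of the watchdog issues I handled before were of this kind.
(PS: the watchdog issues I handled before were all deadlocks in the Java layer, which are easy to resolve as long as the stack traces are complete. This time the deadlock was between a Java lock and a native lock.)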
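The two cases above can be sketched in plain Java. This is only a simplified model, not the real com.android.server.Watchdog: the class WatchdogSketch, the method probe and the use of a single-thread ExecutorService in place of a Handler thread are all assumptions made for illustration, and the timeouts are shortened from the one minute mentioned above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of a watchdog check; not the Android implementation.
final class WatchdogSketch {
    /**
     * Posts a probe to the worker and waits up to timeoutMs for it to finish.
     * The probe only acquires and releases monitorLock, so it fails either when
     * the worker never gets to run it (case 1) or when the lock is held by
     * another thread (case 2).
     */
    static boolean probe(ExecutorService worker, Object monitorLock, long timeoutMs) {
        Future<?> done = worker.submit(() -> {
            synchronized (monitorLock) {
                // Getting here at all is the health check.
            }
        });
        try {
            done.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // a real watchdog would dump stacks at this point
        } catch (InterruptedException | ExecutionException e) {
            return false;
        }
    }

    /** Probes an idle worker, then probes again while another thread holds the lock. */
    static boolean[] demo() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Object lock = new Object();
        boolean idleOk = probe(worker, lock, 2000);

        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try { release.await(); } catch (InterruptedException ignored) { }
            }
        });
        holder.start();
        boolean blockedOk;
        try {
            held.await();                       // make sure the lock really is held
            blockedOk = probe(worker, lock, 200);
        } catch (InterruptedException e) {
            blockedOk = false;
        } finally {
            release.countDown();                // let the holder and the probe finish
            worker.shutdown();
        }
        return new boolean[] { idleOk, blockedOk };
    }
}
```

In this model WatchdogSketch.demo() reports success for the idle worker and a timeout while the lock is held, which corresponds to the "Blocked in monitor" style of report.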
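II. The problem
1. Symptom:
In the camera's wide-aperture mode, taking a photo with the volume-down key causes the device to reboot automatically.
2. Watchdog message: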
Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ui thread (android.ui), Blocked in handler on ActivityManager (ActivityManager)
3. Trace and analysis:
----- pid 1185 at 2017-05-08 13:53:05 -----
Cmd line: system_server
"android.fg" prio=5 tid=14 Blocked
| group="main" sCount=1 dsCount=0 obj=0x12cc4cf0 self=0x7f94e92e00
| sysTid=1243 nice=0 cgrp=default sched=0/0 handle=0x7f8a4df440
| state=S schedstat=( 15683722127 21550750622 33957 ) utm=857 stm=711 core=0 HZ=100
| stack=0x7f8a3dd000-0x7f8a3df000 stackSize=1037KB
| held mutexes=
at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:24115)
- waiting to lock <0x0b129b42> (a com.android.server.am.ActivityManagerService) held by thread 99
at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:214)
at android.os.Handler.handleCallback(Handler.java:815)
at android.os.Handler.dispatchMessage(Handler.java:104)
at android.os.Looper.loop(Looper.java:207)
at android.os.HandlerThread.run(HandlerThread.java:61)
at com.android.server.ServiceThread.run(ServiceThread.java:46)
The stack above shows that android.fg is waiting for lock 0x0b129b42, which is held by the thread with tid=99. (android.ui and ActivityManager are waiting for the same lock.)
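When a trace file is long, the "waiting to lock ... held by thread N" lines can be pulled out mechanically. The following is a minimal sketch that assumes the trace uses the same line layout as the dump above; the class name TraceWaits and the output format are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes the ART trace layout shown in this post.
final class TraceWaits {
    // Thread header, e.g.: "android.fg" prio=5 tid=14 Blocked
    private static final Pattern HEADER =
            Pattern.compile("^\"([^\"]+)\" .*\\btid=\\d+\\b.*$");
    // e.g.: - waiting to lock <0x0b129b42> (a ...) held by thread 99
    private static final Pattern WAITING =
            Pattern.compile("waiting to lock <(0x[0-9a-f]+)> .* held by thread (\\d+)");

    /** Maps each blocked thread name to "<lock address> held by tid <n>". */
    static Map<String, String> blockedThreads(String trace) {
        Map<String, String> out = new LinkedHashMap<>();
        String current = null;
        for (String line : trace.split("\n")) {
            Matcher header = HEADER.matcher(line.trim());
            if (header.matches()) {
                current = header.group(1); // remember which thread the next frames belong to
                continue;
            }
            Matcher waiting = WAITING.matcher(line);
            if (current != null && waiting.find()) {
                out.put(current, waiting.group(1) + " held by tid " + waiting.group(2));
            }
        }
        return out;
    }
}
```

Feeding it the android.fg block above yields one entry, android.fg mapped to "0x0b129b42 held by tid 99". Threads that only hold locks, such as Binder_C, do not appear in the result.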
"Binder_C" prio=5 tid=99 Native
| group="main" sCount=1 dsCount=0 obj=0x13a6d0a0 self=0x7f87e4c800
| sysTid=2194 nice=-4 cgrp=default sched=0/0 handle=0x7f7f7f1440
| state=S schedstat=( 167348256976 82664258066 457250 ) utm=10416 stm=6318 core=9 HZ=100
| stack=0x7f7f6f5000-0x7f7f6f7000 stackSize=1013KB
| held mutexes=
kernel: (couldn't read /proc/self/task/2194/stack)
native: #00 pc 000000000001c02c /system/lib64/libc.so (syscall+28)
native: #01 pc 0000000000068394 /system/lib64/libc.so (_ZL33__pthread_mutex_lock_with_timeoutP24pthread_mutex_internal_tPK8timespeci.constprop.0+484)
native: #02 pc 000000000006863c /system/lib64/libc.so (pthread_mutex_lock+36)
native: #03 pc 000000000002eb48 /system/lib64/libinputflinger.so (_ZN7android15InputDispatcher15setInputWindowsERKNS_6VectorINS_2spINS_17InputWindowHandleEEEEE+88)
native: #04 pc 0000000000014c44 /system/lib64/libandroid_servers.so (_ZN7android18NativeInputManager15setInputWindowsEP7_JNIEnvP13_jobjectArray+336)
native: #05 pc 00000000007911ec /system/framework/oat/arm64/services.odex (Java_com_android_server_input_InputManagerService_nativeSetInputWindows__J_3Lcom_android_server_input_InputWindowHandle_2+160)
at com.android.server.input.InputManagerService.nativeSetInputWindows(Native method)
at com.android.server.input.InputManagerService.setInputWindows(InputManagerService.java:1212)
at com.android.server.wm.InputMonitor.updateInputWindowsLw(InputMonitor.java:414)
at com.android.server.wm.InputMonitor.resumeDispatchingLw(InputMonitor.java:576)
at com.android.server.wm.WindowManagerService.resumeKeyDispatching(WindowManagerService.java:8412)
- locked <0x03a77890> (a java.util.HashMap)
at com.android.server.am.ActivityRecord.resumeKeyDispatchingLocked(ActivityRecord.java:1028)
at com.android.server.am.ActivityStack.finishCurrentActivityLocked(ActivityStack.java:3642)
at com.android.server.am.ActivityStack.completePauseLocked(ActivityStack.java:1353)
at com.android.server.am.ActivityStack.activityPausedLocked(ActivityStack.java:1155)
at com.android.server.am.ActivityManagerService.activityPaused(ActivityManagerService.java:8619)
- locked <0x0b129b42> (a com.android.server.am.ActivityManagerService)
at android.app.ActivityManagerNative.onTransact(ActivityManagerNative.java:545)
at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:2871)
at android.os.Binder.execTransact(Binder.java:458)
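The thread with tid=99, which holds lock 0x0b129b42, is a binder thread. Its stack shows that it called down into the native method
InputDispatcher::setInputWindows
and then blocked there. Here is that method: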
void InputDispatcher::setInputWindows(const Vector<sp<InputWindowHandle> >& inputWindowHandles) {
    {
        AutoMutex _l(mLock);
        ......
    }
}
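From the code, it should be waiting for mLock. Unlike a Java monitor, a native lock does not show in the dump which thread holds it or where the holder is blocked. So how can this be tracked down?
This time we were lucky: the holder can be inferred from the data at hand. (Without that luck, walking through the code step by step might still lead nowhere. The stack dump really should be improved.)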
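To make concrete what information is missing, here is a Java analogue of a lock that can name its holder when a waiter gives up. It only illustrates the idea. It is not how libc's pthread mutex or AutoMutex work, and the class OwnerReportingLock and its methods are hypothetical. It relies on ReentrantLock's protected getOwner() method.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration: a lock that reports who holds it on timeout.
final class OwnerReportingLock extends ReentrantLock {
    /**
     * Tries to take the lock for up to timeoutMs.
     * Returns null when the lock was acquired, otherwise a description of the holder.
     */
    String lockOrDescribeOwner(long timeoutMs) {
        try {
            if (tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                return null;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        Thread owner = getOwner(); // best-effort snapshot, may be null if just released
        return owner == null ? "owner unknown" : "held by " + owner.getName();
    }

    /** The calling thread holds the lock while a second thread tries and reports. */
    static String demo() {
        OwnerReportingLock lock = new OwnerReportingLock();
        lock.lock();
        String[] report = new String[1];
        Thread waiter = new Thread(() -> report[0] = lock.lockOrDescribeOwner(50), "waiter");
        waiter.start();
        try {
            waiter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            lock.unlock();
        }
        return report[0];
    }
}
```

With this kind of bookkeeping the waiter's report would have pointed at the holding thread directly. The native mutex in this trace carries no such record in the dump, which is why the holder has to be inferred from the other stacks.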
mLock is a lock defined in the InputDispatcher class,
and most InputDispatcher code runs on the InputDispatcher and InputReader threads, so look at the stacks of those two threads:
"InputDispatcher" prio=10 tid=38 Blocked
| group="main" sCount=1 dsCount=0 obj=0x12c04c40 self=0x7f88700e00
| sysTid=1330 nice=-8 cgrp=default sched=0/0 handle=0x7f86fff440
| state=S schedstat=( 81606171189 34762360375 500420 ) utm=5024 stm=3136 core=8 HZ=100
| stack=0x7f86f03000-0x7f86f05000 stackSize=1013KB
| held mutexes=
at com.android.server.am.ActivityManagerService.broadcastIntent(ActivityManagerService.java:21148)
- waiting to lock <0x0b129b42> (a com.android.server.am.ActivityManagerService) held by thread 99
at android.app.ContextImpl.sendBroadcast(ContextImpl.java:789)
at com.android.server.policy.PhoneWindowManager.takeScreenshotInteractive(PhoneWindowManager.java:6360)
at com.android.server.wm.InputMonitor.takeScreenshotInteractive(InputMonitor.java:487)
at com.android.server.input.InputManagerService.takeScreenshotInteractive(InputManagerService.java:1586)
The InputDispatcher thread is also waiting for 0x0b129b42. The InputReader thread shows nothing abnormal.
Something here is suspicious:
Most InputDispatcher code runs in the native layer, so why is this thread executing Java code? And if takeScreenshotInteractive had been called from Java, where are the Java frames that should appear beneath it?
It is therefore reasonable to suspect that
takeScreenshotInteractive was called from the native layer.
Tracing the call path of takeScreenshotInteractive:
InputDispatcher.dispatchOnce ->
dispatchOnceInnerLocked ->
dispatchMotionLocked ->
com_android_server_input_InputManagerService.takeScreenshot ->
InputManagerService.takeScreenshotInteractive
Here is InputDispatcher's dispatchOnce method:
void InputDispatcher::dispatchOnce() {
    { // acquire lock
        AutoMutex _l(mLock);
        ......
        dispatchOnceInnerLocked(&nextWakeupTime);
        ......
    } // release lock
}
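dispatchOnce first acquires mLock and then, through a chain of calls, reaches the Java method takeScreenshotInteractive. That method goes on to send a broadcast, and sending a broadcast requires lock 0x0b129b42.
The result is:
the Binder_C thread holds 0x0b129b42 and waits for mLock,
while
the InputDispatcher thread holds mLock and waits for 0x0b129b42,
hence the deadlock.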
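The same conclusion can be checked mechanically by writing down who holds and who waits for each lock and following the chain. The sketch below is a minimal wait-for graph under the assumption that each lock has a single holder and each thread waits for at most one lock. The class WaitForGraph and its method names are invented for this example, and the mLock holder is entered by hand because the trace does not record it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: a tiny wait-for graph over thread and lock names.
final class WaitForGraph {
    private final Map<String, String> holderOf = new HashMap<>();   // lock -> holding thread
    private final Map<String, String> waitingFor = new HashMap<>(); // thread -> awaited lock

    void holds(String thread, String lock) { holderOf.put(lock, thread); }
    void waits(String thread, String lock) { waitingFor.put(thread, lock); }

    /**
     * Follows thread -> awaited lock -> holder until the chain ends or repeats.
     * Returns the threads on the cycle if the chain comes back to start,
     * otherwise an empty list (start may still be blocked behind a cycle).
     */
    List<String> cycleFrom(String start) {
        List<String> path = new ArrayList<>();
        String thread = start;
        while (thread != null && !path.contains(thread)) {
            path.add(thread);
            String lock = waitingFor.get(thread);
            thread = (lock == null) ? null : holderOf.get(lock);
        }
        return start.equals(thread) ? path : new ArrayList<>();
    }
}
```

Recording that Binder_C holds 0x0b129b42 and waits for mLock, and that InputDispatcher holds mLock and waits for 0x0b129b42, makes cycleFrom("Binder_C") return both threads. android.fg, which only waits for 0x0b129b42, gets an empty list: it is a victim of the deadlock, not a participant.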
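PS:
A Java-layer deadlock can be reasonably inferred from the stacks, but what about a native one?
When Java calls a native method, the stack shows the call flow.
When native code calls a Java method, how can the actual call flow be recovered?
These are all areas where the stack dumps need to improve.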