Watchdog caused by a deadlock between Java and native code

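I. A watchdog can be triggered in two situations:

1. A monitored thread in system_server is extremely busy. The watchdog posts a check task to that thread, but the thread's handler cannot get to the newly posted task within the timeout of about one minute, so the watchdog fires. This only happens when system_server is stuck in something like an infinite loop or a deadlock for that entire time, and it is relatively rare.

2. Deadlock. Most of the watchdogs I have handled before were of this kind.
(PS: The watchdogs I handled before were all Java-layer deadlocks, which are easy to resolve as long as the stack traces are complete. This time I ran into a deadlock between Java and native code.)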
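
The first situation can be sketched in plain Java. This is only a minimal illustration of the idea, not the real com.android.server.Watchdog: a single-thread executor stands in for a monitored handler thread, and the names WatchdogSketch and isResponsive are made up for this example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a watchdog check; not the Android implementation.
class WatchdogSketch {
    /**
     * Posts a no-op check task to the monitored single-thread executor and
     * reports whether that task ran within timeoutMs.
     */
    static boolean isResponsive(ExecutorService monitored, long timeoutMs) {
        CountDownLatch checked = new CountDownLatch(1);
        monitored.execute(checked::countDown);
        try {
            return checked.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService handlerThread = Executors.newSingleThreadExecutor();
        System.out.println("idle thread responsive: " + isResponsive(handlerThread, 1000));

        // Occupy the thread so the next check task stays queued behind it.
        CountDownLatch release = new CountDownLatch(1);
        handlerThread.execute(() -> {
            try {
                release.await();
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("stuck thread responsive: " + isResponsive(handlerThread, 200));

        release.countDown();
        handlerThread.shutdown();
    }
}
```

The check task itself does nothing; what matters is whether the monitored thread gets around to running it before the timeout expires.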

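II. The problem

1. Symptom:
In the camera's wide-aperture mode, taking a photo with the volume-down key causes the device to reboot automatically.
2. Watchdog message: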

    Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ui thread (android.ui), Blocked in handler on ActivityManager (ActivityManager)

3. Trace and analysis:

----- pid 1185 at 2017-05-08 13:53:05 -----
Cmd line: system_server
"android.fg" prio=5 tid=14 Blocked
  | group="main" sCount=1 dsCount=0 obj=0x12cc4cf0 self=0x7f94e92e00
  | sysTid=1243 nice=0 cgrp=default sched=0/0 handle=0x7f8a4df440
  | state=S schedstat=( 15683722127 21550750622 33957 ) utm=857 stm=711 core=0 HZ=100
  | stack=0x7f8a3dd000-0x7f8a3df000 stackSize=1037KB
  | held mutexes=
  at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:24115)
  - waiting to lock <0x0b129b42> (a com.android.server.am.ActivityManagerService) held by thread 99
  at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:214)
  at android.os.Handler.handleCallback(Handler.java:815)
  at android.os.Handler.dispatchMessage(Handler.java:104)
  at android.os.Looper.loop(Looper.java:207)
  at android.os.HandlerThread.run(HandlerThread.java:61)
  at com.android.server.ServiceThread.run(ServiceThread.java:46)

The stack above shows that android.fg is waiting for lock 0x0b129b42, which is held by the thread with tid=99. (android.ui and ActivityManager are waiting for the same lock.)
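
When a trace file is long, picking out these "waiting to lock ... held by thread N" lines by hand gets tedious. Below is a rough sketch of a helper that does it. The class name BlockedThreadFinder and its output format are my own invention, and the regular expressions assume the line layout seen in the trace above; other ART versions may format these lines differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes the trace layout shown in this post.
class BlockedThreadFinder {
    // Matches a thread header such as: "android.fg" prio=5 tid=14 Blocked
    private static final Pattern HEADER =
            Pattern.compile("\"([^\"]+)\" .*\\btid=(\\d+)\\b.*");
    // Matches: - waiting to lock <0x0b129b42> (a Foo) held by thread 99
    private static final Pattern WAITING =
            Pattern.compile("- waiting to lock <(0x[0-9a-f]+)> .* held by thread (\\d+)");

    /** Returns one summary string per Java monitor wait found in the trace. */
    static List<String> findWaits(List<String> traceLines) {
        List<String> result = new ArrayList<>();
        String current = "?"; // thread whose stack is currently being read
        for (String line : traceLines) {
            Matcher header = HEADER.matcher(line.trim());
            if (header.matches()) {
                current = header.group(1) + " (tid=" + header.group(2) + ")";
                continue;
            }
            Matcher waiting = WAITING.matcher(line);
            if (waiting.find()) {
                result.add(current + " waits for " + waiting.group(1)
                        + " held by tid=" + waiting.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> trace = Arrays.asList(
                "\"android.fg\" prio=5 tid=14 Blocked",
                "  at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:24115)",
                "  - waiting to lock <0x0b129b42> (a com.android.server.am.ActivityManagerService) held by thread 99");
        findWaits(trace).forEach(System.out::println);
    }
}
```

Note that this only finds Java monitor waits. A thread blocked on a native mutex, as in the next stack, produces no such line.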

"Binder_C" prio=5 tid=99 Native
  | group="main" sCount=1 dsCount=0 obj=0x13a6d0a0 self=0x7f87e4c800
  | sysTid=2194 nice=-4 cgrp=default sched=0/0 handle=0x7f7f7f1440
  | state=S schedstat=( 167348256976 82664258066 457250 ) utm=10416 stm=6318 core=9 HZ=100
  | stack=0x7f7f6f5000-0x7f7f6f7000 stackSize=1013KB
  | held mutexes=
  kernel: (couldn't read /proc/self/task/2194/stack)
  native: #00 pc 000000000001c02c /system/lib64/libc.so (syscall+28)
  native: #01 pc 0000000000068394 /system/lib64/libc.so (_ZL33__pthread_mutex_lock_with_timeoutP24pthread_mutex_internal_tPK8timespeci.constprop.0+484)
  native: #02 pc 000000000006863c /system/lib64/libc.so (pthread_mutex_lock+36)
  native: #03 pc 000000000002eb48 /system/lib64/libinputflinger.so (_ZN7android15InputDispatcher15setInputWindowsERKNS_6VectorINS_2spINS_17InputWindowHandleEEEEE+88)
  native: #04 pc 0000000000014c44 /system/lib64/libandroid_servers.so (_ZN7android18NativeInputManager15setInputWindowsEP7_JNIEnvP13_jobjectArray+336)
  native: #05 pc 00000000007911ec /system/framework/oat/arm64/services.odex (Java_com_android_server_input_InputManagerService_nativeSetInputWindows__J_3Lcom_android_server_input_InputWindowHandle_2+160)
  at com.android.server.input.InputManagerService.nativeSetInputWindows(Native method)
  at com.android.server.input.InputManagerService.setInputWindows(InputManagerService.java:1212)
  at com.android.server.wm.InputMonitor.updateInputWindowsLw(InputMonitor.java:414)
  at com.android.server.wm.InputMonitor.resumeDispatchingLw(InputMonitor.java:576)
  at com.android.server.wm.WindowManagerService.resumeKeyDispatching(WindowManagerService.java:8412)
  - locked <0x03a77890> (a java.util.HashMap)
  at com.android.server.am.ActivityRecord.resumeKeyDispatchingLocked(ActivityRecord.java:1028)
  at com.android.server.am.ActivityStack.finishCurrentActivityLocked(ActivityStack.java:3642)
  at com.android.server.am.ActivityStack.completePauseLocked(ActivityStack.java:1353)
  at com.android.server.am.ActivityStack.activityPausedLocked(ActivityStack.java:1155)
  at com.android.server.am.ActivityManagerService.activityPaused(ActivityManagerService.java:8619)
  - locked <0x0b129b42> (a com.android.server.am.ActivityManagerService)
  at android.app.ActivityManagerNative.onTransact(ActivityManagerNative.java:545)
  at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:2871)
  at android.os.Binder.execTransact(Binder.java:458)

The thread with tid=99 that holds lock 0x0b129b42 is a Binder thread. Its stack shows that it called down into the native method
InputDispatcher::setInputWindows
and is waiting there. Here is that method:

void InputDispatcher::setInputWindows(const Vector<sp<InputWindowHandle> >& inputWindowHandles) {
    { 
        AutoMutex _l(mLock);
        ......
    }
}

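From the code, it should be waiting for mLock. Unlike a Java monitor, a native lock does not show up in the trace with its holder or where that holder is blocked, so how do we track this down?
This time we were lucky and could infer it from the data at hand. (When luck runs out, you step through the code and may still come up empty; the stack dump really should be improved.)
mLock is a lock defined in the InputDispatcher class,
and most InputDispatcher code runs on the InputDispatcher and InputReader threads, so look at the stacks of those two threads: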

"InputDispatcher" prio=10 tid=38 Blocked                                                                   
  | group="main" sCount=1 dsCount=0 obj=0x12c04c40 self=0x7f88700e00
  | sysTid=1330 nice=-8 cgrp=default sched=0/0 handle=0x7f86fff440
  | state=S schedstat=( 81606171189 34762360375 500420 ) utm=5024 stm=3136 core=8 HZ=100
  | stack=0x7f86f03000-0x7f86f05000 stackSize=1013KB
  | held mutexes=
  at com.android.server.am.ActivityManagerService.broadcastIntent(ActivityManagerService.java:21148)
  - waiting to lock <0x0b129b42> (a com.android.server.am.ActivityManagerService) held by thread 99
  at android.app.ContextImpl.sendBroadcast(ContextImpl.java:789)
  at com.android.server.policy.PhoneWindowManager.takeScreenshotInteractive(PhoneWindowManager.java:6360)
  at com.android.server.wm.InputMonitor.takeScreenshotInteractive(InputMonitor.java:487)
  at com.android.server.input.InputManagerService.takeScreenshotInteractive(InputManagerService.java:1586)

The InputDispatcher thread is also waiting for 0x0b129b42; the InputReader thread shows nothing abnormal.

There is something suspicious here:
Most InputDispatcher code runs in native, so why is this thread executing Java code? And if takeScreenshotInteractive had been called from the Java layer, where are the Java frames below takeScreenshotInteractive?
This makes it reasonable to suspect that
takeScreenshotInteractive was called from the native layer.
Tracing the call flow of takeScreenshotInteractive:
InputDispatcher.dispatchOnce ->
dispatchOnceInnerLocked ->
dispatchMotionLocked ->
com_android_server_input_InputManagerService.takeScreenshot ->
InputManagerService.takeScreenshotInteractive

Now look at InputDispatcher's dispatchOnce method:

void InputDispatcher::dispatchOnce() {
    { // acquire lock
        AutoMutex _l(mLock);
        ......
        dispatchOnceInnerLocked(&nextWakeupTime);
        ......
    } // release lock
}

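dispatchOnce first acquires mLock and then, through a chain of calls, reaches the Java-layer takeScreenshotInteractive method. That method goes on to send a broadcast, and sending the broadcast has to wait for lock 0x0b129b42.
The result is:
The Binder_C thread holds 0x0b129b42 and waits for mLock.

The InputDispatcher thread holds mLock and waits for 0x0b129b42.
Hence the deadlock.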
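
The same lock-order inversion can be reproduced in a few lines of plain Java. This is only a sketch under my own assumptions: two ReentrantLock objects stand in for the ActivityManagerService monitor and the native mLock, the class name LockInversionSketch is made up, and tryLock with a timeout is used so the demo reports the stall instead of hanging forever the way the real threads did.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo of the lock cycle; not Android code.
class LockInversionSketch {
    /** Returns {binderGotSecondLock, dispatcherGotSecondLock}. */
    static boolean[] run(long timeoutMs) {
        ReentrantLock amsLock = new ReentrantLock();        // stands in for 0x0b129b42
        ReentrantLock dispatcherLock = new ReentrantLock(); // stands in for mLock
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] gotSecond = new boolean[2];

        // Binder_C: takes the AMS lock first, then wants mLock.
        Thread binder = new Thread(() -> gotSecond[0] = holdThenTry(
                amsLock, dispatcherLock, bothHoldFirst, bothTried, timeoutMs), "Binder_C");
        // InputDispatcher: takes mLock first, then wants the AMS lock.
        Thread dispatcher = new Thread(() -> gotSecond[1] = holdThenTry(
                dispatcherLock, amsLock, bothHoldFirst, bothTried, timeoutMs), "InputDispatcher");
        binder.start();
        dispatcher.start();
        try {
            binder.join();
            dispatcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return gotSecond;
    }

    private static boolean holdThenTry(ReentrantLock first, ReentrantLock second,
            CountDownLatch bothHoldFirst, CountDownLatch bothTried, long timeoutMs) {
        first.lock();
        try {
            // Wait until each thread holds its first lock, so the cycle is certain.
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            boolean got = second.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (got) {
                second.unlock();
            }
            // Keep holding the first lock until both attempts have finished.
            bothTried.countDown();
            bothTried.await();
            return got;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        boolean[] result = run(200);
        System.out.println("Binder_C got mLock: " + result[0]);
        System.out.println("InputDispatcher got AMS lock: " + result[1]);
    }
}
```

Neither thread can obtain its second lock while the other still holds it. With plain blocking lock calls in place of tryLock, both threads would wait forever, which is the state the watchdog caught.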

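PS:
A Java-layer deadlock can be reasonably inferred from the stack traces, but what about a native one?
When Java calls a native method, the flow shows up in the stack.
When native calls a Java method, how do we recover the exact flow?
These are all areas where the stack dump needs to improve.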

    Original author: java锁
    Original article: https://blog.csdn.net/xiaolli/article/details/72150384