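I. Problem description:
Ten machines were running an automated test, with each round lasting five days. One of the machines stopped before the test completed.
II. Analysis:
1. From the log we can quickly establish that system_server crashed, which restarted the Android layer. The direct cause was an overflow of the global reference table. The VM's dump is as follows: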
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] JNI ERROR (app bug): global reference table overflow (max=51200)
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] global reference table dump:
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] Last 10 entries (of 51200):
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 51199: 0x12e3ba60 com.android.server.content.ContentService$ObserverNode$ObserverEntry
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 51198: 0x12d93760 com.android.server.am.ServiceRecord
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 51197: 0x12d8fa20 java.lang.ref.WeakReference (referent is a android.os.BinderProxy)
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 51196: 0x12e391b8 java.lang.ref.WeakReference (referent is a android.os.BinderProxy)
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 51195: 0x12c4db58 com.android.server.am.ServiceRecord
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 51194: 0x12dc3e78 java.lang.ref.WeakReference (referent is a android.os.BinderProxy)
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 51193: 0x12e3b560 java.lang.ref.WeakReference (referent is a android.os.BinderProxy)
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 51192: 0x12e38718 java.lang.ref.WeakReference (referent is a android.os.BinderProxy)
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 51191: 0x12dc3fc0 java.lang.ref.WeakReference (referent is a android.os.BinderProxy)
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 51190: 0x12fc8b60 com.android.server.am.ServiceRecord
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] Summary:
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 24615 of java.lang.ref.WeakReference (24615 unique instances)
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 23622 of com.android.server.content.ContentService$ObserverNode$ObserverEntry (23622 unique instances)
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 758 of android.os.Binder (758 unique instances)
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 485 of com.android.server.notification.NotificationManagerService$StatusBarNotificationHolder (485 unique instances)
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 468 of com.android.server.am.ServiceRecord (468 unique instances)
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 319 of java.lang.Class (239 unique instances)
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 181 of java.nio.DirectByteBuffer (170 unique instances)
08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] 119 of android.os.RemoteCallbackList$Callback (119 unique instances)
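The key points in the dump above are: the VM limits the global reference table to 51200 entries; the last entry created was an ObserverEntry object; and the two classes with the most references are WeakReference and ObserverEntry. (We suspect the WeakReference entries are weak references to binder objects. An ObserverEntry is the framework-side instance of a content observer and is used to call back into the app, which would explain why the ObserverEntry and WeakReference counts are so close.) The only usable lead here is ObserverEntry. Given how ContentService is implemented, we suspected that some application was repeatedly registering content observers without releasing them at the appropriate time. Confirming this required a live device, but the state of the machine that reproduced the problem had already been lost, so we had to debug on a machine that had finished the test without hitting the problem and had not been rebooted since.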
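The ranking described above can be pulled out of the log mechanically. The following is a minimal sketch, not part of the original investigation: the class name `RefTableSummary` and its methods are hypothetical, and the regular expression assumes the summary lines have exactly the shape shown in the dump. Applied to the summary above, the top two classes account for 24615 + 23622 = 48237 of the 51200 slots, roughly 94%.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: ranks the "Summary" rows of a global reference table dump. */
final class RefTableSummary {

    /** One summary row: how many table slots a class occupies. */
    static final class Row {
        final String className;
        final int count;

        Row(String className, int count) {
            this.className = className;
            this.count = count;
        }
    }

    // Matches the tail of a summary line, e.g.
    // "...cc:256] 24615 of java.lang.ref.WeakReference (24615 unique instances)"
    private static final Pattern ROW =
            Pattern.compile("\\]\\s*(\\d+) of (\\S+) \\((\\d+) unique instances\\)");

    /** Extracts all summary rows from raw log lines, largest count first. */
    static List<Row> parse(List<String> logLines) {
        List<Row> rows = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = ROW.matcher(line);
            if (m.find()) {
                rows.add(new Row(m.group(2), Integer.parseInt(m.group(1))));
            }
        }
        rows.sort((a, b) -> Integer.compare(b.count, a.count));
        return rows;
    }

    public static void main(String[] args) {
        String prefix = "08-25 18:41:14.285 955 4704 F zygote64: indirect_reference_table.cc:256] ";
        List<String> lines = Arrays.asList(
                prefix + "Summary:",
                prefix + "758 of android.os.Binder (758 unique instances)",
                prefix + "24615 of java.lang.ref.WeakReference (24615 unique instances)");
        for (Row row : parse(lines)) {
            System.out.println(row.count + "  " + row.className);
        }
    }
}
```

Lines that are not summary rows (for example the "Last 10 entries" header) simply do not match and are skipped.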
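2. We obtained a machine that had not been rebooted after the test. Since there were so many ObserverEntry objects, we ran dumpsys content to inspect ContentService's state. For more detail on its internals, see the logic of ContentService's registerContentObserver, unregisterContentObserver and dump functions; we skip the details here and go straight to the result. The output shows a large number of observers, all registered by the process with pid=1266. The ps command (or the log) shows that pid 1266 is SystemUI.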
……
settings/global/always_on_display_constants:pid=1266 uid=10044 user=-1 target=17dc217
settings/global/always_on_display_constants:pid=1266 uid=10044 user=-1 target=878d604
……
pid 1266: 10532 observers
……
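With more than ten thousand entries, it helps to group the dumpsys content output by URI for the suspect pid. The following is a minimal sketch under the assumption that each observer line has the `uri:pid=... uid=...` shape shown above; `ObserverCounter` and `countByUri` are hypothetical names, not part of any Android tool.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: counts observers per URI for one pid in "dumpsys content" output. */
final class ObserverCounter {

    // Matches e.g. "settings/global/always_on_display_constants:pid=1266 uid=10044 ..."
    private static final Pattern ENTRY = Pattern.compile("^\\s*(\\S+?):\\s*pid=(\\d+)\\s");

    /** Returns URI -> number of observer entries registered by the given pid. */
    static Map<String, Integer> countByUri(List<String> dumpLines, int pid) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : dumpLines) {
            Matcher m = ENTRY.matcher(line);
            if (m.find() && Integer.parseInt(m.group(2)) == pid) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "settings/global/always_on_display_constants:pid=1266 uid=10044 user=-1 target=17dc217",
                "settings/global/always_on_display_constants:pid=1266 uid=10044 user=-1 target=878d604",
                "pid 1266: 10532 observers");
        // The per-pid total line does not match the entry pattern and is ignored.
        countByUri(lines, 1266).forEach((uri, n) -> System.out.println(n + "  " + uri));
    }
}
```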
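3. At this point we were fairly sure that SystemUI was repeatedly registering observers for always_on_display_constants without releasing them. Searching the SystemUI code for the observed setting, we found the logic that registers the always_on_display_constants observer: three such observers are registered each time DozeService starts, and we could find no place where they are unregistered. This should be the cause of the ObserverEntry leak.
4. So when does DozeService start, and when does it exit? The log contains a large number of DozeService start and stop records, with 7609 starts. With 3 observers leaked per start/stop cycle, the visible portion of the log alone accounts for 7609 × 3 = 22827 leaked observers, which is very close to the ObserverEntry count (23622) in the reference table dump above.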
08-21 05:41:56.013 955 4704 I PowerManagerService: Going to sleep due to power button (uid 1000)…
……
08-21 05:41:56.713 955 974 I DreamController: Starting dream: name=ComponentInfo{com.android.systemui/com.android.systemui.doze.DozeService}, isTest=false, canDoze=true, userId=0
……
08-21 05:42:01.530 955 1520 I PowerManagerService: Waking up from dozing (uid=1000 reason=android.policy:POWER)…
……
08-21 05:42:01.829 955 974 I DreamController: Stopping dream: name=ComponentInfo{com.android.systemui/com.android.systemui.doze.DozeService}, isTest=false, canDoze=true, userId=0
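The estimate in step 4 can be reproduced from the log itself. This is a minimal sketch, assuming that every DozeService start appears as a "DreamController: Starting dream" line like the one above and that each start leaks exactly three observers, as found in step 3; `DozeLeakEstimate` is a hypothetical helper name.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: estimates leaked observers from DozeService start records. */
final class DozeLeakEstimate {

    // From the analysis: three always_on_display_constants observers per DozeService start.
    static final int OBSERVERS_PER_START = 3;

    /** Counts log lines that record DreamController starting DozeService. */
    static long countDozeStarts(List<String> logLines) {
        return logLines.stream()
                .filter(l -> l.contains("DreamController: Starting dream"))
                .filter(l -> l.contains("com.android.systemui.doze.DozeService"))
                .count();
    }

    /** Leaked observers, assuming none are ever unregistered. */
    static long estimateLeaked(long starts) {
        return starts * OBSERVERS_PER_START;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "08-21 05:41:56.713 955 974 I DreamController: Starting dream: name=ComponentInfo{com.android.systemui/com.android.systemui.doze.DozeService}, isTest=false, canDoze=true, userId=0",
                "08-21 05:42:01.829 955 974 I DreamController: Stopping dream: name=ComponentInfo{com.android.systemui/com.android.systemui.doze.DozeService}, isTest=false, canDoze=true, userId=0");
        System.out.println("starts in sample: " + countDozeStarts(sample));
        System.out.println("leak for 7609 starts: " + estimateLeaked(7609)); // 22827
    }
}
```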
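Automated tests are usually time-consuming, which makes the problem hard to verify, so it is best to find a quick way to reproduce it. Here that is fairly obvious: DozeService starts when the screen turns off and exits when the screen turns on. This is framework logic and we will not analyse it in detail here. We checked our hypothesis by toggling the screen and running dumpsys content: after one screen-off/screen-on cycle the number of observers grows by 3, and all 3 new ones are for always_on_display_constants.
5. We asked the test team whether this automated test case toggles the screen on and off. They confirmed that it does, because the customer had requested this test case. This gave us a way to reproduce the problem, which made verifying the fix much easier.
III. Fix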
diff --git a/packages/SystemUI/src/com/android/systemui/doze/AlwaysOnDisplayPolicy.java b/packages/SystemUI/src/com/android/systemui/doze/AlwaysOnDisplayPolicy.java
index debda21..9b69031 100644
--- a/packages/SystemUI/src/com/android/systemui/doze/AlwaysOnDisplayPolicy.java
+++ b/packages/SystemUI/src/com/android/systemui/doze/AlwaysOnDisplayPolicy.java
@@ -102,6 +102,14 @@
mSettingsObserver.observe();
}
+ public void clear() {
+ if (mSettingsObserver != null) {
+ //mSettingsObserver.clearObserve();
+ ContentResolver resolver = mContext.getContentResolver();
+ resolver.unregisterContentObserver(mSettingsObserver);
+ }
+ }
+
private int[] parseIntArray(final String key, final int[] defaultArray) {
final String value = mParser.getString(key, null);
if (value != null) {
@@ -130,7 +138,12 @@
false, this, UserHandle.USER_ALL);
update(null);
}
-
+/*
+ void clearObserve() {
+ ContentResolver resolver = mContext.getContentResolver();
+ resolver.unregisterContentObserver(this);
+ }
+*/
@Override
public void onChange(boolean selfChange, Uri uri) {
update(uri);
diff --git a/packages/SystemUI/src/com/android/systemui/doze/DozeMachine.java b/packages/SystemUI/src/com/android/systemui/doze/DozeMachine.java
index 8ec6afc..7d3dc3f 100644
--- a/packages/SystemUI/src/com/android/systemui/doze/DozeMachine.java
+++ b/packages/SystemUI/src/com/android/systemui/doze/DozeMachine.java
@@ -129,6 +129,13 @@
mParts = parts;
}
+ /** clear some reference stored in framework-system_server */
+ public void clear() {
+ for (Part p : mParts) {
+ p.clear();
+ }
+ }
+
/**
* Requests transitioning to {@code requestedState}.
*
@@ -348,6 +355,8 @@
*/
void transitionTo(State oldState, State newState);
+ default void clear() {}
+
/** Dump current state. For debugging only. */
default void dump(PrintWriter pw) {}
}
diff --git a/packages/SystemUI/src/com/android/systemui/doze/DozePauser.java b/packages/SystemUI/src/com/android/systemui/doze/DozePauser.java
index 58f1448..f7f49225 100644
--- a/packages/SystemUI/src/com/android/systemui/doze/DozePauser.java
+++ b/packages/SystemUI/src/com/android/systemui/doze/DozePauser.java
@@ -50,6 +50,13 @@
}
}
+ @Override
+ public void clear() {
+ if (mPolicy != null) {
+ mPolicy.clear();
+ }
+ }
+
private void onTimeout() {
mMachine.requestState(DozeMachine.State.DOZE_AOD_PAUSED);
}
diff --git a/packages/SystemUI/src/com/android/systemui/doze/DozeScreenBrightness.java b/packages/SystemUI/src/com/android/systemui/doze/DozeScreenBrightness.java
index 4bb4e79..f31d3c6 100644
--- a/packages/SystemUI/src/com/android/systemui/doze/DozeScreenBrightness.java
+++ b/packages/SystemUI/src/com/android/systemui/doze/DozeScreenBrightness.java
@@ -38,6 +38,7 @@
private final Sensor mLightSensor;
private final int[] mSensorToBrightness;
private final int[] mSensorToScrimOpacity;
+ private final AlwaysOnDisplayPolicy mPolicy;
private boolean mRegistered;
private int mDefaultDozeBrightness;
@@ -58,16 +59,31 @@
mDefaultDozeBrightness = defaultDozeBrightness;
mSensorToBrightness = sensorToBrightness;
mSensorToScrimOpacity = sensorToScrimOpacity;
+
+ mPolicy = null;
}
@VisibleForTesting
public DozeScreenBrightness(Context context, DozeMachine.Service service,
SensorManager sensorManager, Sensor lightSensor, DozeHost host,
Handler handler, AlwaysOnDisplayPolicy policy) {
+/*
this(context, service, sensorManager, lightSensor, host, handler,
context.getResources().getInteger(
com.android.internal.R.integer.config_screenBrightnessDoze),
policy.screenBrightnessArray, policy.dimmingScrimArray);
+*/
+ mContext = context;
+ mDozeService = service;
+ mSensorManager = sensorManager;
+ mLightSensor = lightSensor;
+ mDozeHost = host;
+ mHandler = handler;
+
+ mDefaultDozeBrightness = context.getResources().getInteger(com.android.internal.R.integer.config_screenBrightnessDoze);
+ mSensorToBrightness = policy.screenBrightnessArray;
+ mSensorToScrimOpacity = policy.dimmingScrimArray;
+ mPolicy = policy;
}
@Override
@@ -94,6 +110,13 @@
}
@Override
+ public void clear() {
+ if (mPolicy != null) {
+ mPolicy.clear();
+ }
+ }
+
+ @Override
public void onSensorChanged(SensorEvent event) {
Trace.beginSection("DozeScreenBrightness.onSensorChanged" + event.values[0]);
try {
diff --git a/packages/SystemUI/src/com/android/systemui/doze/DozeSensors.java b/packages/SystemUI/src/com/android/systemui/doze/DozeSensors.java
index 91cde37..000e47a 100644
--- a/packages/SystemUI/src/com/android/systemui/doze/DozeSensors.java
+++ b/packages/SystemUI/src/com/android/systemui/doze/DozeSensors.java
@@ -133,6 +133,12 @@
return null;
}
+ public void clear() {
+ if (mProxSensor != null) {
+ mProxSensor.clear();
+ }
+ }
+
public void setListening(boolean listen) {
for (TriggerSensor s : mSensors) {
s.setListening(listen);
@@ -234,6 +240,12 @@
updateRegistered();
}
+ public void clear() {
+ if (mPolicy != null) {
+ mPolicy.clear();
+ }
+ }
+
private void updateRegistered() {
setRegistered(mRequested && !mCooldownTimer.isScheduled());
}
diff --git a/packages/SystemUI/src/com/android/systemui/doze/DozeService.java b/packages/SystemUI/src/com/android/systemui/doze/DozeService.java
index 98b1106..b147f97 100644
--- a/packages/SystemUI/src/com/android/systemui/doze/DozeService.java
+++ b/packages/SystemUI/src/com/android/systemui/doze/DozeService.java
@@ -57,6 +57,16 @@
mDozeMachine = new DozeFactory().assembleMachine(this);
}
+ /** {@inheritDoc} */
+ @Override
+ public void onDestroy() {
+ Log.d(TAG, "onDestroy() being called when DreamController stop this service");
+ if (mDozeMachine != null) {
+ mDozeMachine.clear();
+ }
+ super.onDestroy();
+ }
+
@Override
public void onPluginConnected(DozeServicePlugin plugin, Context pluginContext) {
mDozePlugin = plugin;
diff --git a/packages/SystemUI/src/com/android/systemui/doze/DozeTriggers.java b/packages/SystemUI/src/com/android/systemui/doze/DozeTriggers.java
index f7a258a..ea6ae4d 100644
--- a/packages/SystemUI/src/com/android/systemui/doze/DozeTriggers.java
+++ b/packages/SystemUI/src/com/android/systemui/doze/DozeTriggers.java
@@ -212,6 +212,13 @@
}
}
+ @Override
+ public void clear() {
+ if (mDozeSensors != null) {
+ mDozeSensors.clear();
+ }
+ }
+
private void checkTriggersAtInit() {
if (mUiModeManager.getCurrentModeType() == Configuration.UI_MODE_TYPE_CAR
|| mDozeHost.isPowerSaveActive()
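The idea behind the patch is simply to pair every registration with an unregistration when the service is destroyed. The following is a minimal, self-contained model of that pattern, not SystemUI code: `FakeRegistry` stands in for the system-side observer list and `Part` for a component that owns one observer. Both names are hypothetical, as is the assumption of three parts per service instance (taken from the three registrations found in the analysis).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the leak and the fix; none of these types are real Android classes. */
final class ObserverLifecycleModel {

    /** Stand-in for the system-side registry that outlives the service. */
    static final class FakeRegistry {
        private final Set<Object> observers = new HashSet<>();

        void register(Object observer) { observers.add(observer); }
        void unregister(Object observer) { observers.remove(observer); }
        int size() { return observers.size(); }
    }

    /** Stand-in for a component that registers one observer when it is created. */
    static final class Part {
        private final FakeRegistry registry;
        private final Object observer = new Object();

        Part(FakeRegistry registry) {
            this.registry = registry;
            registry.register(observer);
        }

        /** Mirrors the clear() hook added by the patch. */
        void clear() { registry.unregister(observer); }
    }

    /** Runs start/stop cycles and returns how many observers remain registered. */
    static int observersAfter(int cycles, boolean clearOnDestroy) {
        FakeRegistry registry = new FakeRegistry();
        for (int i = 0; i < cycles; i++) {
            List<Part> parts = new ArrayList<>();
            for (int p = 0; p < 3; p++) {
                parts.add(new Part(registry)); // service created: three registrations
            }
            if (clearOnDestroy) {
                for (Part part : parts) {
                    part.clear();              // service destroyed: release them all
                }
            }
        }
        return registry.size();
    }

    public static void main(String[] args) {
        System.out.println("without clear: " + observersAfter(100, false));
        System.out.println("with clear:    " + observersAfter(100, true));
    }
}
```

Without the clean-up the registry grows by three entries per cycle; with it the count returns to zero after every cycle, which is the behaviour the patch aims for.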
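IV. Verification
Since the problem was well understood, we verified the fix directly by comparing 20000 sleep/wake cycles on two machines, one flashed with the old build and one flashed with the patched build. The old build crashed after around 8000 cycles, while the patched build was still running normally after 20000.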
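The figure of roughly 8000 cycles is plausible as a back-of-the-envelope estimate. The sketch below rests on two assumptions that the analysis only suspects and does not prove: each leaked observer occupies two slots in the global reference table (one ObserverEntry plus one WeakReference), and the remaining entries form a fixed baseline of 51200 - 24615 - 23622 = 2963 slots, taken from the crash dump. `OverflowEstimate` is a hypothetical helper.

```java
/** Rough estimate of how many sleep/wake cycles fill the global reference table. */
final class OverflowEstimate {

    /**
     * @param tableMax      capacity of the global reference table
     * @param baseline      slots assumed to be used by everything other than the leak
     * @param slotsPerCycle slots assumed to be leaked by one sleep/wake cycle
     * @return number of complete cycles that fit before the table is full
     */
    static int cyclesUntilOverflow(int tableMax, int baseline, int slotsPerCycle) {
        return (tableMax - baseline) / slotsPerCycle;
    }

    public static void main(String[] args) {
        int tableMax = 51200;                   // "max=51200" in the crash dump
        int baseline = 51200 - 24615 - 23622;   // assumption: all other entries stay constant
        int slotsPerCycle = 3 * 2;              // assumption: 3 observers, 2 slots each
        System.out.println(cyclesUntilOverflow(tableMax, baseline, slotsPerCycle));
    }
}
```

Under these assumptions the estimate comes out at 8039 cycles, which is consistent with the crash observed on the old build.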