4.15.0-1034-gcp内核TCP SACK BUG导致的请求无返回

起因

某天更新发现某服务收到请求但客户端无法收到返回。幸运的是,客户端同学能在测试环境重现问题。2分法找到首个故障版本并进行了回退。故障版本仅仅修改了snd_buf,理论上不应导致该问题。

追查过程

抓包判断故障端

定位到服务端未返回

15:27:52.589785 IP (tos 0x0, ttl 52, id 15592, offset 0, flags [DF], proto TCP (6), length 187)
    106.11.0.208.26082 > XXX.12345: Flags [P.], cksum 0xfe51 (correct), seq 51:186, ack 104, win 347, options [nop,nop,TS val 94902541 ecr 3888443977], length 135
E...<.@.4..Aj.......e.09..1........[.Q.....
.......I....s...K......}..[..KG.....N...r?[T....\"...U@q.3.O*M..5..........e...hx.1...43K7#.......<Gu..O..&qNX.o .....Z......?.s....q....N*.y..
15:27:52.632302 IP (tos 0x0, ttl 63, id 13795, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.12345 > 106.11.0.208.26082: Flags [.], cksum 0x6b54 (correct), seq 104, ack 186, win 224, options [nop,nop,TS val 3888446989 ecr 94902541], length 0
E..45.@.?.......j...09e.......2(....kT.....
.......

服务端调用了 :gen_tcp.send, 发送接口返回:ok. 为什么抓包无法抓到呢?

排查erlang的问题

erlang版本: OTP-21.0
最终定位至 inet_drv.c:10902, 这里也可以看到, erlang最终是调用了writev.
使用ltrace跟踪writev系统调用.

1ltrace -p 343 -e writev
2
3<... writev resumed> )                                                                                                                            = 4
4<... writev resumed> )                                                                                                                            = 4
5exe->writev(233, 0xe697e9d0, 2, 247)                                                                                                              = 213
6exe->writev(249, 0xe697e9d0, 2, 10)                                                                                                               = 0x5166
7exe->writev(19, 0xe687c800, 2, 0x2b5b82c0)                                                                                                        = 34

发现调用了writev, 0x5166的长度, 和代码里send的长度20834一致(0x5166=20834+4, 还有erlang自动添加的4字节包头长度).

inet_drv.c:10856
    case TCP_PB_4: 
            put_int32(len, buf);
            h_len = 4; 
            break

排查docker的问题

尝试在docker 容器内用执行如下python代码

import socket
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(('0.0.0.0',12345))
s.listen(5)
conn, addr = s.accept()
conn.setblocking(0)
conn.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
conn.send("a" * 40960)
conn.close()

在另一个容器telnet, 能够收到返回, 不能重现问题.
而直接在容器外, 使用同样脚本, 从外网telnet. 可以重现问题. 故排除docker问题.

排查linux kernel问题

发现同样python脚本, 在某个QA环境, 内核版本为4.15.0-1026-gcp, 无法重现问题.
在另一QA环境, 内核版本为4.15.0-1034-gcp, 可重现问题. 于是怀疑是linux kernel.

查看linux kernel 4.15.0 提交日志

linux v4.15有9个rc版本, 正式版在Jan 29, 2018, 理论上, 不应该那么久都没有人报bug的.

gcp

所以, gcp是什么呢?

Software Description

linux - Linux kernel
linux-aws - Linux kernel for Amazon Web Services (AWS) systems
linux-gcp - Linux kernel for Google Cloud Platform (GCP) systems
linux-kvm - Linux kernel for cloud environments
linux-raspi2 - Linux kernel for Raspberry Pi 2
linux-snapdragon - Linux kernel for Snapdragon processors
linux-azure - Linux kernel for Microsoft Azure Cloud systems
linux-hwe - Linux hardware enablement (HWE) kernel
linux-oem - Linux kernel for OEM processors
linux-oracle - Linux kernel for Oracle Cloud systems
linux-aws-hwe - Linux kernel for Amazon Web Services (AWS-HWE) systems

这里猜测是谷歌云对linux kernel做了一些私有化修改.
于是从1026往上逐个测试linux内核版本:

root@XXX:/home/dingxinglong# apt-cache search gcp | grep 4.15.0 | grep signed
linux-image-unsigned-4.15.0-1018-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1019-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1021-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1023-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1024-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1025-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1026-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1027-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1028-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1029-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1030-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1032-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1033-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1034-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1036-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1037-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1040-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1041-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1042-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1044-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1046-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1047-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP

发现确实和linux kernel image相关:
4.15.0-1026-gcp OK
4.15.0-1027-gcp OK
4.15.0-1028-gcp OK
4.15.0-1029-gcp OK
4.15.0-1030-gcp OK
4.15.0-1032-gcp OK
4.15.0-1033-gcp OK
4.15.0-1034-gcp BUG
4.15.0-1036-gcp OK
同一台机器, 在1033版本ok, 升到1034后有BUG, 升到1036后再次OK.
可查到对应的bug:
http://changelogs.ubuntu.com/…

  • Remote denial of service (resource exhaustion) caused by TCP SACK scoreboard

bug出现时间是:
— Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Thu, 06 Jun 2019 11:07:33 -0300
buf fix时间是:
— Kleber Sacilotto de Souza <kleber.souza@canonical.com> Mon, 24 Jun 2019 14:48:10 +0200
没有搜索到bug_fix对应的代码. 猜测是非开源代码. 也许这是google云干不过amazon的原因?
这里附上升级kernel的命令行.

apt-get install -y linux-image-4.15.0-1034-gcp linux-headers-4.15.0-1034-gcp linux-modules-4.15.0-1034-gcp  linux-modules-extra-4.15.0-1034-gcp

总结

  • 发布任何更新内容前, 准备好回滚方案, 总有想象不到的问题.
  • 自动化测试用例尽量丰富. 如果测试用例的返回包够大, 会及早暴露问题.
  • 能够重现的bug, 尽早追到底, 总有收获.
    原文作者:enjolras1205
    原文地址: https://segmentfault.com/a/1190000020890184
    本文转自网络文章,转载此文章仅为分享知识,如有侵权,请联系博主进行删除。
点赞