Notes on a Misdirected Docker Troubleshooting Session

Environment

OS: SUSE Linux Enterprise Server 12 SP2
DOCKER: 1.12.6
KERNEL: 4.4.59-92.20-default
RANCHER: v1.6.2

Problem

One day in January 2018, a server in the test environment started logging the following kernel message:

kernel:[1854773.108055] unregister_netdevice: waiting for eth0 to become free. Usage count = 1

The temporary workaround: reboot 😅

Troubleshooting

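  1. Searching for the error message quickly turned up this bug, opened on 6 May 2014:
    https://github.com/moby/moby/issues/5618
    (The issue appears to be resolved by now, but it was not back in January.)

  2. Over the following days, the message was accompanied by other "noise": higher CPU load, the docker ps command hanging, and so on.

  3. Someone published a dedicated reproducer for this problem:
    https://github.com/fho/docker-samba-loop
    It reproduces the problem on the kernel version listed above: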

kernel:[1598.704278] unregister_netdevice: waiting for lo to become free. Usage count = 1

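If the Dockerfile is modified to append the command

sleep 10

the kernel error message no longer appears, possibly because the network connections are closed cleanly during the wait.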

  4. This bug was fixed in kernel 4.4.114:
    https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.4.114

commit edaafa805e0f9d09560a4892790b8e19cab8bf09
Author: Dan Streetman <ddstreet@ieee.org>
Date:   Thu Jan 18 16:14:26 2018 -0500

    net: tcp: close sock if net namespace is exiting
    
    
    [ Upstream commit 4ee806d51176ba7b8ff1efd81f271d7252e03a1d ]
    
    When a tcp socket is closed, if it detects that its net namespace is
    exiting, close immediately and do not wait for FIN sequence.
    
    For normal sockets, a reference is taken to their net namespace, so it will
    never exit while the socket is open.  However, kernel sockets do not take a
    reference to their net namespace, so it may begin exiting while the kernel
    socket is still open.  In this case if the kernel socket is a tcp socket,
    it will stay open trying to complete its close sequence.  The sock's dst(s)
    hold a reference to their interface, which are all transferred to the
    namespace's loopback interface when the real interfaces are taken down.
    When the namespace tries to take down its loopback interface, it hangs
    waiting for all references to the loopback interface to release, which
    results in messages like:
    
    unregister_netdevice: waiting for lo to become free. Usage count = 1
    
    These messages continue until the socket finally times out and closes.
    Since the net namespace cleanup holds the net_mutex while calling its
    registered pernet callbacks, any new net namespace initialization is
    blocked until the current net namespace finishes exiting.
    
    After this change, the tcp socket notices the exiting net namespace, and
    closes immediately, releasing its dst(s) and their reference to the
    loopback interface, which lets the net namespace continue exiting.
    
    Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
    Signed-off-by: Dan Streetman <ddstreet@canonical.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

After upgrading, repeating step 3 no longer produced the error.
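As a rough way to tell whether a host in the 4.4.y series is at or past the release named in the ChangeLog above, the release string can be compared against 4.4.114. This is only a sketch: the function names are made up for illustration, and it assumes the upstream version number is meaningful. Distribution kernels often backport fixes, so a lower number does not prove the fix is absent.

```python
import re

# First 4.4.y stable release whose ChangeLog lists the commit quoted above.
FIXED_IN = (4, 4, 114)


def parse_kernel_release(release):
    """Extract (major, minor, patch) from a `uname -r` style string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def at_or_past_4_4_114(release):
    """Return True if a 4.4.y release string is >= 4.4.114.

    Other kernel series are rejected, because this sketch makes no claim
    about which of their releases carry the fix.
    """
    version = parse_kernel_release(release)
    if version[:2] != FIXED_IN[:2]:
        raise ValueError("only the 4.4.y series is handled here")
    return version >= FIXED_IN


# The kernel from the Environment section predates the fix.
print(at_or_past_4_4_114("4.4.59-92.20-default"))
```

On a live host, the string returned by `platform.release()` could be passed in instead of the hard-coded value.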

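Verification

We upgraded the kernel on one production host to 4.4.114, but the problem persisted.
Perhaps the difference lies in the interface involved: lo in the reproducer, eth0 in the original message?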
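To see which interface a given host is actually stuck on, the kernel messages can be tallied by device name. The following is a minimal sketch, not part of the original investigation: the function name and the regular expression are assumptions based on the two messages quoted in this post, and the third sample line is a made-up filler that only shows non-matching lines being skipped.

```python
import re
from collections import Counter

# Pattern derived from the two messages quoted in this post.
PATTERN = re.compile(
    r"unregister_netdevice: waiting for (?P<dev>\S+) to become free\. "
    r"Usage count = (?P<count>\d+)"
)


def tally_stuck_devices(lines):
    """Count unregister_netdevice messages per network device."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


sample = [
    "kernel:[1854773.108055] unregister_netdevice: waiting for eth0 to become free. Usage count = 1",
    "kernel:[1598.704278] unregister_netdevice: waiting for lo to become free. Usage count = 1",
    "an unrelated line (hypothetical filler)",
]

for device, seen in tally_stuck_devices(sample).items():
    print(device, seen)
```

If a host only ever reports eth0, that would suggest it is hitting a different code path than the lo case described in the commit message above.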

Follow-up

To be continued

    Original author: 老吕子
    Original post: https://www.jianshu.com/p/4ce0412b50c3