The Final Showdown with a Failed OpenStack VM Migration

February 2, 2023    Source: 李旭超

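Background:

1. A hardware board in a production private cloud failed, so every VM running on that board has to be migrated.

2. VMs on one host do heavy I/O and trigger the host's I/O alarm, so the VMs have to be spread across other hosts.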

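The plan, in theory:

1. Use virsh list to get the Name of each VM on the failed board.

2. Use virsh domuuid <Name> to look up each VM's uuid.

3. Migrate without specifying a target host, so that the scheduler picks the best node automatically.

Live migration (system disk on backend storage):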

nova live-migration <uuid>

Cold migration (system disk on local storage):

nova stop <uuid>
nova migrate <uuid>
watch -n 6 "nova list --all-tenants | grep <uuid>"
// Check the VM state every 6 seconds; once Status changes to VERIFY_RESIZE, run:
nova resize-confirm <uuid>

i.e. shut down the VM, qemu-img convert, rsync the disk file over, start the VM.
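The watch-and-confirm step above could also be scripted. The following is only a minimal sketch in Python, assuming that nova list prints its usual '|'-delimited table with ID and Status columns (VERIFY_RESIZE is reported under Status). The helper names parse_table, status_of and wait_for_status are hypothetical, the table text is supplied by a caller-provided fetch function, and failure states such as ERROR are not handled.

```python
import time


def parse_table(text):
    """Turn a '|'-delimited ASCII table into a list of {header: cell} dicts."""
    header, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip '+----+' borders and blank lines
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # the first '|' row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def status_of(table_text, uuid):
    """Return the Status cell of the row whose ID equals uuid, or None."""
    for row in parse_table(table_text):
        if row.get("ID") == uuid:
            return row.get("Status")
    return None


def wait_for_status(fetch, uuid, wanted="VERIFY_RESIZE",
                    interval=6, max_polls=100, sleep=time.sleep):
    """Poll fetch() until the VM reaches `wanted`; give up after max_polls."""
    for _ in range(max_polls):
        if status_of(fetch(), uuid) == wanted:
            return True
        sleep(interval)
    return False
```

In practice fetch would wrap something like subprocess.check_output(["nova", "list", "--all-tenants"], text=True), and a True result would be followed by nova resize-confirm <uuid>. Bounding the number of polls keeps the script from waiting forever when the migration rolls back.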

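What actually happened:

Attempt 1: migration failed

Troubleshooting:

1. Use nova instance-action-list <uuid> to find the Request_ID of the most recent migration.

2. Use nova instance-action <uuid> <Request_ID> to see the exception raised on the Python side.

3. The error was fairly obvious: the VM's image had been deleted. During migration the image has to be downloaded to the selected host so the VM can boot again, and it could not be found. A real pitfall; the only option was to pick another VM for the migration test.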
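Picking the most recent Request_ID out of the action list can be automated. This is a rough sketch in Python, assuming the table printed by nova instance-action-list has the columns Action, Request_ID, Message, Start_Time in that order and that Start_Time is an ISO-style timestamp (so plain string comparison orders it chronologically). The function name latest_request_id is hypothetical.

```python
import re

# Request IDs look like req-29a2557c-a908-43e5-8d78-bbc26c9a7369.
REQ_ID = re.compile(r"req-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def latest_request_id(output, action=None):
    """Return the Request_ID with the newest Start_Time, optionally per action."""
    latest = None  # (start_time, request_id) of the newest row seen so far
    for line in output.splitlines():
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        # Assumed column order: Action, Request_ID, Message, Start_Time.
        if len(cells) < 4 or not REQ_ID.fullmatch(cells[1]):
            continue  # border, header, or unrelated line
        if action is not None and cells[0] != action:
            continue
        candidate = (cells[3], cells[1])
        if latest is None or candidate > latest:
            latest = candidate
    return latest[1] if latest else None
```

The returned ID is what goes into nova instance-action <uuid> <Request_ID>, and it is also the keyword to grep for in the nova logs.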

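Attempt 2: migration failed

This time the error was not so obvious, so it was time to dig through the logs.

1. Searching the nova-scheduler log for the Request_ID turned up the key line: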

[req-29a2557c-a908-43e5-8d78-bbc26c9a7369] cold migration  565ec4e2-a1e6-4dfe-9477-3beba94a0fdc from  420D1823-AD1B-E711-9E9B-0425C59CDC2E to host 80865E33-8C76-E511-A89B-2A33C67636E0

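2. The migration failed and the selected host ID was 80865E33-8C76-E511-A89B-2A33C67636E0, so the next step was to log in to that host and read its nova-compute log.

3. On the target host, the key lines in the nova-compute log: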

    if ret is None:raise libvirtError('virDomainDefineXML() failed', conn=self)
libvirtError: invalid argument: could not find capabilities for domaintype=kvm

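4. Sigh. A libvirtError, and a kvm one at that. This did not look good. Time for the usual checks:

The host supports kvm virtualization, right?

egrep '^flags.*(vmx|svm)' /proc/cpuinfo // a big pile of output, so yes, it is supported

Which CPU is it?

cat /proc/cpuinfo  // OK, ding ding, it is Intel

Are the kvm kernel modules loaded?

lsmod | grep kvm // damn, there is the problem: only kvm  479242  0 shows up. Where is kvm_intel?! [2]

Fine, load it by hand.

modprobe kvm-intel // great, it errors out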
FATAL: Error inserting kvm_intel (/lib/modules/2.6.20-ARCH/kernel/drivers/kvm/kvm-intel.ko): Operation not supported

Time to ask for help and Google it:

The experts say: have a look at dmesg | grep kvm. Damn!

[   46.326597] kvm: disabled by bios
[1015924.318717] kvm: disabled by bios
[1016529.656032] kvm: disabled by bios

Great, kvm is disabled in the BIOS. Well, it is late at night; the migration can wait until tomorrow. No point torturing myself = =
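The chain of checks above (CPU flag, loaded modules, dmesg) can be folded into one small classifier. The sketch below is in Python and works on text that has already been collected from /proc/cpuinfo, lsmod and dmesg. The function names and the returned labels are hypothetical, and the only BIOS message it recognises is the "kvm: disabled by bios" string seen above.

```python
def cpu_virt_flag(cpuinfo_text):
    """Return 'vmx' (Intel), 'svm' (AMD) or None from /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = line.partition(":")[2].split()
            for flag in ("vmx", "svm"):
                if flag in flags:
                    return flag
    return None


def loaded_kvm_modules(lsmod_text):
    """Return the kvm-related module names found in lsmod output."""
    modules = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        if fields and fields[0] in ("kvm", "kvm_intel", "kvm_amd"):
            modules.add(fields[0])
    return modules


def diagnose_kvm(cpuinfo_text, lsmod_text, dmesg_text):
    """Classify the host state using the same order of checks as above."""
    flag = cpu_virt_flag(cpuinfo_text)
    if flag is None:
        return "no-hardware-support"
    vendor_module = "kvm_intel" if flag == "vmx" else "kvm_amd"
    if vendor_module in loaded_kvm_modules(lsmod_text):
        return "ok"
    if "kvm: disabled by bios" in dmesg_text.lower():
        return "disabled-in-bios"
    return vendor_module + "-not-loaded"
```

Running such a check on every candidate host before migrating would have caught this node early, since the CPU still advertises vmx even though the BIOS has switched the extension off.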

Attempt 3: migration succeeded!

(details omitted)

More migration issues:

issue 1:

[req-fc7c5a48-8983-4283-afde-e189fa3b5526] [instance: 5e280ac0-a4d4-4edd-a6ae-db725a90e2aa] Setting instance back to active after: Instance rollback performed due to: Unexpected error while running command. Command: rsync --sparse --compress /opt/HUAWEI/image/instances/5e280ac0-a4d4-4edd-a6ae-db725a90e2aa_resize/disk 172.28.0.126:/opt/HUAWEI/image/instances/5e280ac0-a4d4-4edd-a6ae-db725a90e2aa/disk Exit code: 23 Stdout: u'' Stderr: u'Warning: Permanently added \'172.28.0.126\' (RSA) to the list of known hosts.\r\n\nWelcome!\nrsync: read errors mapping "/opt/HUAWEI/image/instances/5e280ac0-a4d4-4edd-a6ae-db725a90e2aa_resize/disk": Input/output error (5)\nrsync: read errors mapping "/opt/HUAWEI/image/instances/5e280ac0-a4d4-4edd-a6ae-db725a90e2aa_resize/disk": Input/output error (5)\nERROR: disk failed verification -- update discarded.\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1040) [sender=3.0.4]\n'

The rsync error

read errors mapping ….: Input/output error (5)
indicates that rsync could not read or write a file. The most likely cause is a disk defect, on either the SRC or the TGT side. Other possibilities include insufficient permissions, a file lock held by an anti-virus program, and possibly other causes.
The first step toward a diagnosis is to try to copy the files manually. This may work if, for instance, the error came from a disk defect on the TGT side: repeating the operation later writes to a different section of the disk, and the problem may disappear.
Alternatively, you may discover that you cannot access the file in the SRC directory. In that case, run one of the disk-checking utilities available for your distro.
Insufficient privileges and anti-virus interference are easier to diagnose.
Lastly, if there is a bad sector under your SRC directory, you can exclude the affected files from future rsync runs with:
rsync -av --exclude='/home/my_name/directory_with_corrupt_files/*'
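One way to "copy the file manually" without writing anything is to read the source disk file from start to end and note where the first read fails. This is a minimal sketch in Python; the function name first_read_error is hypothetical, and it only tells you whether and where a sequential read breaks, not why.

```python
def first_read_error(path, chunk_size=1024 * 1024):
    """Read path sequentially without buffering.

    Returns (offset, exc) where exc is the first OSError hit and offset is
    the number of bytes read successfully before it, or (total_bytes, None)
    when the whole file was readable.
    """
    offset = 0
    try:
        with open(path, "rb", buffering=0) as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:  # empty bytes means end of file
                    return offset, None
                offset += len(chunk)
    except OSError as exc:
        return offset, exc
```

If the returned exception has errno 5 (errno.EIO), the read fails at the same kind of Input/output error that rsync reported, which points at the source disk rather than at the network or the target host. Reading a large instance disk this way takes a while, so it is best done with the VM stopped.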

####

References

[1] https://wiki.archlinux.org/index.php/Kernel_modules#Manual_module_handling

[2] https://wiki.archlinux.org/index.php/KVM

>>> Useful Tip: If modprobing kvm_intel or kvm_amd fails but modprobing kvm succeeds, (and lscpu claims that hardware acceleration is supported), check your BIOS settings. Some vendors (especially laptop vendors) disable these processor extensions by default. To determine whether there’s no hardware support or there is but the extensions are disabled in BIOS, the output from dmesg after having failed to modprobe will tell.

[3] How to interpret and fix a Input/output error in Linux?

    Original author: 李旭超
    Original article: https://zhuanlan.zhihu.com/p/27275895