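Background
This article covers installing the Nvidia GPU driver on Ubuntu 16.04. If you run AI services in Docker containers (the recommended approach), you do not need to install CUDA or cuDNN on the host. If you run AI services directly on the host (not recommended), you also need to install CUDA and cuDNN.
Gemfield uses Kubuntu 16.04.2. Kubuntu 16.04 is an LTS release, so more patch releases will follow.
Warning 1: on some machines the security feature in the BIOS has to be disabled before rebooting, otherwise the system will not boot (the reason is not yet known).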
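Installing the Nvidia GPU driver
This step is required whether or not you use the Docker-based approach. There are two ways to do it, online and offline:
1. Online installation (recommended)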
gemfield@localhost:~$ sudo apt install nvidia-384-dev nvidia-modprobe
2. Offline installation (not recommended)
Download the installer (a shell script) from the Nvidia website and run it. When I downloaded it, the file was NVIDIA-Linux-x86_64-381.22.run.
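An error worth sharing
Today I installed nvidia-384-dev online with apt as described above and got this error: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
The underlying problem is that the kernel did not load the nvidia driver: modprobe: ERROR: could not insert 'nvidia_384': Unknown symbol in module, or unknown parameter (see dmesg).
After installing from the command line, check whether the driver module is actually in place. In the case below there is no nvidia driver module, only the unrelated nvidiafb.ko and forcedeth.ko: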
gemfield@ubuntu:/lib/modules# find . -name "*.ko" | grep -i nvidia
./4.4.0-127-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
./4.4.0-127-generic/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
That is because apt had already warned during installation that the kernel source was not installed:
Module build for the currently running kernel was skipped since the
kernel source for this kernel does not seem to be installed.
To install the kernel source:
sudo apt-get install linux-source
sudo apt-get install linux-headers-4.4.0-127-generic
Here 4.4.0-127-generic comes from the output of uname -r.
After installing the kernel source, run apt install nvidia-384 again, and the modules appear:
gemfield@ubuntu:/lib/modules# find . -name "*.ko" | grep -i nvidia
./4.4.0-127-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
./4.4.0-127-generic/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
./4.4.0-127-generic/updates/dkms/nvidia_384_drm.ko
./4.4.0-127-generic/updates/dkms/nvidia_384.ko
./4.4.0-127-generic/updates/dkms/nvidia_384_uvm.ko
./4.4.0-127-generic/updates/dkms/nvidia_384_modeset.ko
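The two checks above (which headers package matches the running kernel, and whether a real nvidia module was built) can be sketched as small shell helpers. This is only a sketch: headers_package and has_nvidia_module are hypothetical names of my own, and the pattern assumes the module file is called nvidia.ko or nvidia_<version>.ko as in the listing above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any Nvidia or Ubuntu tooling.

headers_package() {
    # $1: kernel release; defaults to the running kernel (uname -r)
    release="${1:-$(uname -r)}"
    printf 'linux-headers-%s\n' "$release"
}

has_nvidia_module() {
    # $1: module directory, normally /lib/modules/$(uname -r)
    # Succeeds only for nvidia.ko or nvidia_<digits>.ko, so nvidiafb.ko
    # and forcedeth.ko do not count as the GPU driver.
    find "$1" -name 'nvidia*.ko' 2>/dev/null | grep -Eq '/nvidia(_[0-9]+)?\.ko$'
}
```

Possible usage: sudo apt-get install "$(headers_package)", and afterwards has_nvidia_module "/lib/modules/$(uname -r)" && echo "driver module present".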
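Installing CUDA (not needed with Docker)
To install the CUDA development libraries, pick your operating system and installation method on the official website. I used the deb (network) method: first download the roughly 20 KB cuda-repo-ubuntu1604_8.0.61-1_amd64.deb package and install it to add Nvidia's apt source, then install with apt, which downloaded a bit over 1 GB over the network: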
gemfield@ai:~/Downloads$ sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
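Installing the cuDNN development library (not needed with Docker)
Again, download an archive from the official website, extract it, and copy the files into the corresponding directories: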
gemfield@ai:~/Downloads$ ls
cuda-repo-ubuntu1604_8.0.61-1_amd64.deb  cudnn-8.0-linux-x64-v5.1.tgz  NVIDIA-Linux-x86_64-381.22.run
gemfield@ai:~/Downloads$ tar zxvf cudnn-8.0-linux-x64-v5.1.tgz
cuda/include/cudnn.h
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.5
cuda/lib64/libcudnn.so.5.1.10
cuda/lib64/libcudnn_static.a
gemfield@ai:~/Downloads/cuda$ sudo cp ./include/cudnn.h /usr/local/cuda/include/
gemfield@ai:~/Downloads/cuda$ ls -l lib64/libcudnn*
lrwxrwxrwx 1 gemfield gemfield 13 11月 7 2016 lib64/libcudnn.so -> libcudnn.so.5
lrwxrwxrwx 1 gemfield gemfield 18 11月 7 2016 lib64/libcudnn.so.5 -> libcudnn.so.5.1.10
-rwxr-xr-x 1 gemfield gemfield 84163560 11月 7 2016 lib64/libcudnn.so.5.1.10
-rw-r--r-- 1 gemfield gemfield 70364814 11月 7 2016 lib64/libcudnn_static.a
gemfield@ai:~/Downloads/cuda$ sudo cp lib64/libcudnn.so.5.1.10 /usr/local/cuda/lib64/
gemfield@ai:~/Downloads/cuda$ sudo cp lib64/libcudnn_static.a /usr/local/cuda/lib64/
gemfield@ai:/usr/local/cuda/lib64$ sudo ln -s libcudnn.so.5.1.10 libcudnn.so.5
gemfield@ai:/usr/local/cuda/lib64$ sudo ln -s libcudnn.so.5 libcudnn.so
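To confirm that the copied header matches the library version (5.1.10 here), one option is to read the version macros out of cudnn.h. A minimal sketch, assuming the header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL on separate #define lines, as I believe the cuDNN 5.x header does; cudnn_version is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print MAJOR.MINOR.PATCHLEVEL from a cudnn.h file.

cudnn_version() {
    # $1: path to cudnn.h; defaults to the location used in this article
    header="${1:-/usr/local/cuda/include/cudnn.h}"
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END {
            # Fail if the macros were not found (assumption did not hold)
            if (major == "") exit 1
            print major "." minor "." patch
        }
    ' "$header"
}
```

Calling cudnn_version with no argument reads /usr/local/cuda/include/cudnn.h; the result can then be compared by eye with the libcudnn.so.5.1.10 file name.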
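If you use the Docker-based approach, you do not need to install CUDA or cuDNN; see Gemfield's article "使用nvidia-docker2" (Using nvidia-docker2) for how to use Docker.
That's it: the Nvidia GPU driver, CUDA, and cuDNN are now installed. I have heard that driver conflicts can make the system fail to boot after a restart. That is a little scary, so I am publishing this article before rebooting. Once it is published I will reboot, and I may not be able to see this article afterwards.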
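=======Postscript=============
Today I reinstalled Kubuntu 17.04 and the Nvidia driver on a new machine, and the driver installation failed with: The distribution-provided pre-install script failed! The log is below. (If you ignore this error and force the installation to continue, the new nvidia driver will not be loaded on the next reboot.)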
gemfield@civilnet:~/Downloads$ cat /var/log/nvidia-installer.log
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Fri Oct 20 10:37:18 2017
installer version: 384.90
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
nvidia-installer command line:
./nvidia-installer
Unable to load: nvidia-installer ncurses v6 user interface
Using: nvidia-installer ncurses user interface
-> Detected 12 CPUs online; setting concurrency level to 12.
-> License accepted.
-> Installing NVIDIA driver version 384.90.
-> There appears to already be a driver installed on your system (version: 384.90). As part of installing this driver (version: 384.90), the existing driver will be uninstalled. Are you sure you want to continue? (Answer: Continue installation)
-> Running distribution scripts
executing: '/usr/lib/nvidia/pre-install'...
-> done.
-> The distribution-provided pre-install script failed! Are you sure you want to continue? (Answer: Abort installation)
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
The cause is that the Linux kernel has already loaded the nouveau driver:
gemfield@civilnet:~/Downloads$ lsmod |grep nouv
nouveau 1601536 3
mxm_wmi 16384 1 nouveau
video 40960 1 nouveau
i2c_algo_bit 16384 1 nouveau
ttm 98304 1 nouveau
drm_kms_helper 151552 1 nouveau
drm 352256 6 nouveau,ttm,drm_kms_helper
wmi 16384 2 mxm_wmi,nouveau
gemfield@civilnet:~/Downloads$
The nouveau driver that ships with Kubuntu has to be disabled, as follows:
sudo vi /etc/modprobe.d/blacklist-nouveau.conf
Add these two lines:
blacklist nouveau
options nouveau modeset=0
Regenerate the kernel initramfs:
sudo update-initramfs -u
Finally, reboot the system:
sudo reboot
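The nouveau steps above can also be scripted. The following is a rough sketch, not an official procedure: write_nouveau_blacklist and nouveau_loaded are hypothetical helper names, and the target path is passed in as an argument so the function can be tried on a scratch file before touching /etc/modprobe.d.

```shell
#!/bin/sh
# Hypothetical helpers for the nouveau blacklist step.

write_nouveau_blacklist() {
    # $1: file to write; on a real system this would be
    # /etc/modprobe.d/blacklist-nouveau.conf (needs root)
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$1"
}

nouveau_loaded() {
    # Reads lsmod output on stdin; succeeds only if a module named
    # exactly "nouveau" is listed in the first column.
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}
```

After the reboot, lsmod | nouveau_loaded && echo "nouveau is still loaded" gives a quick check that the blacklist took effect.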
=======Postscript 2=============
Today, running nvidia-smi suddenly produced this error: Failed to initialize NVML: Driver/library version mismatch.
gemfield@ThinkPad-X1C:~$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
The cause is that Ubuntu quietly updated the nvidia driver again:
gemfield@ThinkPad-X1C:~$ cat /var/log/dpkg.log | grep nvidia-384
2018-01-10 06:19:23 upgrade nvidia-384:amd64 384.90-0ubuntu0.17.04.2 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:23 status half-configured nvidia-384:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:26 status unpacked nvidia-384:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:26 status half-installed nvidia-384:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:39 status half-installed nvidia-384:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:39 status unpacked nvidia-384:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:39 status unpacked nvidia-384:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:41 upgrade nvidia-384-dev:amd64 384.90-0ubuntu0.17.04.2 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:41 status half-configured nvidia-384-dev:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:41 status unpacked nvidia-384-dev:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:41 status half-installed nvidia-384-dev:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:44 status half-installed nvidia-384-dev:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:44 status unpacked nvidia-384-dev:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:44 status unpacked nvidia-384-dev:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:45 configure nvidia-384:amd64 384.111-0ubuntu0.17.04.1 <无>
2018-01-10 06:19:45 status unpacked nvidia-384:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:45 status unpacked nvidia-384:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:45 status half-configured nvidia-384:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:59 status installed nvidia-384:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:20:00 configure nvidia-384-dev:amd64 384.111-0ubuntu0.17.04.1 <无>
2018-01-10 06:20:00 status unpacked nvidia-384-dev:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:20:00 status half-configured nvidia-384-dev:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:20:00 status installed nvidia-384-dev:amd64 384.111-0ubuntu0.17.04.1
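How to fix it? Reboot. The dpkg log shows the packages going from 384.90 to 384.111, so the user-space libraries on disk no longer match the kernel module that is still loaded, and a reboot loads the new module.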
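To confirm this kind of mismatch before rebooting, the version of the loaded kernel module can be compared with the last version dpkg upgraded to. This is a sketch under assumptions: loaded_driver_version and latest_upgraded_version are hypothetical helpers, the first expects text in the format of /proc/driver/nvidia/version (a line containing "Kernel Module" followed by the version number), and the second expects dpkg.log lines like the ones shown above.

```shell
#!/bin/sh
# Hypothetical helpers for spotting a driver/library version mismatch.

loaded_driver_version() {
    # Reads /proc/driver/nvidia/version-style text on stdin and prints
    # the word that follows "Module" on the "Kernel Module" line.
    awk '/Kernel Module/ {
        for (i = 1; i < NF; i++)
            if ($i == "Module") { print $(i + 1); exit }
    }'
}

latest_upgraded_version() {
    # $1: package name, e.g. nvidia-384; reads dpkg.log lines on stdin.
    # Prints the upstream part (before the first "-") of the new version
    # in the last matching "upgrade" line; fails if there is none.
    awk -v pkg="$1" '
        $3 == "upgrade" { split($4, name, ":"); if (name[1] == pkg) v = $6 }
        END { if (v == "") exit 1; sub(/-.*/, "", v); print v }
    '
}
```

For example, loaded_driver_version < /proc/driver/nvidia/version and latest_upgraded_version nvidia-384 < /var/log/dpkg.log should print the same number on a healthy system; if they differ, the loaded module is stale.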
And how do you disable Ubuntu's automatic updates?
sudo vi /etc/apt/apt.conf.d/50unattended-upgrades
Then comment out the entries in the following block:
Unattended-Upgrade::Allowed-Origins {
//"${distro_id}:${distro_codename}";
//"${distro_id}:${distro_codename}-security";
// Extended Security Maintenance; doesn't necessarily exist for
// every release and this system may not have it installed, but if
// available, the policy for updates is such that unattended-upgrades
// should also install from here by default.
//"${distro_id}ESM:${distro_codename}";
// "${distro_id}:${distro_codename}-updates";
// "${distro_id}:${distro_codename}-proposed";
// "${distro_id}:${distro_codename}-backports";
};
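To double-check the edit, it can help to list whatever entries in that block are still active. A small sketch, assuming the file keeps the layout shown above (one quoted origin per line, the block closed by a line containing only "};"); active_origins is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: list Allowed-Origins entries not commented out.

active_origins() {
    # Reads a 50unattended-upgrades-style file on stdin and prints each
    # line inside the Allowed-Origins block that starts with a quote,
    # i.e. entries that are not prefixed with //.
    awk '
        /^[ \t]*Unattended-Upgrade::Allowed-Origins/ { inside = 1; next }
        inside && $1 == "};" { inside = 0 }
        inside && /^[ \t]*"/ { sub(/^[ \t]*/, ""); print }
    '
}
```

Running active_origins < /etc/apt/apt.conf.d/50unattended-upgrades should print nothing once every origin has been commented out.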