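Installing the Nvidia GPU Driver + CUDA + cuDNN on Ubuntu 16.04

Background

This article describes how to install the Nvidia GPU driver on Ubuntu 16.04. If you run your AI services in Docker containers, you do not need to install CUDA or cuDNN on the host (this is the recommended approach). If you need to start AI services directly on the host, you also have to install CUDA and cuDNN (this approach is not recommended).

Gemfield's operating system is Kubuntu 16.04.02. Kubuntu 16.04 is an LTS release, so more point releases will follow.

Warning 1: on some machines the security feature in the BIOS has to be disabled before rebooting; otherwise the machine does not boot back into the system (the reason is not known yet).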

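Installing the Nvidia GPU driver

This step is mandatory whether or not you use Docker. There are two ways to do it, online and offline:

1. Online installation (recommended)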

gemfield@localhost:~$ sudo apt install nvidia-384-dev nvidia-modprobe

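2. Offline installation (not recommended)

Download the Nvidia GPU driver installer (a shell script) from the official Nvidia website and run it. The file I got at the time was NVIDIA-Linux-x86_64-381.22.run.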

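An error worth sharing

Today I installed nvidia-384-dev online, straight from apt, and got this error: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

The underlying problem is that the kernel did not load the nvidia driver: modprobe: ERROR: could not insert 'nvidia_384': Unknown symbol in module, or unknown parameter (see dmesg).

After the command-line installation finishes, check whether the driver modules are actually in place. In the case below there is no nvidia driver module at all (nvidiafb.ko and forcedeth.ko are unrelated in-tree kernel drivers):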

gemfield@ubuntu:/lib/modules# find . -name "*.ko" | grep -i nvidia
./4.4.0-127-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
./4.4.0-127-generic/kernel/drivers/net/ethernet/nvidia/forcedeth.ko

That is because apt had already warned during the installation that the kernel source is not installed:

Module build for the currently running kernel was skipped since the
kernel source for this kernel does not seem to be installed.

To install the kernel source:

sudo apt-get install linux-source
sudo apt-get install linux-headers-4.4.0-127-generic
Here 4.4.0-127-generic comes from the output of uname -r.
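
The kernel release does not have to be typed by hand. A minimal sketch, assuming the Ubuntu package naming scheme linux-headers-<release> shown above; headers_pkg is a hypothetical helper name of mine, not part of any tool:

```shell
# headers_pkg [RELEASE]: print the headers package name for a kernel release.
# Defaults to the running kernel as reported by `uname -r`.
headers_pkg() {
    printf 'linux-headers-%s\n' "${1:-$(uname -r)}"
}

# Show the package that matches the running kernel.
headers_pkg
# To install it: sudo apt-get install "$(headers_pkg)"
```

Deriving the name from uname -r avoids building headers for a kernel other than the one that is actually running.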

With the kernel source installed, run apt install nvidia-384 again, and the modules are there:

gemfield@ubuntu:/lib/modules# find . -name "*.ko" | grep -i nvidia
./4.4.0-127-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
./4.4.0-127-generic/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
./4.4.0-127-generic/updates/dkms/nvidia_384_drm.ko
./4.4.0-127-generic/updates/dkms/nvidia_384.ko
./4.4.0-127-generic/updates/dkms/nvidia_384_uvm.ko
./4.4.0-127-generic/updates/dkms/nvidia_384_modeset.ko
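
The find check above can be wrapped into a small test. This is only a sketch: has_nvidia_module is a hypothetical helper, and it assumes the driver modules are named nvidia.ko or nvidia_*.ko as in the listing above, which deliberately excludes nvidiafb.ko:

```shell
# has_nvidia_module [MODULE_DIR]: succeed if an Nvidia driver module exists
# under MODULE_DIR (default: the module tree of the running kernel).
has_nvidia_module() {
    moddir="${1:-/lib/modules/$(uname -r)}"
    found=$(find "$moddir" \( -name 'nvidia.ko' -o -name 'nvidia_*.ko' \) 2>/dev/null | head -n 1)
    [ -n "$found" ]
}

if has_nvidia_module; then
    echo "nvidia module found for kernel $(uname -r)"
else
    echo "no nvidia module for kernel $(uname -r); check the kernel headers"
fi
```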

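Installing CUDA (not needed with Docker)

To install the CUDA development libraries, pick your operating system and installation method on the official website. I used the deb (network) method: first download the 20 KB package cuda-repo-ubuntu1604_8.0.61-1_amd64.deb and install it to add Nvidia's apt source, then install CUDA with apt, which downloads a bit more than 1 GB over the network: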

gemfield@ai:~/Downloads$ sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb 
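
The apt half of that procedure is not shown above, so here is a sketch of the whole sequence. It assumes the meta-package is called cuda, as in Nvidia's instructions of that time; run and install_cuda are hypothetical helper names. By default it only prints the commands; set DRY_RUN=0 to really execute them:

```shell
# run CMD...: print the command; execute it only when DRY_RUN=0.
# DRY_RUN defaults to 1, so running this sketch as-is changes nothing.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# install_cuda REPO_DEB: register Nvidia's apt source, refresh the package
# index, then install CUDA from that source.
install_cuda() {
    run sudo dpkg -i "$1" &&
    run sudo apt-get update &&
    run sudo apt-get install cuda
}

install_cuda cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
```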

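Installing the cuDNN development library (not needed with Docker)

Again, download an archive from the official website, unpack it, and copy the files into the corresponding directories: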

gemfield@ai:~/Downloads$ ls
cuda-repo-ubuntu1604_8.0.61-1_amd64.deb  cudnn-8.0-linux-x64-v5.1.tgz  NVIDIA-Linux-x86_64-381.22.run
gemfield@ai:~/Downloads$ tar zxvf cudnn-8.0-linux-x64-v5.1.tgz 
cuda/include/cudnn.h
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.5
cuda/lib64/libcudnn.so.5.1.10
cuda/lib64/libcudnn_static.a

gemfield@ai:~/Downloads/cuda$ sudo cp ./include/cudnn.h /usr/local/cuda/include/
gemfield@ai:~/Downloads/cuda$ ls -l lib64/libcudnn*
lrwxrwxrwx 1 gemfield gemfield       13 11月  7  2016 lib64/libcudnn.so -> libcudnn.so.5
lrwxrwxrwx 1 gemfield gemfield       18 11月  7  2016 lib64/libcudnn.so.5 -> libcudnn.so.5.1.10
-rwxr-xr-x 1 gemfield gemfield 84163560 11月  7  2016 lib64/libcudnn.so.5.1.10
-rw-r--r-- 1 gemfield gemfield 70364814 11月  7  2016 lib64/libcudnn_static.a

gemfield@ai:~/Downloads/cuda$ sudo cp lib64/libcudnn.so.5.1.10 /usr/local/cuda/lib64/
gemfield@ai:~/Downloads/cuda$ sudo cp lib64/libcudnn_static.a /usr/local/cuda/lib64/
gemfield@ai:/usr/local/cuda/lib64$ sudo ln -s libcudnn.so.5.1.10 libcudnn.so.5
gemfield@ai:/usr/local/cuda/lib64$ sudo ln -s libcudnn.so.5 libcudnn.so
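
The copy and ln steps above can be condensed. This is a sketch rather than what I actually ran: copy_cudnn is a hypothetical helper, and it relies on cp -P, which copies symbolic links as links, so the two ln -s commands are no longer needed:

```shell
# copy_cudnn SRC [DST]: SRC is the unpacked "cuda" directory from the archive,
# DST the CUDA installation root (default: /usr/local/cuda).
copy_cudnn() {
    src="$1"
    dst="${2:-/usr/local/cuda}"
    [ -f "$src/include/cudnn.h" ] || { echo "no cudnn.h under $src" >&2; return 1; }
    # With -P the chain libcudnn.so -> libcudnn.so.5 -> libcudnn.so.5.1.10
    # is copied as links instead of three full copies of the library.
    mkdir -p "$dst/include" "$dst/lib64" &&
    cp "$src/include/cudnn.h" "$dst/include/" &&
    cp -P "$src"/lib64/libcudnn* "$dst/lib64/"
}

# Example, as root because /usr/local/cuda is not writable by a normal user:
#   copy_cudnn ~/Downloads/cuda /usr/local/cuda
```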

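If you go the Docker route, you do not need to install CUDA or cuDNN; see the article "Gemfield: Using nvidia-docker2" for how to use Docker.

OK, at this point the Nvidia GPU driver + CUDA + cuDNN are all installed. I have heard that driver conflicts can make the system fail to come back up after a reboot. Scary. So I am publishing this article before rebooting; I will reboot right after, and I may never get to see this article again.

=======Postscript=============

Today I installed Kubuntu 17.04 and the Nvidia driver on a new machine, and the driver installation failed with: The distribution-provided pre-install script failed! The log is below. (If you ignore this error and force the installation to continue, the new nvidia driver will not be loaded on the next boot.)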

gemfield@civilnet:~/Downloads$ cat /var/log/nvidia-installer.log
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Fri Oct 20 10:37:18 2017
installer version: 384.90

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

nvidia-installer command line:
    ./nvidia-installer

Unable to load: nvidia-installer ncurses v6 user interface

Using: nvidia-installer ncurses user interface
-> Detected 12 CPUs online; setting concurrency level to 12.
-> License accepted.
-> Installing NVIDIA driver version 384.90.
-> There appears to already be a driver installed on your system (version: 384.90).  As part of installing this driver (version: 384.90), the existing driver will be uninstalled.  Are you sure you want to continue? (Answer: Continue installation)
-> Running distribution scripts
   executing: '/usr/lib/nvidia/pre-install'...
-> done.
-> The distribution-provided pre-install script failed!  Are you sure you want to continue? (Answer: Abort installation)
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

The cause is that the Linux kernel has already loaded the nouveau driver:

gemfield@civilnet:~/Downloads$ lsmod |grep nouv
nouveau              1601536  3
mxm_wmi                16384  1 nouveau
video                  40960  1 nouveau
i2c_algo_bit           16384  1 nouveau
ttm                    98304  1 nouveau
drm_kms_helper        151552  1 nouveau
drm                   352256  6 nouveau,ttm,drm_kms_helper
wmi                    16384  2 mxm_wmi,nouveau
gemfield@civilnet:~/Downloads$
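
Grepping for "nouv" also matches modules that merely list nouveau in the "used by" column. A slightly stricter sketch, where nouveau_loaded is a hypothetical helper that reads lsmod output on standard input and looks only at the module-name column:

```shell
# nouveau_loaded: read `lsmod` output on stdin; succeed only if a module
# named exactly "nouveau" is listed in the first column.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

if lsmod 2>/dev/null | nouveau_loaded; then
    echo "nouveau is loaded; disable it before installing the Nvidia driver"
fi
```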

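The nouveau driver that ships with the installed Kubuntu has to be disabled, as follows (the file lives under /etc, so edit it as root):

sudo vi /etc/modprobe.d/blacklist-nouveau.conf

Put the following two lines in it: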

blacklist nouveau
options nouveau modeset=0

Regenerate the kernel initramfs:

sudo update-initramfs -u

Finally, reboot the system:

sudo reboot
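
For a machine that has to be set up without an interactive editor, the blacklist file can be written by a script. A minimal sketch; write_nouveau_blacklist is a hypothetical helper, and the optional path argument exists only so that it can be tried on a scratch file first:

```shell
# write_nouveau_blacklist [FILE]: write the two blacklist lines to FILE
# (default: /etc/modprobe.d/blacklist-nouveau.conf, which requires root).
write_nouveau_blacklist() {
    conf="${1:-/etc/modprobe.d/blacklist-nouveau.conf}"
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$conf"
}

# Example, as root, followed by the two commands from the text:
#   write_nouveau_blacklist
#   update-initramfs -u
#   reboot
```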

=======Postscript 2=============

Today nvidia-smi suddenly failed with the following error: Failed to initialize NVML: Driver/library version mismatch.

gemfield@ThinkPad-X1C:~$  nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

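The cause is that the nvidia driver package on Ubuntu was quietly upgraded again, from 384.90 to 384.111 according to the dpkg log: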

gemfield@ThinkPad-X1C:~$ cat /var/log/dpkg.log | grep nvidia-384
2018-01-10 06:19:23 upgrade nvidia-384:amd64 384.90-0ubuntu0.17.04.2 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:23 status half-configured nvidia-384:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:26 status unpacked nvidia-384:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:26 status half-installed nvidia-384:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:39 status half-installed nvidia-384:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:39 status unpacked nvidia-384:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:39 status unpacked nvidia-384:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:41 upgrade nvidia-384-dev:amd64 384.90-0ubuntu0.17.04.2 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:41 status half-configured nvidia-384-dev:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:41 status unpacked nvidia-384-dev:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:41 status half-installed nvidia-384-dev:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:44 status half-installed nvidia-384-dev:amd64 384.90-0ubuntu0.17.04.2
2018-01-10 06:19:44 status unpacked nvidia-384-dev:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:44 status unpacked nvidia-384-dev:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:45 configure nvidia-384:amd64 384.111-0ubuntu0.17.04.1 <无>
2018-01-10 06:19:45 status unpacked nvidia-384:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:45 status unpacked nvidia-384:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:45 status half-configured nvidia-384:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:19:59 status installed nvidia-384:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:20:00 configure nvidia-384-dev:amd64 384.111-0ubuntu0.17.04.1 <无>
2018-01-10 06:20:00 status unpacked nvidia-384-dev:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:20:00 status half-configured nvidia-384-dev:amd64 384.111-0ubuntu0.17.04.1
2018-01-10 06:20:00 status installed nvidia-384-dev:amd64 384.111-0ubuntu0.17.04.1
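
Most of those lines are status noise; only the "upgrade" records matter. A sketch that keeps just those, assuming the dpkg.log field layout seen above (date, time, action, package, old version, new version); nvidia_upgrades is a hypothetical helper name:

```shell
# nvidia_upgrades [LOG]: list nvidia-* package upgrades recorded by dpkg,
# one per line, as: date time package old-version -> new-version
nvidia_upgrades() {
    log="${1:-/var/log/dpkg.log}"
    [ -r "$log" ] || return 1
    awk '$3 == "upgrade" && $4 ~ /^nvidia-/ { print $1, $2, $4, $5, "->", $6 }' "$log"
}

nvidia_upgrades || echo "cannot read /var/log/dpkg.log"
```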

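How to fix it? Reboot, so that the kernel loads the upgraded module and it matches the upgraded user-space libraries again.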
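
To confirm that this is really the situation before rebooting, the loaded module version can be compared with the installed package version. This is a sketch under assumptions: it expects /proc/driver/nvidia/version to contain a line like "NVRM version: NVIDIA UNIX x86_64 Kernel Module  384.90  ...", and both function names are hypothetical:

```shell
# loaded_driver_version [FILE]: version of the kernel module currently loaded,
# taken from the word that follows "Module" on the NVRM line.
loaded_driver_version() {
    src="${1:-/proc/driver/nvidia/version}"
    [ -r "$src" ] || return 1
    awk '/^NVRM version:/ {
        for (i = 1; i < NF; i++) if ($i == "Module") { print $(i + 1); exit }
    }' "$src"
}

# versions_match LOADED PACKAGE: succeed when the package version (for example
# 384.111-0ubuntu0.17.04.1) is LOADED itself or LOADED followed by "-...".
versions_match() {
    case "$2" in
        "$1"|"$1"-*) return 0 ;;
        *) return 1 ;;
    esac
}

loaded=$(loaded_driver_version) || loaded=""
pkg=$(dpkg-query -W -f='${Version}' nvidia-384 2>/dev/null) || pkg=""
if [ -n "$loaded" ] && [ -n "$pkg" ] && ! versions_match "$loaded" "$pkg"; then
    echo "module $loaded is loaded but package $pkg is installed: reboot"
fi
```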

How do you turn off Ubuntu's automatic updates?

sudo vi /etc/apt/apt.conf.d/50unattended-upgrades

Then comment out the entries in the following section:

Unattended-Upgrade::Allowed-Origins {
        //"${distro_id}:${distro_codename}";
        //"${distro_id}:${distro_codename}-security";
        // Extended Security Maintenance; doesn't necessarily exist for
        // every release and this system may not have it installed, but if
        // available, the policy for updates is such that unattended-upgrades
        // should also install from here by default.
        //"${distro_id}ESM:${distro_codename}";
//      "${distro_id}:${distro_codename}-updates";
//      "${distro_id}:${distro_codename}-proposed";
//      "${distro_id}:${distro_codename}-backports";
};
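
The same edit can be scripted. A sketch with a hypothetical helper, comment_out_origins, which prints the modified file to standard output instead of editing it in place, so the result can be reviewed before it replaces the original:

```shell
# comment_out_origins FILE: print FILE with every still-active entry inside
# the Unattended-Upgrade::Allowed-Origins { ... }; block prefixed by "//".
# Entries that are already commented out and other blocks are left untouched.
comment_out_origins() {
    sed '/^Unattended-Upgrade::Allowed-Origins {/,/^};/ s|^\([[:space:]]*\)"|\1//"|' "$1"
}

# Example:
#   comment_out_origins /etc/apt/apt.conf.d/50unattended-upgrades > /tmp/50unattended-upgrades
#   sudo cp /tmp/50unattended-upgrades /etc/apt/apt.conf.d/50unattended-upgrades
```

A narrower alternative, which is not what I did here, would be to pin only the driver package with sudo apt-mark hold nvidia-384 and leave the other automatic updates on.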

    Original author: Gemfield
    Original URL: https://zhuanlan.zhihu.com/p/28786117