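I have been using an AWS EC2 instance with a Tesla K80 GPU to run TensorFlow code. I have CUDA 9.0 and cuDNN 7.1.4 installed, and I am using TF 1.12, all on Ubuntu 16.04.
Everything worked fine until yesterday, but today the NVIDIA driver seems to have stopped working for some reason: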
ubuntu@ip-10-0-0-13:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I checked the installed driver packages:
ubuntu@ip-10-0-0-13:~$ dpkg -l | grep nvidia
rc nvidia-367 367.48-0ubuntu1 amd64 NVIDIA binary driver - version 367.48
ii nvidia-396 396.37-0ubuntu1 amd64 NVIDIA binary driver - version 396.37
ii nvidia-396-dev 396.37-0ubuntu1 amd64 NVIDIA binary Xorg driver development files
ii nvidia-machine-learning-repo-ubuntu1604 1.0.0-1 amd64 nvidia-machine-learning repository configuration files
ii nvidia-modprobe 396.37-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
rc nvidia-opencl-icd-367 367.48-0ubuntu1 amd64 NVIDIA OpenCL ICD
ii nvidia-opencl-icd-396 396.37-0ubuntu1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.8.2 amd64 Tools to enable NVIDIA's Prime
ii nvidia-settings 396.37-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
There seem to be two different versions present. Could that be the problem? (But then I don't understand why everything worked before.)
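A note on reading the listing above: in `dpkg -l` output the first column is the package status, where `ii` means installed and `rc` means the package was removed but its configuration files remain. Read that way, the 367 entries are leftovers rather than a second installed driver. A minimal sketch for filtering the listing down to installed packages follows; the function name `installed_nvidia` is made up for illustration.

```shell
#!/bin/sh
# Print name and version of NVIDIA packages whose dpkg status is "ii"
# (fully installed). Reads `dpkg -l` output on stdin.
# installed_nvidia is a hypothetical helper name, not a dpkg feature.
installed_nvidia() {
    awk '$1 == "ii" && $2 ~ /nvidia/ { print $2, $3 }'
}

# Usage (not executed here):
#   dpkg -l | installed_nvidia
```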
After finding this thread, I checked my kernel, which is apparently different from the one mentioned there:
ubuntu@ip-10-0-0-13:~$ uname -a
Linux ip-10-0-0-13 4.4.0-143-generic #169-Ubuntu SMP Thu Feb 7 07:56:38 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
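One thing that may be worth checking alongside the kernel version is whether a DKMS-built NVIDIA module exists for the running kernel. This is only a possible diagnostic, not a confirmed cause of the failure described here. The sketch below assumes `dkms status` prints lines such as `nvidia-396, 396.37, 4.4.0-143-generic, x86_64: installed`; the helper name `dkms_has_kernel` is hypothetical.

```shell
#!/bin/sh
# Succeed if `dkms status` output (read from stdin) contains an
# "installed" entry for the kernel release given as $1.
# Assumes the comma-separated dkms status format shown in the lead-in.
dkms_has_kernel() {
    awk -v k="$1" 'index($0, ", " k ",") && /installed/ { found = 1 }
                   END { exit !found }'
}

# Usage (not executed here):
#   dkms status | dkms_has_kernel "$(uname -r)" \
#       && echo "module built for running kernel" \
#       || echo "no module for running kernel"
```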
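Has anyone run into this problem and knows how to fix it? Thanks in advance for your help!
Edit:
When I tried to upgrade the driver using @Dehydrated_Mud's approach, I got the following error: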
ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.
And the contents of the log file:
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Mar 21 10:56:46 2019
installer version: 384.183
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
nvidia-installer command line:
./nvidia-installer
--no-drm
--disable-nouveau
--dkms
--silent
--install-libglvnd
Using built-in stream user interface
-> Detected 4 CPUs online; setting concurrency level to 4.
-> Installing NVIDIA driver version 384.183.
-> The NVIDIA driver appears to have been installed previously using a different installer. To prevent potential conflicts, it is recommended either to update the existing installation using the same mechanism by which it was originally installed, or to uninstall the existing installation before installing this driver.
Please review the message provided by the maintainer of this alternate installation method and decide how to proceed:
The package that is already installed is named nvidia-396.
You can upgrade the driver by running:
`apt-get install nvidia-396 nvidia-modprobe nvidia-settings`
You can remove nvidia-396, and all related packages, by running:
`apt-get remove --purge nvidia-396 nvidia-modprobe nvidia-settings`
This package is maintained by NVIDIA (cudatools@nvidia.com).
(Answer: Abort installation)
ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.
Running apt-cache search nvidia | grep -P '^nvidia-[0-9]+\s'
gives:
nvidia-331 - Transitional package for nvidia-331
nvidia-346 - Transitional package for nvidia-346
nvidia-304 - NVIDIA legacy binary driver - version 304.135
nvidia-340 - NVIDIA binary driver - version 340.107
nvidia-361 - Transitional package for nvidia-367
nvidia-352 - Transitional package for nvidia-375
nvidia-367 - Transitional package for nvidia-387
nvidia-375 - Transitional package for nvidia-418
nvidia-387 - NVIDIA binary driver - version 387.26
nvidia-418 - NVIDIA binary driver - version 418.39
nvidia-384 - NVIDIA binary driver - version 384.183
nvidia-390 - NVIDIA binary driver - version 390.116
nvidia-410 - NVIDIA binary driver - version 410.104
nvidia-396 - NVIDIA binary driver - version 396.82
I solved this problem by updating to the latest NVIDIA driver. Use:
nvcc --version
to get the version number of the CUDA toolkit. For CUDA 9.0 the latest driver is 384.183, while for CUDA 10.0 it is 410.104.
Then run:
wget http://us.download.nvidia.com/tesla/384.183/NVIDIA-Linux-x86_64-384.183.run
to download the driver.
Then run:
sudo sh ./NVIDIA-Linux-x86_64-384.183.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd
to install the driver.
Run:
nvidia-smi
to check whether the problem is solved.
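The version lookup and download steps above can be sketched as a few small shell helpers. This is an illustration under stated assumptions, not part of the original answer: the function names are made up, the release-to-driver mapping contains only the two pairs named in the answer, and the URL pattern is generalized from the single 384.183 link, so it may not hold for other versions.

```shell
#!/bin/sh
# Extract the toolkit release (e.g. "9.0") from `nvcc --version` output
# on stdin, which contains a line like
# "Cuda compilation tools, release 9.0, V9.0.176".
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\),.*/\1/p'
}

# Map a CUDA release to the driver version named in the answer.
# Only the two pairs given there are listed; anything else is an error.
driver_for_cuda() {
    case "$1" in
        9.0)  echo "384.183" ;;
        10.0) echo "410.104" ;;
        *)    echo "unknown CUDA release: $1" >&2; return 1 ;;
    esac
}

# Build a download URL from the pattern of the answer's wget command.
# Assumption: other driver versions follow the same path layout.
driver_url() {
    echo "http://us.download.nvidia.com/tesla/$1/NVIDIA-Linux-x86_64-$1.run"
}

# Usage (not executed here):
#   release=$(nvcc --version | cuda_release)
#   version=$(driver_for_cuda "$release") && wget "$(driver_url "$version")"
```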
Hello. Thanks for your answer, but because a driver is already installed, I ran into an error:
ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.
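Do I need to remove the driver that is already installed? If so, how do I do that? Thanks in advance!
Hey. When I try to run the installer, I get the following error: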
WARNING: One or more modprobe configuration files to disable Nouveau are already present at: /etc/modprobe.d/nvidia-installer-disable-nouveau.conf. Please be sure you have rebooted your system since these files were written....
I got that warning too. The installation still went ahead, and
nvidia-smi
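ran without problems.
Alda, could you update your question with the relevant lines of the log file and the output of apt-cache search nvidia | grep -P '^nvidia-[0-9]+\s'
Hello. I have updated the question. Thanks in advance :)
It looks like the 396.x installation is blocking the attempt to install 384.x. 396 is the latest driver for CUDA toolkit v9.2 (not 9.0), so a mismatch between the driver and toolkit versions appears to have caused your problem. I suggest removing 396 as described in the log file and then installing 384 as in my answer. There also seem to be other 396-related packages on your system, so after purging 396 you may need to reinstall some other packages.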
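The suggestion above can be sketched as a short script. The purge command is the one quoted in /var/log/nvidia-installer.log and the installer invocation is the one from the answer. The `DRY_RUN` variable and the `run` and `swap_driver` names are hypothetical conveniences added for this sketch. With `DRY_RUN` left at its default of 1, the commands are only printed, so the plan can be reviewed before anything is removed.

```shell
#!/bin/sh
# Sketch: replace the apt-installed 396 driver with the 384.183 runfile driver.
# DRY_RUN (hypothetical) defaults to 1, which prints commands instead of
# running them; set DRY_RUN=0 to execute for real.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

swap_driver() {
    # Purge command as quoted in the installer log.
    run sudo apt-get remove --purge nvidia-396 nvidia-modprobe nvidia-settings
    # Installer invocation as given in the answer (file already downloaded).
    run sudo sh ./NVIDIA-Linux-x86_64-384.183.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd
    # Check that the driver responds again.
    run nvidia-smi
}

# Usage (not executed here):
#   swap_driver              # print the plan
#   DRY_RUN=0; swap_driver   # actually run it
```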