To resolve this issue, we recommend disabling automatic updates and then reinstalling the NVIDIA driver and libraries.
Symptoms:
Failed to initialize NVML: driver/library version mismatch
nvidia-smi failure
root@instance-1:/var/log/apt# nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.113
- This is caused by a version mismatch between the loaded NVIDIA GPU kernel driver and the user-space NVML library.
NVRM: API mismatch
/var/log/syslog shows an NVRM error:
NVRM: API mismatch: the client has the version 535.113.01, but
this kernel module has the version 535.104.12.
- The user-space client libraries (version 535.113.01) do not match the loaded NVIDIA kernel module (version 535.104.12).
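To confirm which side is out of date, both version numbers can be pulled out of the NVRM message. The following is a minimal Bash sketch, assuming the message wording matches the syslog excerpt above; the function name `nvrm_versions` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "NVRM: API mismatch" log text on stdin and print
# "<client version> <kernel module version>". Assumes the message wording
# shown in the syslog excerpt above.
nvrm_versions() {
  local text client kernel
  # syslog splits the message across two lines, so join everything first.
  text=$(tr '\n' ' ')
  client=$(grep -oE 'client has the version [0-9]+(\.[0-9]+)+' <<<"$text" | head -n1)
  kernel=$(grep -oE 'kernel module has the version [0-9]+(\.[0-9]+)+' <<<"$text" | head -n1)
  # Keep only the trailing version number of each match.
  client=${client##* }
  kernel=${kernel##* }
  if [[ -z $client || -z $kernel ]]; then
    echo "no NVRM API mismatch message found" >&2
    return 1
  fi
  printf '%s %s\n' "$client" "$kernel"
}
```

For example, `grep NVRM /var/log/syslog | nvrm_versions` prints the client version followed by the kernel module version. If the two differ, as with 535.113.01 and 535.104.12 here, the user-space libraries and the loaded kernel module are out of sync.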
Cause of issue:
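This mismatch occurs when automatic updates install a new driver version: the NVIDIA user-space libraries are upgraded, while the kernel module that is already loaded stays at the previous version.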
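One way to check whether an automatic update really upgraded the NVIDIA packages is to look at apt's history log in /var/log/apt. The sketch below assumes the usual layout of that log (a "Start-Date:" line followed by an "Upgrade:" line of comma-separated "package:arch (old, new)" entries); the function name `nvidia_upgrades` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list NVIDIA-related package upgrades recorded in apt's
# history log, each prefixed with the transaction's start date and time.
# Assumes the "Start-Date:" / "Upgrade:" layout of /var/log/apt/history.log.
nvidia_upgrades() {
  local log=${1:-/var/log/apt/history.log}
  awk '
    /^Start-Date:/ { date = $2 " " $3 }
    /^Upgrade:/ {
      # Drop the "Upgrade: " prefix and split into per-package entries.
      n = split(substr($0, 10), pkgs, /\), /)
      for (i = 1; i <= n; i++) {
        p = pkgs[i]
        if (p !~ /\)$/) p = p ")"   # restore the ")" consumed by split
        if (tolower(p) ~ /nvidia/) print date, p
      }
    }
  ' "$log"
}
```

Each output line shows when the upgrade ran and the old and new versions of an NVIDIA package. Rotated logs (history.log.1.gz and so on) are compressed, so they would need to be decompressed first, for example with `zcat`.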
Diagnostic commands:
grep NVRM /var/log/syslog
dpkg -l | grep -Ei "cuda|nvidia"
dkms status
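- The first command shows the NVRM messages in the system log, the second lists the installed CUDA and NVIDIA packages with their versions, and the third shows which NVIDIA kernel module versions DKMS has built and for which kernels. Together they show whether the installed packages match the kernel module.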
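The diagnostic commands above can be bundled into a single report to attach to a support case. This is a sketch, not an official tool: the function name `collect_nvidia_diag` is hypothetical, and it adds one check not in the list above, reading /proc/driver/nvidia/version, which reports the version of the NVIDIA kernel module that is currently loaded. Sections whose command or file is unavailable are reported and skipped.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the diagnostics above as one report.
# Missing commands or unreadable files are noted instead of failing.
collect_nvidia_diag() {
  echo "== NVRM messages in /var/log/syslog =="
  if [[ -r /var/log/syslog ]]; then
    grep NVRM /var/log/syslog || echo "(no NVRM messages)"
  else
    echo "(/var/log/syslog not readable; try running as root)"
  fi

  echo "== Installed CUDA/NVIDIA packages =="
  if command -v dpkg >/dev/null; then
    dpkg -l | grep -Ei "cuda|nvidia" || echo "(none installed)"
  else
    echo "(dpkg not available)"
  fi

  echo "== dkms status =="
  if command -v dkms >/dev/null; then
    dkms status || echo "(dkms status failed)"
  else
    echo "(dkms not installed)"
  fi

  # Version of the kernel module that is loaded right now (if any).
  echo "== Loaded kernel module version =="
  if [[ -r /proc/driver/nvidia/version ]]; then
    cat /proc/driver/nvidia/version
  else
    echo "(nvidia kernel module not loaded)"
  fi
  return 0
}
```

For example, `collect_nvidia_diag > nvidia-diag.txt` saves the report. Comparing the loaded module version with the package versions from `dpkg -l` shows the mismatch directly.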
Solution:
- We suggest turning off automatic updates. On Ubuntu, run `sudo dpkg-reconfigure unattended-upgrades` and choose "No" when prompted.
- Run the following commands to reinstall the NVIDIA driver and libraries, then reboot the instance.
sudo apt purge "nvidia*" "libnvidia*"
# Replace 535 with the required driver branch, usually the major version reported in the logs.
sudo apt install nvidia-driver-535
sudo reboot
- After the instance restarts, run `nvidia-smi` again; it should now work.
- If you receive this error:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running
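Run the following commands to reinstall the CUDA packages and reboot again. The `cuda` package is provided by NVIDIA's CUDA apt repository, so that repository must already be configured on the instance.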
sudo apt-get update
sudo apt-get -y install cuda
sudo reboot
- The driver should work normally after the reboot; verify this by running `nvidia-smi`.
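After working through the steps above, a short script can confirm both that automatic upgrades stay off and that `nvidia-smi` works. This is a sketch under stated assumptions: it assumes `dpkg-reconfigure unattended-upgrades` records the choice as `APT::Periodic::Unattended-Upgrade` in /etc/apt/apt.conf.d/20auto-upgrades, that `nvidia-smi` exits with a non-zero status when it fails, and it recognizes only the two error messages quoted in this article. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of post-fix checks. All function names are hypothetical.

# 1) Report whether APT::Periodic::Unattended-Upgrade is "0" in the file that
#    dpkg-reconfigure edits (path assumed: /etc/apt/apt.conf.d/20auto-upgrades).
unattended_upgrade_setting() {
  local conf=${1:-/etc/apt/apt.conf.d/20auto-upgrades} value
  [[ -r $conf ]] || { echo "missing"; return 0; }
  # Take the last assignment if the key appears more than once.
  value=$(grep -E '^[[:space:]]*APT::Periodic::Unattended-Upgrade[[:space:]]' "$conf" \
    | tail -n1 | grep -oE '"[^"]*"' | tr -d '"')
  case $value in
    0) echo "disabled" ;;
    "") echo "not set" ;;
    *) echo "enabled" ;;
  esac
}

# 2) Map nvidia-smi's exit status and output to the steps in this article.
classify_nvidia_smi() {
  local status=$1 output=$2
  if [[ $status -eq 0 ]]; then
    echo "ok"
    return 0
  fi
  case $output in
    *"Driver/library version mismatch"*) echo "mismatch" ;;
    *"couldn't communicate with the NVIDIA driver"*) echo "no-driver" ;;
    *) echo "unknown" ;;
  esac
}

# Run nvidia-smi (if installed) and classify the result.
check_gpu() {
  local out status
  command -v nvidia-smi >/dev/null || { echo "nvidia-smi not found"; return 1; }
  out=$(nvidia-smi 2>&1) && status=0 || status=$?
  classify_nvidia_smi "$status" "$out"
}
```

After a successful fix, `unattended_upgrade_setting` should print `disabled` and `check_gpu` should print `ok`. A result of `mismatch` points back to the driver reinstall step, and `no-driver` to the CUDA reinstall step. Because other files under /etc/apt/apt.conf.d can override the setting, `apt-config dump | grep Unattended-Upgrade` shows the value apt actually uses.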