This troubleshooting section provides solutions to common issues encountered with Nvidia drivers on Zylon installations.

Why you might encounter issues with Nvidia drivers

To provide the latest GPU capabilities for Zylon, we usually require the latest versions of the Nvidia drivers, which means they have to be compiled on demand for the specific kernel version your hardware is running. At the same time, the open source Nvidia drivers are still under active development, and issues may occasionally arise during installation or at runtime. In particular, after kernel updates or changes to the system configuration, the Nvidia drivers may stop working properly, failing to detect the GPU or causing the AI services to malfunction due to memory usage discrepancies. The following steps describe how to diagnose whether this is the case and how to fix it.
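A quick way to spot the kernel-update case up front is to compare the running kernel with the kernel the Nvidia module was built against. This is a minimal sketch using standard tools; if modinfo cannot find the nvidia module at all, that is itself a sign the driver needs to be reinstalled:
uname -r # Kernel currently running
modinfo nvidia | grep vermagic # Kernel the nvidia module was built for; a mismatch means the module must be rebuilt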

1 - Check the Zylon Status page

Navigate to the Zylon Status page at https://<your_zylon_domain>/status. Check for any errors in the AI Service section, in particular zylon-triton. If the service is not online, continue with the next steps to diagnose the issue. If the service is online but Zylon is still failing, the source is most likely an application-level issue; please contact Zylon support. If you can't access the status page, skip to step 2.
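If the page does not load in a browser but you have shell access to the host, you can check whether it responds at all. A minimal sketch with curl; the -k flag is an assumption in case your installation uses a self-signed certificate:
curl -sk -o /dev/null -w "%{http_code}\n" https://<your_zylon_domain>/status # A 200 response means the page is up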

2 - Verify Nvidia Driver Status

Check nvidia-smi output:
nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:30:00.0 Off |                    0 |
| N/A   38C    P0            104W /  350W |   40673MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           18592      C   tritonserver                            490MiB |
|    0   N/A  N/A           23968      C   VLLM::EngineCore                      38968MiB |
|    0   N/A  N/A           26834      C   VLLM::EngineCore                       1196MiB |
+-----------------------------------------------------------------------------------------+
It should report the GPU status, along with the processes using it.
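For scripted or repeated checks, nvidia-smi can also emit machine-readable output instead of the table above. For example, to query just the fields relevant here:
nvidia-smi --query-gpu=name,driver_version,memory.used,memory.total --format=csv # CSV is easier to parse in scripts than the default table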

2.1 - Drivers not working: nvidia-smi can’t communicate with the NVIDIA driver

This is the most common error, usually caused by a kernel update during an unattended upgrade. To fix it, run the following commands:
sudo zylon-cli install-drivers --force # Reinstall Nvidia drivers
helm uninstall gpu-operator -n nvidia # Uninstall GPU operator
sudo reboot # Reboot the system to clear any cached GPU info
sudo zylon-cli update # Reinstall GPU operator when the system is back online
Wait a few minutes and check the status page again; the issue should be resolved. The driver installation will take 10 to 15 minutes, plus an additional 3 to 5 minutes until Triton comes back online.
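If you prefer not to check manually, you can poll until the driver responds again once the system is back up. A minimal sketch in plain shell:
# Poll every 30 seconds until nvidia-smi can talk to the driver again
until nvidia-smi > /dev/null 2>&1; do
  echo "Driver not ready yet, retrying in 30s..."
  sleep 30
done
echo "Driver is back online"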

2.2 - Drivers working: nvidia-smi is working properly but the Triton service is still failing

If nvidia-smi is working properly but the Triton service is still failing (confirm this on the status page), the cause is usually a caching issue in the GPU autodetection. To fix it, run:
helm uninstall gpu-operator -n nvidia
sudo reboot # Reboot the system to clear any cached GPU info
sudo zylon-cli update # Reinstall GPU operator
Wait a few minutes and check the status page again; the issue should be resolved. Note that in this case Triton might take 3 to 5 minutes to come back online.
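To watch the GPU operator come back after the reboot, you can also inspect its pods directly. A sketch assuming the kubectl bundled with k0s (pod names vary per installation); all pods should eventually reach Running or Completed:
sudo k0s kubectl get pods -n nvidia # List GPU operator pods in the nvidia namespace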

2.3 - Drivers working, but fixes for 2.1 and 2.2 did not work

If nvidia-smi is working but the previous fixes did not help, the issue might lie in the Nvidia Container Toolkit installation. Verify that the file /etc/k0s/containerd.d/nvidia.toml exists and has the following content:
# Allow k0s containerd to use nvidia-container-runtime
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
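A quick way to check both conditions from a shell; this is a minimal sketch using the paths referenced above:
cat /etc/k0s/containerd.d/nvidia.toml # Confirm the drop-in config exists and matches the content above
ls -l /usr/bin/nvidia-container-runtime # Confirm the binary referenced by BinaryName is present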
If the file is missing, you can automatically recreate it by running:
sudo zylon-cli setup
sudo reboot # Reboot the system
Wait a few minutes and check the status page again; the issue should be resolved.