This troubleshooting section provides solutions to common issues encountered with Nvidia drivers on Zylon installations.
Why you might encounter issues with Nvidia drivers
To provide the latest GPU capabilities for Zylon, we usually require the latest Nvidia drivers, which must be compiled on demand for the specific kernel version your hardware is running.
At the same time, the open source Nvidia drivers are still under active development, and issues may occasionally arise during installation or at runtime.
In particular, during kernel updates or changes to the system configuration, Nvidia drivers may stop working properly, failing to detect the GPU or causing the AI services to malfunction due to memory usage discrepancies.
Here are some common ways to diagnose if that is the case and how to fix it.
1 - Check the Zylon Status page
Navigate to the Zylon Status page at https://<your_zylon_domain>/status. Check for any errors in the AI Service section, in particular zylon-triton.
If the service is not online, continue with the next steps to diagnose the issue.
If the system is online but Zylon is still failing, the source is most likely an application-level issue; please contact Zylon support.
If you can’t access the status page, skip to step 2.
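If you are unsure whether the status page itself is unreachable or just reporting errors, a quick scripted probe can help distinguish the two. A minimal sketch, assuming only that a healthy status page answers with HTTP 200; the `status_code` helper name and the domain placeholder are illustrative:

```shell
# Sketch: probe the status page over HTTPS.
# -k: tolerate self-signed certs, -s: silent, -m 5: 5-second timeout
status_code() {
  curl -ksm 5 -o /dev/null -w '%{http_code}' "https://$1/status"
}

code=$(status_code "your_zylon_domain")   # replace with your Zylon domain
if [ "$code" = "200" ]; then
  echo "Status page reachable (HTTP $code)"
else
  echo "Status page unreachable (HTTP ${code:-n/a}); continue with step 2" >&2
fi
```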
2 - Verify Nvidia Driver Status
Check nvidia-smi output:
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03 Driver Version: 575.51.03 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S Off | 00000000:30:00.0 Off | 0 |
| N/A 38C P0 104W / 350W | 40673MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 18592 C tritonserver 490MiB |
| 0 N/A N/A 23968 C VLLM::EngineCore 38968MiB |
| 0 N/A N/A 26834 C VLLM::EngineCore 1196MiB |
+-----------------------------------------------------------------------------------------+
It should report the GPU status, along with the processes using it.
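This check can also be scripted, for example for periodic monitoring. A minimal sketch using nvidia-smi's standard query flags; the function name is ours, not part of Zylon:

```shell
# Sketch: return 0 if nvidia-smi can talk to the driver, non-zero
# otherwise (the failure mode covered in section 2.1).
check_nvidia_driver() {
  if nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv,noheader; then
    return 0
  else
    echo "NVIDIA driver is not responding; see section 2.1" >&2
    return 1
  fi
}

check_nvidia_driver && echo "driver OK" || echo "driver DOWN"
```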
2.1 - Drivers not working: nvidia-smi can’t communicate with the NVIDIA driver
This is the most common error, usually caused by a kernel update during an unattended upgrade. The fix requires reinstalling the Nvidia drivers and the GPU operator.
Run the following commands:
sudo zylon-cli install-drivers --force # Reinstall Nvidia drivers
helm uninstall gpu-operator -n nvidia # Uninstall GPU operator
sudo reboot # Reboot the system to clear any cached GPU info
sudo zylon-cli sync # Reinstall GPU operator when the system is back online
Wait a few minutes and check the status page again; the issue should be resolved. The driver installation will take 10 to 15 minutes, plus an additional 3~5 minutes until Triton comes back online.
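Rather than re-checking by hand, you can poll until the driver answers again. A minimal sketch; the function name and default timeout (1200s, matching the 10-15 minute install window) are illustrative:

```shell
# Sketch: poll nvidia-smi until the driver responds, or give up
# after a timeout (first argument, seconds; default 1200).
wait_for_driver() {
  deadline=$(( $(date +%s) + ${1:-1200} ))
  until nvidia-smi >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "Timed out waiting for the NVIDIA driver" >&2
      return 1
    fi
    sleep 30
  done
  echo "NVIDIA driver is back online"
}

# Usage (on the server, after the reboot):
# wait_for_driver 1200
```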
2.2 - Drivers working: nvidia-smi is working properly but Triton service still failing
If nvidia-smi is working properly but the Triton service is still failing (confirm this on the status page), the cause is usually a caching issue in GPU autodetection. To fix it, run:
helm uninstall gpu-operator -n nvidia
sudo reboot # Reboot the system to clear any cached GPU info
sudo zylon-cli sync # Reinstall GPU operator
Wait a few minutes and check the status page again; the issue should be resolved. Note that in this case Triton might take 3~5 minutes to come back online.
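To see whether the GPU operator has finished redeploying, you can list the pods in the nvidia namespace (the namespace comes from the helm command above). This is a sketch: the helper name is ours, and on a k0s node you may need `sudo k0s kubectl` instead of plain `kubectl`:

```shell
# Sketch: list GPU operator pods that are not yet Running/Completed.
# Column 3 of `kubectl get pods --no-headers` is the STATUS column.
not_running_pods() {
  kubectl get pods -n nvidia --no-headers 2>/dev/null \
    | awk '$3 != "Running" && $3 != "Completed" {print $1, $3}'
}

not_running_pods   # empty output means everything is up
```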
2.3 - Drivers working, but fixes for 2.1 and 2.2 did not work
If nvidia-smi is working but the previous fixes did not work, the issue might be located in the Nvidia Container Toolkit installation.
Verify that the file /etc/k0s/containerd.d/nvidia.toml exists and has the following content:
# Allow k0s containerd to use nvidia-container-runtime
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
If the file is missing, you can automatically recreate it by running:
sudo zylon-cli install <desired version>
sudo reboot # Reboot the system
Wait a few minutes and check the status page again, the issue should be resolved.
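As a final check for this scenario, you can verify both the containerd drop-in file and the runtime binary it references. A minimal sketch; the helper name is ours, and the paths are the ones shown in the nvidia.toml above:

```shell
# Sketch: confirm the containerd drop-in and the NVIDIA runtime
# binary it points to are both in place.
TOML=/etc/k0s/containerd.d/nvidia.toml
[ -f "$TOML" ] && echo "found: $TOML" || echo "missing: $TOML" >&2

check_runtime_binary() {
  if [ -x "$1" ]; then
    echo "found: $1"
  else
    echo "missing: $1 (Nvidia Container Toolkit may not be installed)" >&2
    return 1
  fi
}

check_runtime_binary /usr/bin/nvidia-container-runtime || true
```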