Commit 129421e

Merge pull request #11 from balena-io-examples/alanb128-2024-patch2
Add troubleshooting section and change artifact
2 parents 8c8b320 + e64d804 commit 129421e

3 files changed, with 42 additions and 1 deletion


README.md

Lines changed: 41 additions & 0 deletions
@@ -1,6 +1,8 @@
 # nvidia-x86 on balena
 Example of using an Nvidia GPU in an x86 device on the balena platform. See the accompanying [blog post](https://www.balena.io/blog/how-to-use-nvidia-gpu-on-x86-device-balenaOS/) for more details!

+<img src="video_balena_x86.jpg">
+
 Note that although these examples should work as-is, the resulting images are quite large and should be optimized for your particular use case. One possibility is to utilize [multistage builds](https://www.balena.io/docs/learn/deploy/build-optimization/#multi-stage-builds) to reduce the size of your containers. Below is a summary of the containers in this project, with all of the details following in the next section.

 | Service | Image Size | Description |
@@ -14,6 +16,8 @@ Note that although these examples should work as-is, the resulting images are qu
 ### gpu container
 This is the main container in this example and the only one you need to obtain GPU access from within your container or for any other containers in the application. It downloads the kernel source files for our exact OS version and uses them, along with the driver file downloaded from Nvidia, to build the required Nvidia kernel modules. Finally, the `entry.sh` file unloads the current Nouveau driver if it's running and loads the Nvidia modules.

+In some cases, the device may have trouble unloading the Nouveau driver. See [this post](https://forums.balena.io/t/blacklist-drivers-in-host-os/163437/25) for an alternate script that may be helpful.
+
 This container also provides CUDA compiled application support, though not development support - see the CUDA container example for development use. You could use this image as a base image, build on top of the current example, or use it alongside other containers to provide them with GPU access.

 Before using this example, you'll need to make sure that you've set the variables at the top of the Dockerfile:
@@ -22,6 +26,8 @@ Before using this example, you'll need to make sure that you've set the variable
 - `YOCTO_VERSION` is the version of Yocto Linux used to build your version of balenaOS. You can find it by logging into your host OS and typing: `uname -r`
 - `NVIDIA_DRIVER_VERSION` is the version of the Nvidia driver you want to download and build using the list found [here](https://www.nvidia.com/en-us/drivers/unix/). Usually, you can use the "Latest Production Branch Version". Be sure to use the exact same driver version in any other containers that need access to the GPU.

+NOTE: If you are trying to use this example with balenaOS < 3.0, change any occurrences of `kernel_modules_headers` (such as in line 30 of the gpu Dockerfile) to `kernel-source`.
+
 If this container is set up and running properly, you should see the output below (for your GPU model) in the terminal:
 ```
 +-----------------------------------------------------------------------------+
@@ -122,3 +128,38 @@ This is an example of using a pre-built container from the [Nvidia NGC Catalog](
 nv-pytorch Torch CUDA device count: 1
 nv-pytorch Torch CUDA device name: Quadro P400
 ```
+
+## Troubleshooting
+
+- If you see errors such as:
+```
+gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia.ko: No such device
+gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-modeset.ko: Unknown symbol in module
+gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-uvm.ko: Unknown symbol in module
+gpu NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
+```
+It's likely the driver version you specified is not compatible with your hardware. On [this page](https://www.nvidia.com/en-us/drivers/unix/), find your `NVIDIA_DRIVER_VERSION` in the "Linux x86_64/AMD64/EM64T" section, click the link for that driver version, and confirm that your NVIDIA GPU is listed under the "Supported Products" tab.
+
+- If you see these errors:
+```
+gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia.ko: Invalid module format
+gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-modeset.ko: Invalid module format
+gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-uvm.ko: Invalid module format
+```
+This usually indicates a mismatch between the balenaOS version specified by the `VERSION` variable and the Yocto version of the OS specified by the `YOCTO_KERNEL` variable. Please re-check the variable values per the definitions in the table above.
+
+- If you see these errors:
+```
+gpu rmmod: ERROR: Module nouveau is not currently loaded
+gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia.ko: File exists
+gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-modeset.ko: File exists
+gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-uvm.ko: File exists
+```
+As long as the proper GPU information output (shown above in the gpu container section) is displayed, you can generally ignore these messages. They usually occur when you have pushed a new release and the startup scripts run more than once. Rebooting should clear the errors.
+
+- If you see the following error or similar during the build process:
+```
+The command '/bin/bash -o pipefail -c curl -fsSL "https://files.balena-cloud.com/images/${BALENA_MACHINE_NAME}/${VERSION}/kernel_modules_headers.tar.gz" | tar xz --strip-components=2 && make -C build modules_prepare -j"$(nproc)"' returned a non-zero code: 2
+```
+
+Starting in balenaOS 3.0, the `kernel-source` kernel headers artifact was renamed to `kernel-modules-headers`. If you are trying to use this example with balenaOS < 3.0, change any occurrences of `kernel_modules_headers` (such as in line 30 of the gpu Dockerfile) to `kernel-source`.
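
The troubleshooting cases added in this hunk amount to a lookup from an `insmod` error string to a likely cause. A minimal shell sketch of that lookup follows, assuming the gpu container log is available as a string; the function name `diagnose_gpu_log` is hypothetical and not part of this repository, and the hints only paraphrase the README text:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): map the insmod errors listed in
# the Troubleshooting section to the likely cause the README gives for each.
diagnose_gpu_log() {
    local log="$1"
    case "$log" in
        *"No such device"*)
            # README: driver version probably does not support this GPU.
            echo "driver likely incompatible with this GPU; check NVIDIA_DRIVER_VERSION" ;;
        *"Invalid module format"*)
            # README: modules were built against a different kernel.
            echo "likely mismatch between VERSION and YOCTO_KERNEL; re-check both" ;;
        *"File exists"*)
            # README: startup script ran more than once; usually harmless.
            echo "modules already loaded; ignorable if GPU info is shown, reboot to clear" ;;
        *)
            echo "no known pattern matched" ;;
    esac
}
```

One possible use is `diagnose_gpu_log "$(cat gpu.log)"`, where `gpu.log` is an assumed file holding the saved container output. The patterns are checked in order, so a log that contains several kinds of error reports only the first match.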

gpu/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ SHELL ["/bin/bash", "-o", "pipefail", "-c"]

 # Download the kernel source then prepare kernel source to build a module.
 RUN \
-curl -fsSL "https://files.balena-cloud.com/images/${BALENA_MACHINE_NAME}/${VERSION}/kernel_source.tar.gz" \
+curl -fsSL "https://files.balena-cloud.com/images/${BALENA_MACHINE_NAME}/${VERSION}/kernel-modules-headers.tar.gz" \
 | tar xz --strip-components=2 && \
 make -C build modules_prepare -j"$(nproc)"

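
Since the artifact name depends on the balenaOS major version, the choice could be derived from `VERSION` rather than edited by hand. The sketch below is one way to do that in shell, under stated assumptions: both function names are hypothetical, `genericx86-64-ext` is only an example machine name, and the file names are taken from the removed line of this hunk (`kernel_source`) and from the build error quoted in the README (`kernel_modules_headers`). This commit spells the new artifact with both underscores and hyphens, so confirm the exact name on the file server before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of this repo): choose the kernel headers
# artifact for a given balenaOS version. The artifact spellings below are
# assumptions; verify them against files.balena-cloud.com.
kernel_artifact_name() {
    local version="$1"
    version="${version#v}"          # tolerate a leading "v", e.g. "v3.0.5"
    local major="${version%%.*}"    # "2.113.18" -> "2"
    if [ "$major" -ge 3 ]; then
        echo "kernel_modules_headers"   # balenaOS 3.0 and later
    else
        echo "kernel_source"            # balenaOS before 3.0
    fi
}

kernel_artifact_url() {
    local machine="$1"
    local version="$2"
    echo "https://files.balena-cloud.com/images/${machine}/${version}/$(kernel_artifact_name "$version").tar.gz"
}
```

For example, `kernel_artifact_url genericx86-64-ext 2.113.18` builds a URL ending in `kernel_source.tar.gz`, while any 3.x or later version yields the renamed artifact.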
video_balena_x86.jpg

196 KB