# nvidia-x86 on balena
Example of using an Nvidia GPU in an x86 device on the balena platform. See the accompanying [blog post](https://www.balena.io/blog/how-to-use-nvidia-gpu-on-x86-device-balenaOS/) for more details!
<img src="video_balena_x86.jpg">
Note that although these examples should work as-is, the resulting images are quite large and should be optimized for your particular use case. One possibility is to utilize [multistage builds](https://www.balena.io/docs/learn/deploy/build-optimization/#multi-stage-builds) to reduce the size of your containers. Below is a summary of the containers in this project, with all of the details following in the next section.
| Service | Image Size | Description |
### gpu container
This is the main container in this example and the only one you need in order to obtain GPU access from within your own container or from any other containers in the application. It downloads the kernel source files for our exact OS version and uses them, along with the driver file downloaded from Nvidia, to build the required Nvidia kernel modules. Finally, the `entry.sh` file unloads the current Nouveau driver if it's running and loads the Nvidia modules.
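The driver-swap step described above can be sketched roughly as follows. This is a hypothetical illustration, not the repo's actual `entry.sh`; the `/nvidia/driver` module paths and the `DRY_RUN` switch are invented for this sketch:

```shell
#!/bin/sh
# Hypothetical sketch of an entry script that swaps Nouveau for the
# freshly built Nvidia modules. Paths are assumptions, not the repo's
# exact layout. DRY_RUN=1 prints the plan instead of touching the kernel.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

load_nvidia_modules() {
  # Unload Nouveau if the kernel claimed the GPU at boot.
  if lsmod 2>/dev/null | grep -q '^nouveau'; then
    run rmmod nouveau
  fi
  # Load the modules built against the exact balenaOS kernel,
  # in dependency order.
  run insmod /nvidia/driver/nvidia.ko
  run insmod /nvidia/driver/nvidia-modeset.ko
  run insmod /nvidia/driver/nvidia-uvm.ko
}

# Inspect the plan without root or a GPU:
DRY_RUN=1 load_nvidia_modules
```

The dry-run mode is just a convenience for seeing the load order; in the container the real commands need to run as root with kernel module capabilities.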
In some cases, the device may have trouble unloading the Nouveau driver. See [this post](https://forums.balena.io/t/blacklist-drivers-in-host-os/163437/25) for an alternate script that may be helpful.
This container also supports running compiled CUDA applications, though not CUDA development - see the CUDA container example for development use. You could use this image as a base image, build on top of the current example, or run it alongside other containers to provide them with GPU access.
Before using this example, you'll need to make sure that you've set the variables at the top of the Dockerfile:
- `YOCTO_VERSION` is the version of Yocto Linux used to build your version of balenaOS. You can find it by logging into your host OS and typing: `uname -r`
- `NVIDIA_DRIVER_VERSION` is the version of the Nvidia driver you want to download and build, chosen from the list found [here](https://www.nvidia.com/en-us/drivers/unix/). Usually, you can use the "Latest Production Branch Version". Be sure to use the exact same driver version in any other containers that need access to the GPU.
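Taken together, the variables might look like this at the top of the gpu Dockerfile. This is a sketch only: the variable names come from this README, but the values below are invented examples and must be replaced with your own device's values:

```Dockerfile
# Example values only - substitute your device's actual versions.

# balenaOS version running on your device
ARG VERSION="2.113.18"
# kernel version reported by `uname -r` in the host OS
ARG YOCTO_VERSION="5.10.43"
# driver version chosen from https://www.nvidia.com/en-us/drivers/unix/
ARG NVIDIA_DRIVER_VERSION="535.129.03"
```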
NOTE: If you are trying to use this example with balenaOS < 3.0, change any occurrences of `kernel_modules_headers` (such as in line 30 of the gpu Dockerfile) to `kernel-source`.
If this container is set up and running properly, you should see the output below (for your GPU model) in the terminal:
```
nv-pytorch Torch CUDA device count: 1
nv-pytorch Torch CUDA device name: Quadro P400
```
## Troubleshooting
- If you see errors such as:
```
gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia.ko: No such device
gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-modeset.ko: Unknown symbol in module
gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-uvm.ko: Unknown symbol in module
gpu NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
```
It's likely the driver version you specified is not compatible with your hardware. On [this page](https://www.nvidia.com/en-us/drivers/unix/), click the `NVIDIA_DRIVER_VERSION` you specified in the "Linux x86_64/AMD64/EM64T" section and confirm that your NVIDIA GPU appears under the "Supported Products" tab.
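When checking the "Supported Products" list, it can help to know your GPU's PCI vendor:device ID. A hypothetical helper (not part of this repo) for pulling it out of `lspci -nn` output:

```shell
# Hypothetical helper: extract the PCI vendor:device ID from an
# `lspci -nn` line so it can be matched against the driver's
# "Supported Products" list.
extract_pci_id() {
  sed -n 's/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p'
}

# On a live device: lspci -nn | grep -i nvidia | extract_pci_id
# Sample line shown here for illustration:
echo '01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107GL [Quadro P400] [10de:1cb3]' \
  | extract_pci_id
# → 10de:1cb3
```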
- If you see these errors:
```
gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia.ko: Invalid module format
gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-modeset.ko: Invalid module format
gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-uvm.ko: Invalid module format
```
It usually indicates a mismatch between the balenaOS version specified by the `VERSION` variable and the Yocto version of the OS specified by the `YOCTO_KERNEL` variable. Please re-check the variable values per the definitions in the table above.
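A simple sanity check for that mismatch can be sketched as follows. The function name and values are invented for illustration; on a real device you would compare your Dockerfile variable against `uname -r` from the host OS:

```shell
# Hypothetical check: compare the kernel version the modules were
# built for against the kernel the host OS is actually running.
check_kernel_match() {
  expected="$1"  # value set in the Dockerfile variable
  actual="$2"    # output of `uname -r` in the host OS
  if [ "$expected" = "$actual" ]; then
    echo "MATCH"
  else
    echo "MISMATCH: built for $expected, host runs $actual"
  fi
}

check_kernel_match "5.10.43" "5.10.43"
# → MATCH
```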
- If you see these errors:
```
gpu rmmod: ERROR: Module nouveau is not currently loaded
gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia.ko: File exists
gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-modeset.ko: File exists
gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-uvm.ko: File exists
```
As long as the proper GPU information output (shown above in the gpu container section) is displayed, you can generally ignore these messages. They usually occur when you have pushed a new release and the startup scripts have run more than once. Rebooting should clear the error.
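One way to avoid the "File exists" noise entirely is to check whether a module is already resident before calling `insmod`. This guard is a hypothetical sketch, not part of this repo:

```shell
# Hypothetical guard: skip insmod when the module is already loaded,
# avoiding the harmless "File exists" errors after a restart.
module_loaded() {
  # Reads lsmod-style output on stdin; $1 is the module name.
  grep -q "^$1 "
}

# Sample lsmod output for illustration:
sample='Module                  Size  Used by
nvidia_uvm           1048576  0
nvidia              39084032  1'

if printf '%s\n' "$sample" | module_loaded nvidia; then
  echo "nvidia already loaded; skipping insmod"
fi
```

On a live device the condition would be `lsmod | module_loaded nvidia` instead of the sample text.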
- If you see the following error or similar during the build process:
```
The command '/bin/bash -o pipefail -c curl -fsSL "https://files.balena-cloud.com/images/${BALENA_MACHINE_NAME}/${VERSION}/kernel_modules_headers.tar.gz" | tar xz --strip-components=2 && make -C build modules_prepare -j"$(nproc)"' returned a non-zero code: 2
```
Starting in balenaOS 3.0, the `kernel-source` kernel headers artifact was renamed to `kernel_modules_headers`. If you are trying to use this example with balenaOS < 3.0, change any occurrences of `kernel_modules_headers` (such as in line 30 of the gpu Dockerfile) to `kernel-source`.
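That substitution can be done with a one-line `sed` (sketch; adjust the path to wherever your gpu Dockerfile lives): `sed -i 's/kernel_modules_headers/kernel-source/g' gpu/Dockerfile`. Its effect on the download URL, shown on a sample string:

```shell
# Demonstrate the rename on the download URL (single quotes keep the
# ${...} build variables literal).
echo 'https://files.balena-cloud.com/images/${BALENA_MACHINE_NAME}/${VERSION}/kernel_modules_headers.tar.gz' \
  | sed 's/kernel_modules_headers/kernel-source/'
# → https://files.balena-cloud.com/images/${BALENA_MACHINE_NAME}/${VERSION}/kernel-source.tar.gz
```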