
Commit 3feac74

ravi9wine99 authored and committed
Update OV dockerfile to use OV2025.3 and update build docs
1 parent 4c280cc commit 3feac74

2 files changed: +63 −4 lines

.devops/openvino.Dockerfile

Lines changed: 2 additions & 2 deletions
````diff
@@ -1,5 +1,5 @@
-ARG OPENVINO_VERSION_MAJOR=2025.2
-ARG OPENVINO_VERSION_FULL=2025.2.0.19140.c01cd93e24d
+ARG OPENVINO_VERSION_MAJOR=2025.3
+ARG OPENVINO_VERSION_FULL=2025.3.0.19807.44526285f24
 ARG UBUNTU_VERSION=24.04

 # Optional proxy build arguments - empty by default
````
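
Because the versions are declared as `ARG`s, they can also be overridden at image build time without editing the Dockerfile. A minimal sketch follows; the values shown are this commit's pins, and overriding them only works if the corresponding OpenVINO archive is actually published (the standard `--build-arg` mechanism itself is plain Docker behavior, not something specific to this repo).

```bash
# Sketch: override the pinned OpenVINO version at build time via --build-arg.
# The values below are this commit's defaults; any other pair must correspond
# to a published OpenVINO archive or the build will fail at download time.
docker build \
  --build-arg OPENVINO_VERSION_MAJOR=2025.3 \
  --build-arg OPENVINO_VERSION_FULL=2025.3.0.19807.44526285f24 \
  -t llama-openvino:base -f .devops/openvino.Dockerfile .
```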

docs/build.md

Lines changed: 61 additions & 2 deletions
````diff
@@ -614,7 +614,7 @@ Follow the instructions below to install OpenVINO runtime and build llama.cpp wi
 - Follow the guide to install OpenVINO Runtime from an archive file: [Linux](https://docs.openvino.ai/2025/get-started/install-openvino/install-openvino-archive-linux.html) | [Windows](https://docs.openvino.ai/2025/get-started/install-openvino/install-openvino-archive-windows.html)

 <details>
-<summary>📦 Click to expand OpenVINO 2025.3 installation on Ubuntu</summary>
+<summary>📦 Click to expand OpenVINO 2025.3 installation from an archive file on Ubuntu</summary>
 <br>

 ```bash
````
````diff
@@ -700,9 +700,68 @@ Control OpenVINO behavior using these environment variables:
 export GGML_OPENVINO_CACHE_DIR=/tmp/ov_cache
 export GGML_OPENVINO_PROFILING=1

-./build/ReleaseOV/bin/llama-simple -m ~/models/Llama-3.2-1B-Instruct.fp16.gguf -n 50 "The story of AI is "
+GGML_OPENVINO_DEVICE=GPU ./build/ReleaseOV/bin/llama-simple -m ~/models/Llama-3.2-1B-Instruct.fp16.gguf -n 50 "The story of AI is "
+```
+
+### Build llama.cpp with OpenVINO Backend using Docker
+You can build and run llama.cpp with the OpenVINO backend using Docker.
+
+```bash
+# Build the base runtime image with compiled shared libraries and minimal dependencies.
+docker build -t llama-openvino:base -f .devops/openvino.Dockerfile .
+
+# Build the complete image with all binaries, Python tools, the gguf-py library, and model conversion utilities.
+docker build --target=full -t llama-openvino:full -f .devops/openvino.Dockerfile .
+
+# Build a minimal CLI-only image containing just the llama-cli executable.
+docker build --target=light -t llama-openvino:light -f .devops/openvino.Dockerfile .
+
+# Build a server-only image with the llama-server executable, health check endpoint, and REST API support.
+docker build --target=server -t llama-openvino:server -f .devops/openvino.Dockerfile .
+
+# If you are behind a proxy:
+docker build --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy --target=light -t llama-openvino:light -f .devops/openvino.Dockerfile .
 ```

+Run the llama.cpp OpenVINO backend Docker container.
+Save sample models in `~/models` as [shown above](#3-download-sample-model); this directory is mounted into the container in the examples below.
+
+```bash
+# Run the Docker container
+docker run --rm -it -v ~/models:/models llama-openvino:light --no-warmup -m /models/Llama-3.2-1B-Instruct.fp16.gguf
+
+# With Intel GPU access (iGPU or dGPU)
+docker run --rm -it -v ~/models:/models \
+  --device=/dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
+  llama-openvino:light --no-warmup -m /models/Llama-3.2-1B-Instruct.fp16.gguf
+
+# With Intel NPU access
+docker run --rm -it --env GGML_OPENVINO_DEVICE=NPU -v ~/models:/models \
+  --device=/dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
+  llama-openvino:light --no-warmup -m /models/Llama-3.2-1B-Instruct.fp16.gguf
+```
+
+Run the llama.cpp server with the OpenVINO backend:
+```bash
+# Run the server Docker container
+docker run --rm -it -p 8080:8080 -v ~/models:/models llama-openvino:server --no-warmup -m /models/Llama-3.2-1B-Instruct.fp16.gguf
+
+# In a NEW terminal, test the server with curl
+
+# If you are behind a proxy, set NO_PROXY so that requests to localhost bypass it
+export NO_PROXY=localhost,127.0.0.1
+
+# Test the health endpoint
+curl -f http://localhost:8080/health
+
+# Test with a simple prompt
+curl -X POST "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" \
+  -d '{"messages":[{"role":"user","content":"Write a poem about OpenVINO"}],"max_tokens":100}' | jq .
+```
+
+---
 ## Notes about GPU-accelerated backends

 The GPU may still be used to accelerate some parts of the computation even when using the `-ngl 0` option. You can fully disable GPU acceleration by using `--device none`.
````
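
To make that note concrete, here is a minimal sketch. It assumes the `build/ReleaseOV` tree and sample model used elsewhere in `docs/build.md`, and that `llama-cli` was built alongside `llama-simple`; `-ngl` and `--device none` are the options the note itself refers to.

```bash
# -ngl 0 keeps all layers on the CPU, but the backend may still use the GPU for some work.
./build/ReleaseOV/bin/llama-cli -m ~/models/Llama-3.2-1B-Instruct.fp16.gguf -ngl 0 -n 50 -p "The story of AI is "

# --device none fully disables GPU acceleration.
./build/ReleaseOV/bin/llama-cli -m ~/models/Llama-3.2-1B-Instruct.fp16.gguf --device none -n 50 -p "The story of AI is "
```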
