Conversation

@FuchtelJockel

Can't test/add ROCm because I have no compatible GPU.


@mcowger left a comment


I converted the buildah script to a standard Dockerfile (below). After adding in libcurl, it does start and downloads a model, but loading the model fails. The same model runs fine under llama.cpp in Docker on this machine.

Most interesting is the stage at which it begins the tensor load:

DEBUG:    LLAMA SERVER GPU: load_tensors: loading model tensors, this can take a while... (mmap = true)
DEBUG:    LLAMA SERVER GPU: llama_model_load: error loading model: make_cpu_buft_list: no CPU backend found

I've never seen this error from llama.cpp before: make_cpu_buft_list: no CPU backend found
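That error suggests llama-server did not register a CPU backend device at all. A minimal sanity check, run inside the container and assuming the extraction path shown in the log below, would be to verify that the CPU backend library from the release zip is present and that its shared-library dependencies resolve:

# Paths taken from the log below; adjust if your layout differs
BIN=/opt/venv/bin/vulkan/llama_server/build/bin
ls "$BIN" | grep -i ggml        # expect a libggml-cpu*.so next to libggml-vulkan.so
ldd "$BIN"/libggml-cpu*.so      # any "not found" entry points at a missing system library

Full startup log: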

INFO:     Loading llm: Qwen3-0.6B-GGUF
INFO:     Using backend: vulkan
INFO:     Downloading llama.cpp server from https://github.com/ggml-org/llama.cpp/releases/download/b6097/llama-b6097-bin-ubuntu-vulkan-x64.zip
DEBUG:    Starting new HTTPS connection (1): github.com:443
DEBUG:    https://github.com:443 "GET /ggml-org/llama.cpp/releases/download/b6097/llama-b6097-bin-ubuntu-vulkan-x64.zip HTTP/1.1" 302 0
DEBUG:    Starting new HTTPS connection (1): release-assets.githubusercontent.com:443
DEBUG:    https://release-assets.githubusercontent.com:443 "GET /github-production-release-asset/612354784/7049d9fd-7769-4f66-953d-584aaadce81c?sp=r&sv=2018-11-09&sr=b&spr=https&se=2025-08-25T07%3A22%3A02Z&rscd=attachment%3B+filename%3Dllama-b6097-bin-ubuntu-vulkan-x64.zip&rsct=application%2Foctet-stream&skoid=96c2d410-5711-43a1-aedd-ab1947aa7ab0&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skt=2025-08-25T06%3A21%3A11Z&ske=2025-08-25T07%3A22%3A02Z&sks=b&skv=2018-11-09&sig=sA9x1qN2%2FPVDlHgu9K8D28FdMo%2FIGlBGiSXvV6szdgM%3D&jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmVsZWFzZS1hc3NldHMuZ2l0aHVidXNlcmNvbnRlbnQuY29tIiwia2V5Ijoia2V5MSIsImV4cCI6MTc1NjEwMzgwMCwibmJmIjoxNzU2MTAzNTAwLCJwYXRoIjoicmVsZWFzZWFzc2V0cHJvZHVjdGlvbi5ibG9iLmNvcmUud2luZG93cy5uZXQifQ.kg9u9BjrIuiqD-P4vre6zn3kqOu_D3uKR6bGn9xlcxk&response-content-disposition=attachment%3B%20filename%3Dllama-b6097-bin-ubuntu-vulkan-x64.zip&response-content-type=application%2Foctet-stream HTTP/1.1" 200 22417582
INFO:     Extracting llama-b6097-bin-ubuntu-vulkan-x64.zip to /opt/venv/bin/vulkan/llama_server
INFO:     Set executable permissions for /opt/venv/bin/vulkan/llama_server/build/bin/llama-server
INFO:     Set executable permissions for /opt/venv/bin/vulkan/llama_server/build/bin/llama-cli
DEBUG:    https://huggingface.co:443 "GET /api/models/unsloth/Qwen3-0.6B-GGUF/tree/main?recursive=True&expand=False HTTP/1.1" 200 8793
DEBUG:    https://huggingface.co:443 "GET /api/models/unsloth/Qwen3-0.6B-GGUF/revision/main HTTP/1.1" 200 7869
Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 1764.54it/s]
DEBUG:    GGUF file paths: {'variant': '/srv/lemonade/hub/models--unsloth--Qwen3-0.6B-GGUF/snapshots/50968a4468ef4233ed78cd7c3de230dd1d61a56b/Qwen3-0.6B-Q4_0.gguf'}
DEBUG:    Set LD_LIBRARY_PATH to /opt/venv/bin/vulkan/llama_server/build/bin
DEBUG:    Starting new HTTP connection (1): localhost:49871
DEBUG:    Not able to connect to llama-server yet, will retry
DEBUG:    LLAMA SERVER GPU: load_backend: loaded RPC backend from /srv/lemonade/bin/vulkan/llama_server/build/bin/libggml-rpc.so
DEBUG:    LLAMA SERVER GPU: ggml_vulkan: Found 1 Vulkan devices:
INFO:     GPU acceleration active: 1 device(s) detected by llama-server
DEBUG:    LLAMA SERVER GPU: ggml_vulkan: 0 = AMD Radeon Graphics (RADV PHOENIX) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
DEBUG:    LLAMA SERVER GPU: load_backend: loaded Vulkan backend from /srv/lemonade/bin/vulkan/llama_server/build/bin/libggml-vulkan.so
DEBUG:    LLAMA SERVER GPU: build: 6097 (9515c613) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
DEBUG:    LLAMA SERVER GPU: system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
DEBUG:    LLAMA SERVER GPU:
DEBUG:    LLAMA SERVER GPU: system_info: n_threads = 8 (n_threads_batch = 8) / 16 |
DEBUG:    LLAMA SERVER GPU:
DEBUG:    LLAMA SERVER GPU: main: binding port with default address family
DEBUG:    LLAMA SERVER GPU: main: HTTP server is listening, hostname: 127.0.0.1, port: 49871, http threads: 15
DEBUG:    LLAMA SERVER GPU: main: loading model
DEBUG:    LLAMA SERVER GPU: srv    load_model: loading model '/srv/lemonade/hub/models--unsloth--Qwen3-0.6B-GGUF/snapshots/50968a4468ef4233ed78cd7c3de230dd1d61a56b/Qwen3-0.6B-Q4_0.gguf'
DEBUG:    LLAMA SERVER GPU: llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon Graphics (RADV PHOENIX)) - 50682 MiB free
DEBUG:    LLAMA SERVER GPU: llama_model_loader: loaded meta data with 32 key-value pairs and 310 tensors from /srv/lemonade/hub/models--unsloth--Qwen3-0.6B-GGUF/snapshots/50968a4468ef4233ed78cd7c3de230dd1d61a56b/Qwen3-0.6B-Q4_0.gguf (version GGUF V3 (latest))
DEBUG:    LLAMA SERVER GPU: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv   0:                       general.architecture str              = qwen3
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv   1:                               general.type str              = model
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv   2:                               general.name str              = Qwen3-0.6B
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv   3:                           general.basename str              = Qwen3-0.6B
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv   4:                       general.quantized_by str              = Unsloth
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv   5:                         general.size_label str              = 0.6B
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv   6:                           general.repo_url str              = https://huggingface.co/unsloth
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv   7:                          qwen3.block_count u32              = 28
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv   8:                       qwen3.context_length u32              = 40960
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv   9:                     qwen3.embedding_length u32              = 1024
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  10:                  qwen3.feed_forward_length u32              = 3072
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  11:                 qwen3.attention.head_count u32              = 16
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  12:              qwen3.attention.head_count_kv u32              = 8
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  13:                       qwen3.rope.freq_base f32              = 1000000.000000
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  14:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  15:                 qwen3.attention.key_length u32              = 128
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  16:               qwen3.attention.value_length u32              = 128
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = qwen2
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 151645
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 151654
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  26:               general.quantization_version u32              = 2
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  27:                          general.file_type u32              = 2
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  28:                      quantize.imatrix.file str              = Qwen3-0.6B-GGUF/imatrix_unsloth.dat
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  29:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-0.6B.txt
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  30:             quantize.imatrix.entries_count u32              = 196
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - kv  31:              quantize.imatrix.chunks_count u32              = 688
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - type  f32:  113 tensors
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - type q4_0:  193 tensors
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - type q4_1:    3 tensors
DEBUG:    LLAMA SERVER GPU: llama_model_loader: - type q6_K:    1 tensors
DEBUG:    LLAMA SERVER GPU: print_info: file format = GGUF V3 (latest)
DEBUG:    LLAMA SERVER GPU: print_info: file type   = Q4_0
DEBUG:    LLAMA SERVER GPU: print_info: file size   = 358.78 MiB (5.05 BPW)
DEBUG:    LLAMA SERVER GPU: load: printing all EOG tokens:
DEBUG:    LLAMA SERVER GPU: load:   - 151643 ('<|endoftext|>')
DEBUG:    LLAMA SERVER GPU: load:   - 151645 ('<|im_end|>')
DEBUG:    LLAMA SERVER GPU: load:   - 151662 ('<|fim_pad|>')
DEBUG:    LLAMA SERVER GPU: load:   - 151663 ('<|repo_name|>')
DEBUG:    LLAMA SERVER GPU: load:   - 151664 ('<|file_sep|>')
DEBUG:    LLAMA SERVER GPU: load: special tokens cache size = 26
DEBUG:    LLAMA SERVER GPU: load: token to piece cache size = 0.9311 MB
DEBUG:    LLAMA SERVER GPU: print_info: arch             = qwen3
DEBUG:    LLAMA SERVER GPU: print_info: vocab_only       = 0
DEBUG:    LLAMA SERVER GPU: print_info: n_ctx_train      = 40960
DEBUG:    LLAMA SERVER GPU: print_info: n_embd           = 1024
DEBUG:    LLAMA SERVER GPU: print_info: n_layer          = 28
DEBUG:    LLAMA SERVER GPU: print_info: n_head           = 16
DEBUG:    LLAMA SERVER GPU: print_info: n_head_kv        = 8
DEBUG:    LLAMA SERVER GPU: print_info: n_rot            = 128
DEBUG:    LLAMA SERVER GPU: print_info: n_swa            = 0
DEBUG:    LLAMA SERVER GPU: print_info: is_swa_any       = 0
DEBUG:    LLAMA SERVER GPU: print_info: n_embd_head_k    = 128
DEBUG:    LLAMA SERVER GPU: print_info: n_embd_head_v    = 128
DEBUG:    LLAMA SERVER GPU: print_info: n_gqa            = 2
DEBUG:    LLAMA SERVER GPU: print_info: n_embd_k_gqa     = 1024
DEBUG:    LLAMA SERVER GPU: print_info: n_embd_v_gqa     = 1024
DEBUG:    LLAMA SERVER GPU: print_info: f_norm_eps       = 0.0e+00
DEBUG:    LLAMA SERVER GPU: print_info: f_norm_rms_eps   = 1.0e-06
DEBUG:    LLAMA SERVER GPU: print_info: f_clamp_kqv      = 0.0e+00
DEBUG:    LLAMA SERVER GPU: print_info: f_max_alibi_bias = 0.0e+00
DEBUG:    LLAMA SERVER GPU: print_info: f_logit_scale    = 0.0e+00
DEBUG:    LLAMA SERVER GPU: print_info: f_attn_scale     = 0.0e+00
DEBUG:    LLAMA SERVER GPU: print_info: n_ff             = 3072
DEBUG:    LLAMA SERVER GPU: print_info: n_expert         = 0
DEBUG:    LLAMA SERVER GPU: print_info: n_expert_used    = 0
DEBUG:    LLAMA SERVER GPU: print_info: causal attn      = 1
DEBUG:    LLAMA SERVER GPU: print_info: pooling type     = -1
DEBUG:    LLAMA SERVER GPU: print_info: rope type        = 2
DEBUG:    LLAMA SERVER GPU: print_info: rope scaling     = linear
DEBUG:    LLAMA SERVER GPU: print_info: freq_base_train  = 1000000.0
DEBUG:    LLAMA SERVER GPU: print_info: freq_scale_train = 1
DEBUG:    LLAMA SERVER GPU: print_info: n_ctx_orig_yarn  = 40960
DEBUG:    LLAMA SERVER GPU: print_info: rope_finetuned   = unknown
DEBUG:    LLAMA SERVER GPU: print_info: model type       = 0.6B
DEBUG:    LLAMA SERVER GPU: print_info: model params     = 596.05 M
DEBUG:    LLAMA SERVER GPU: print_info: general.name     = Qwen3-0.6B
DEBUG:    LLAMA SERVER GPU: print_info: vocab type       = BPE
DEBUG:    LLAMA SERVER GPU: print_info: n_vocab          = 151936
DEBUG:    LLAMA SERVER GPU: print_info: n_merges         = 151387
DEBUG:    LLAMA SERVER GPU: print_info: BOS token        = 11 ','
DEBUG:    LLAMA SERVER GPU: print_info: EOS token        = 151645 '<|im_end|>'
DEBUG:    LLAMA SERVER GPU: print_info: EOT token        = 151645 '<|im_end|>'
DEBUG:    LLAMA SERVER GPU: print_info: PAD token        = 151654 '<|vision_pad|>'
DEBUG:    LLAMA SERVER GPU: print_info: LF token         = 198 'Ċ'
DEBUG:    LLAMA SERVER GPU: print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
DEBUG:    LLAMA SERVER GPU: print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
DEBUG:    LLAMA SERVER GPU: print_info: FIM MID token    = 151660 '<|fim_middle|>'
DEBUG:    LLAMA SERVER GPU: print_info: FIM PAD token    = 151662 '<|fim_pad|>'
DEBUG:    LLAMA SERVER GPU: print_info: FIM REP token    = 151663 '<|repo_name|>'
DEBUG:    LLAMA SERVER GPU: print_info: FIM SEP token    = 151664 '<|file_sep|>'
DEBUG:    LLAMA SERVER GPU: print_info: EOG token        = 151643 '<|endoftext|>'
DEBUG:    LLAMA SERVER GPU: print_info: EOG token        = 151645 '<|im_end|>'
DEBUG:    LLAMA SERVER GPU: print_info: EOG token        = 151662 '<|fim_pad|>'
DEBUG:    LLAMA SERVER GPU: print_info: EOG token        = 151663 '<|repo_name|>'
DEBUG:    LLAMA SERVER GPU: print_info: EOG token        = 151664 '<|file_sep|>'
DEBUG:    LLAMA SERVER GPU: print_info: max token length = 256
DEBUG:    LLAMA SERVER GPU: load_tensors: loading model tensors, this can take a while... (mmap = true)
DEBUG:    LLAMA SERVER GPU: llama_model_load: error loading model: make_cpu_buft_list: no CPU backend found
DEBUG:    LLAMA SERVER GPU: llama_model_load_from_file_impl: failed to load model
DEBUG:    LLAMA SERVER GPU: common_init_from_params: failed to load model '/srv/lemonade/hub/models--unsloth--Qwen3-0.6B-GGUF/snapshots/50968a4468ef4233ed78cd7c3de230dd1d61a56b/Qwen3-0.6B-Q4_0.gguf'
DEBUG:    LLAMA SERVER GPU: srv    load_model: failed to load model, '/srv/lemonade/hub/models--unsloth--Qwen3-0.6B-GGUF/snapshots/50968a4468ef4233ed78cd7c3de230dd1d61a56b/Qwen3-0.6B-Q4_0.gguf'
DEBUG:    LLAMA SERVER GPU: srv    operator(): operator(): cleaning up before exit...
DEBUG:    LLAMA SERVER GPU: main: exiting due to model loading error
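
The converted Dockerfile: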
FROM ubuntu:rolling

# Add the entrypoint script to the container
ADD container/entrypoint.sh /

# Set timezone, install dependencies, clone, build, and clean up
RUN ln -sf /usr/share/zoneinfo/Europe/London /etc/localtime && \
    apt-get update && \
    apt-get install -y --no-install-recommends \
    git \
    curl \
    libcurl4-openssl-dev \
    pciutils \
    mesa-vulkan-drivers \
    python3 \
    python3-pip \
    python3-venv \
    python3-invoke \
    python3-yaml \
    python3-typeguard \
    python3-packaging \
    python3-numpy \
    python3-fasteners \
    python3-git \
    python3-watchfiles \
    python3-websockets \
    python3-cpuinfo \
    python3-pytz \
    python3-zstandard \
    python3-fastapi \
    python3-uvicorn \
    python3-jinja2 \
    python3-tabulate \
    python3-sentencepiece \
    python3-dotenv \
    python3-filelock \
    python3-fsspec \
    python3-requests \
    python3-tqdm \
    python3-distro \
    python3-httpx \
    python3-regex \
    python3-protobuf \
    python3-certifi \
    python3-httpcore \
    python3-h11 \
    python3-charset-normalizer \
    python3-urllib3 && \
    python3 -m venv --system-site-packages /opt/venv && \
    . /opt/venv/bin/activate && \
    mkdir -p /root/lemonade && \
    git clone https://github.com/lemonade-sdk/lemonade.git /root/lemonade/src && \
    cd /root/lemonade/src && \
    git checkout v8.1.3 && \
    pip3 install . && \
    apt-get autoremove --purge -y git python3-pip python3-venv && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* /root/* && \
    chmod 755 /entrypoint.sh && \
    mkdir /srv/lemonade

# Set container port, volume, entrypoint, and command
EXPOSE 8000
VOLUME /srv/lemonade
ENTRYPOINT ["/entrypoint.sh"]
CMD ["lemonade-server-dev"]
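
For completeness, a sketch of how the image might be built and run; the image and volume names here are placeholders, and passing /dev/dri through is an assumption for exposing the AMD GPU to the Mesa RADV Vulkan driver inside the container:

# Hypothetical names; adjust to your setup
docker build -t lemonade-server .
docker run -d --name lemonade \
  --device /dev/dri \
  -p 8000:8000 \
  -v lemonade-data:/srv/lemonade \
  lemonade-server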

Signed-off-by: FuchtelJockel <[email protected]>
@FuchtelJockel
Author

@mcowger please attach files instead of pasting long text (logs).
I converted the script into a Containerfile that produces a working container on my machine.

@FuchtelJockel changed the title from "POC: add OCI container" to "WIP: add OCI container" on Sep 1, 2025
@MatthK

MatthK commented Sep 10, 2025

I just tried to build the Docker image and fire up the container. While the build worked and I can access the web GUI, I can't load a model. It keeps saying "Server Offline" and shows the following error message:

Error loading models: NetworkError when attempting to fetch resource.
Check the Lemonade Server logs via the system tray app for more information.

However, the Docker log doesn't show anything, and I'm not sure what other log I could check.
The next question would then be how to take advantage of the Strix Halo GPU of my AMD Ryzen AI MAX+ 395 w/ Radeon 8060S x 32.

@FuchtelJockel
Author

FuchtelJockel commented Sep 10, 2025

@MatthK You are probably running into #229.

The web UI cannot connect to the server while a model is being fetched.

Your GPU supports Vulkan and the instructions are in the README included in this commit. The NPU is only supported on Windows.
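
As a quick check that the container can see the GPU at all (assuming it was started with /dev/dri passed through, as in the run sketch above, and that the container is named lemonade):

docker exec lemonade ls -l /dev/dri    # expect card*/renderD* device nodes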

@MatthK

MatthK commented Sep 11, 2025

I'm not sure it's that problem. I see the "Error loading models" message from the very beginning, before I even do anything: I just open the web GUI and the error already shows up. Lemonade isn't trying to download some standard models by default, is it?
