Bug: Step 3.5 Flash higher VRAM usage and loading issues after #1307 #1324

@Quairon-Nailo

What happened?

After #1307, loading Step 3.5 Flash with `-sm graph` takes more VRAM than it did before #1309, and the VRAM distribution across GPUs is different, so the model fails to load unless I reduce the number of GPU layers.
https://streamable.com/6tsws9
https://streamable.com/1kbey0
My build script:

#!/bin/bash
set -e  # Exit immediately if a command exits with a non-zero status

cd /home/quair/Downloads/ik_llama.cpp/
rm -rf build
cmake -B build \
          -DGGML_CUDA=ON \
          -DCMAKE_CUDA_ARCHITECTURES=native

cmake --build build --config Release -j$(nproc)

My load script:

#!/bin/bash
llama-server \
        --host 0.0.0.0 \
        --port 8085 \
        --model "/mnt/Speed/AI/Models/AesSedai/Step-3.5-Flash-GGUF/Step-3.5-Flash-IQ4_XS-00001-of-00003.gguf" \
        -a Step3.5 \
        -b 8192 \
        -ub 8192 \
        --threads 16 \
        --ctx-size 36864 \
        --n-gpu-layers 999 \
        -ot "(1[4-9]|[0-9][0-9])\..*_exps.*=CPU" \
        --no-mmap \
        -fa on \
        -sm graph \
        -ts 180,191 \
        -np 1 \
        -smgs \
        -cram 0 \
        -cuda fusion=1
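
As a side note, the `-ot` regex in the load script can be sanity-checked in isolation. The sketch below (the tensor names are illustrative, assuming llama.cpp's usual `blk.<layer>.<tensor>` naming) shows which expert tensors the pattern routes to CPU; note that the `[0-9][0-9]` alternative already subsumes `1[4-9]`, so effectively all two-digit layers with `_exps` tensors are overridden:

```python
import re

# The regex portion of the -ot flag above; "=CPU" is llama.cpp's
# override-target syntax, not part of the regex itself.
pattern = re.compile(r"(1[4-9]|[0-9][0-9])\..*_exps.*")

# Hypothetical tensor names following the blk.<layer>.<tensor> layout.
names = [
    "blk.3.ffn_up_exps.weight",   # single-digit layer: no match, stays on GPU
    "blk.13.ffn_up_exps.weight",  # matched by [0-9][0-9], goes to CPU
    "blk.14.ffn_up_exps.weight",  # matched, goes to CPU
    "blk.47.ffn_up_exps.weight",  # matched, goes to CPU
    "blk.14.attn_q.weight",       # no _exps suffix: stays on GPU
]

for name in names:
    target = "CPU" if pattern.search(name) else "GPU"
    print(f"{name} -> {target}")
```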

Name and Version

version: 4225 (7065488)
built with cc (GCC) 15.2.1 20260209 for x86_64-pc-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output
