rocm fork very slow on some/large models (goliath q3_k_s) #530
ChristophHaag started this conversation in General
The ROCm fork has no issue tracker, so I'll post here.
With some smaller models the ROCm fork has worked fine, but running goliath q3_k_s, for example, is very slow.
Arch Linux, Ryzen 3950X, Radeon RX 6900 XT, 64 GB of 3200 MHz RAM.
I know it's not going to be fast on this hardware, but with CLBlast it's still much faster than with ROCm. As far as I can tell it's not filling VRAM or RAM (I'm trying zram at the moment, so it's somewhat hard to tell).
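To actually watch memory while it loads and generates, here is a minimal sketch (assuming the 6900 XT is card0 and that the amdgpu sysfs counters are exposed at the usual paths; adjust as needed):

```python
#!/usr/bin/env python3
# Minimal memory monitor: poll amdgpu VRAM counters from sysfs plus
# MemAvailable from /proc/meminfo while the model is loading/generating.
# Assumes the 6900 XT is card0; change the paths if it is a different card.
import time

VRAM_USED = "/sys/class/drm/card0/device/mem_info_vram_used"
VRAM_TOTAL = "/sys/class/drm/card0/device/mem_info_vram_total"

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

def mem_available_kib():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])  # value is reported in kB
    return 0

while True:
    used_gib = read_int(VRAM_USED) / 2**30
    total_gib = read_int(VRAM_TOTAL) / 2**30
    avail_gib = mem_available_kib() / 2**20
    print(f"VRAM {used_gib:5.2f}/{total_gib:5.2f} GiB | RAM available {avail_gib:6.2f} GiB")
    time.sleep(5)
```

Full command lines and logs for both runs follow.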
koboldcpp-rocm 11aa596: Processing:112.84s (7522.5ms/T), Generation:886.43s (42210.8ms/T), Total:999.26s (0.02T/s)
python koboldcpp-rocm/koboldcpp.py goliath-120b.Q3_K_S.gguf --usecublas mmq --gpulayers 16 --contextsize 4096
***
Welcome to KoboldCpp - Version 1.49.yr1-ROCm
Attempting to use hipBLAS library for faster prompt ingestion. A compatible AMD GPU will be required.
Initializing dynamic library: koboldcpp_hipblas.so
==========
Namespace(model=None, model_param='goliath-120b.Q3_K_S.gguf', port=5001, port_param=5001, host='', launch=False, lora=None, config=None, threads=15, blasthreads=15, highpriority=False, contextsize=4096, blasbatchsize=512, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['mmq'], gpulayers=16, tensor_split=None, onready='', multiuser=False, remotetunnel=False, foreground=False, preloadstory='')
==========
Loading model: goliath-120b.Q3_K_S.gguf
[Threads: 15, BlasThreads: 15, SmartContext: False, ContextShift: True]
Identified as LLAMA model: (ver 6)
Attempting to Load...
Using automatic RoPE scaling (scale:1.000, base:10000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: maybe
ggml_init_cublas: CUDA_USE_TENSOR_CORES: maybe
ggml_init_cublas: found 1 ROCm devices:
Device 0: AMD Radeon RX 6900 XT, compute capability 10.3
llama_model_loader: loaded meta data with 20 key-value pairs and 1236 tensors from goliath-120b.Q3_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 137
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 117.75 B
llm_load_print_meta: model size = 47.22 GiB (3.45 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.45 MB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required = 42746.17 MB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/140 layers to GPU
llm_load_tensors: VRAM used: 5611.00 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 2192.00 MB
llama_build_graph: non-view tensors processed: 3155/3155
llama_new_context_with_model: compute buffer total size = 574.63 MB
llama_new_context_with_model: VRAM scratch buffer: 568.00 MB
llama_new_context_with_model: total VRAM used: 6179.00 MB (model: 5611.00 MB, context: 568.00 MB)
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
Input: {"n": 1, "max_context_length": 4096, "max_length": 1024, "rep_pen": 1.15, "temperature": 1.5, "top_p": 1, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 0.69, "rep_pen_range": 1024, "rep_pen_slope": 0.1, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "min_p": 0, "genkey": "KCPP4587", "prompt": "\nUSER: tell me a joke\nASSISTANT: ", "quiet": true, "stop_sequence": ["USER:", "ASSISTANT:"], "use_default_badwordsids": false}
Processing Prompt (15 / 15 tokens)
Generating (21 / 1024 tokens)
(EOS token triggered!)
ContextLimit: 36/4096, Processing:112.84s (7522.5ms/T), Generation:886.43s (42210.8ms/T), Total:999.26s (0.02T/s)
Output: Why did the tomato turn red?
Because it saw the salad dressing!
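For anyone who wants to replay the same request against a running server, a minimal sketch (assuming the KoboldAI-style /api/v1/generate endpoint and response shape; the payload fields are copied from the Input line logged above, everything else is left at defaults):

```python
# Replay the benchmark prompt against a running KoboldCpp instance and time it.
# Endpoint and response shape assume the KoboldAI-compatible API
# ({"results": [{"text": ...}]}); payload fields come from the logged Input line.
import json, time, urllib.request

payload = {
    "prompt": "\nUSER: tell me a joke\nASSISTANT: ",
    "max_context_length": 4096,
    "max_length": 1024,
    "temperature": 1.5,
    "stop_sequence": ["USER:", "ASSISTANT:"],
}

req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

start = time.time()
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)
print(f"{time.time() - start:.1f}s total")
print(result["results"][0]["text"])
```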
koboldcpp a00a32e: Processing:10.15s (676.8ms/T), Generation:24.14s (1149.3ms/T), Total:34.29s (0.61T/s)
RUSTICL_ENABLE=radeonsi python /koboldcpp/koboldcpp.py goliath-120b.Q3_K_S.gguf --useclblast 0 0 --gpulayers 16 --contextsize 4096
***
Welcome to KoboldCpp - Version 1.49
Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast will be required.
Initializing dynamic library: koboldcpp_clblast.so
==========
Namespace(model=None, model_param='goliath-120b.Q3_K_S.gguf', port=5001, port_param=5001, host='', launch=False, lora=None, config=None, threads=15, blasthreads=15, highpriority=False, contextsize=4096, blasbatchsize=512, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=[0, 0], usecublas=None, gpulayers=16, tensor_split=None, onready='', multiuser=False, remotetunnel=False, foreground=False, preloadstory='')
==========
Loading model: goliath-120b.Q3_K_S.gguf
[Threads: 15, BlasThreads: 15, SmartContext: False, ContextShift: True]
Identified as LLAMA model: (ver 6)
Attempting to Load...
Using automatic RoPE scaling (scale:1.000, base:10000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
Platform:0 Device:0 - rusticl with AMD Radeon RX 6900 XT (navi21, LLVM 16.0.6, DRM 3.54, 6.6.1-arch1-1)
Platform:1 Device:0 - AMD Accelerated Parallel Processing with gfx1030
ggml_opencl: selecting platform: 'rusticl'
ggml_opencl: selecting device: 'AMD Radeon RX 6900 XT (navi21, LLVM 16.0.6, DRM 3.54, 6.6.1-arch1-1)'
ggml_opencl: device FP16 support: false
CL FP16 temporarily disabled pending further optimization.
llama_model_loader: loaded meta data with 20 key-value pairs and 1236 tensors from goliath-120b.Q3_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 137
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 117.75 B
llm_load_print_meta: model size = 47.22 GiB (3.45 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.45 MB
llm_load_tensors: using OpenCL for GPU acceleration
llm_load_tensors: mem required = 42746.17 MB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/138 layers to GPU
llm_load_tensors: VRAM used: 5611.00 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 2192.00 MB
llama_build_graph: non-view tensors processed: 3155/3155
llama_new_context_with_model: compute buffer total size = 574.63 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
Input: {"n": 1, "max_context_length": 4096, "max_length": 1024, "rep_pen": 1.15, "temperature": 1.5, "top_p": 1, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 0.69, "rep_pen_range": 1024, "rep_pen_slope": 0.1, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "min_p": 0, "genkey": "KCPP8849", "prompt": "\nUSER: tell me a joke\nASSISTANT: ", "quiet": true, "stop_sequence": ["USER:", "ASSISTANT:"], "use_default_badwordsids": false}
Processing Prompt (15 / 15 tokens)
Generating (21 / 1024 tokens)
(EOS token triggered!)
ContextLimit: 36/4096, Processing:10.15s (676.8ms/T), Generation:24.14s (1149.3ms/T), Total:34.29s (0.61T/s)
Output: Why did the tomato turn red?
Because it saw the salad dressing!
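Putting the two runs side by side, the gap works out to roughly 11x on prompt processing and 37x on generation (taken straight from the per-token timings above):

```python
# Ratio of the two runs, using the per-token timings printed above.
rocm_pp_ms, rocm_gen_ms = 7522.5, 42210.8   # koboldcpp-rocm 11aa596 (hipBLAS)
cl_pp_ms, cl_gen_ms = 676.8, 1149.3         # koboldcpp a00a32e (CLBlast via rusticl)

print(f"prompt processing: {rocm_pp_ms / cl_pp_ms:.1f}x slower with the ROCm build")
print(f"generation:        {rocm_gen_ms / cl_gen_ms:.1f}x slower with the ROCm build")
```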