
Conversation

@Alcpz (Contributor) commented Nov 20, 2024


#10133 changed get_rows from false to true. I've detected a big regression for quantizations that support get_rows (llama3 Q8_0, for example).

@uniartisan Could you share more information about the device you used for offloading (where you saw increased performance)? Or did the change only improve your test results?
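For context, the flag being toggled here is the backend's offload_op decision for GGML_OP_GET_ROWS. Below is a minimal sketch of the pattern, assuming a hook shaped like the ones in other ggml backends; the function name and batch-size threshold are illustrative, not the exact SYCL source:

```cpp
// Illustrative sketch only; the real SYCL backend code may differ.
#include "ggml.h"          // ggml_tensor, GGML_OP_GET_ROWS
#include "ggml-backend.h"  // ggml_backend_dev_t

// offload_op decides whether an op whose weights live in host memory should be
// run on the GPU anyway, which first requires copying those weights to VRAM.
static bool sycl_offload_op_sketch(ggml_backend_dev_t dev, const ggml_tensor * op) {
    (void) dev;
    const int min_batch_size = 32;  // assumed batch threshold, as used by other ggml backends

    // "set to false": never offload GGML_OP_GET_ROWS, so the (large) token
    // embedding table stays on the host and the row lookup runs on the CPU.
    return op->ne[1] >= min_batch_size && op->op != GGML_OP_GET_ROWS;
}
```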

An example of regression:

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | pp512 | 1340.34 ± 21.74 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | tg128 | 88.64 ± 0.05 |

build: fab5d30 (4143)

With this revert:

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | pp512 | 5777.93 ± 26.32 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | tg128 | 89.31 ± 0.03 |

build: f4c4ce3

@slaren (Member) commented Nov 20, 2024

Returning true for GGML_OP_GET_ROWS in offload_op will cause the token embeddings to be copied to VRAM, which is almost never worth it since this is a big tensor and this op can be run very cheaply on the CPU. I imagine that RWKV uses get_rows in some way that might make it worthwhile copying the weight to VRAM in that case, and that's why @uniartisan saw a speedup, but it needs to be done in a more selective way.
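As one illustration of what "more selective" could look like (purely a hypothetical sketch, not existing llama.cpp code or a concrete proposal; the size threshold and the use of ggml_nbytes as a proxy are assumptions): only allow offloading get_rows when the table being indexed is small enough that copying it to VRAM can plausibly pay off.

```cpp
// Hypothetical sketch of a more selective offload policy; not actual llama.cpp code.
#include "ggml.h"          // ggml_tensor, ggml_nbytes, GGML_OP_GET_ROWS
#include "ggml-backend.h"  // ggml_backend_dev_t

static bool offload_op_selective_sketch(ggml_backend_dev_t dev, const ggml_tensor * op) {
    (void) dev;
    const int min_batch_size = 32;  // assumed batch threshold

    if (op->op == GGML_OP_GET_ROWS) {
        // src[0] is the table being indexed. For LLaMA-style models this is the
        // token embedding matrix, which is large and cheap to gather on the CPU,
        // so copying it to VRAM is rarely worth it. Only consider offloading when
        // the table is small (the 16 MiB cutoff here is an arbitrary placeholder).
        const size_t table_bytes = ggml_nbytes(op->src[0]);
        return table_bytes <= 16u * 1024u * 1024u;
    }

    return op->ne[1] >= min_batch_size;
}
```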

The github-actions bot added the SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) label on Nov 20, 2024.
@NeoZhangJianyu (Collaborator) commented Nov 21, 2024

@Alcpz
Which GPU did you test with?

PR #10133 has no impact on the Intel Arc 770 for llama2-7b-q4 and Meta-Llama-3-8B.Q8_0.gguf.

@Alcpz (Contributor, Author) commented Nov 21, 2024

I've tested multiple GPUs. The description has data for an Nvidia A100, but I also tested on an Arc 770 and a Data Center GPU Max 1100. On both of these devices I see a performance regression, though I'm using Meta-Llama-3.1-8B-Instruct-Q8_0.gguf.

See additional performance information below:


| ID | Device Type | Name | Version | Compute units | Max work group | Sub group size | Global mem size | Driver version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | [level_zero:gpu:0] | Intel Data Center GPU Max 1100 | 12.60 | 448 | 1024 | 32 | 51539M | 1.3.30049+10 |

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | pp512 | 1204.45 ± 6.63 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | tg128 | 21.83 ± 0.04 |

build: fab5d30 (4143)

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | pp512 | 3228.17 ± 26.07 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | tg128 | 21.82 ± 0.02 |

build: f4c4ce3 (this PR)


| ID | Device Type | Name | Version | Compute units | Max work group | Sub group size | Global mem size | Driver version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | [level_zero:gpu:0] | Intel Arc A770 Graphics | 12.55 | 512 | 1024 | 32 | 16225M | 1.3.30049+10 |

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | pp512 | 883.37 ± 1.00 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | tg128 | 14.99 ± 0.00 |

build: fab5d30 (4143)

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | pp512 | 1288.13 ± 7.94 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | tg128 | 14.98 ± 0.00 |

build: f4c4ce3 (this PR)

@airMeng (Contributor) commented Nov 21, 2024

@NeoZhangJianyu do you mean no regression during decoding phase?

Alcpz changed the title from "sycl : offload of get_rows set to 0" to "sycl : offload of get_rows set to false" on Nov 25, 2024.
@NeoZhangJianyu (Collaborator):
> @NeoZhangJianyu do you mean no regression during decoding phase?

I just tested the model files end to end and did not find any performance change.

@NeoZhangJianyu (Collaborator) commented Nov 29, 2024

@uniartisan
What do you think, since you are the author of PR #10133?

@slaren (Member) commented Nov 29, 2024

@NeoZhangJianyu I assure you, this is a significant performance problem and needs to be fixed as soon as possible. It's hard to tell why you cannot reproduce this without more details about how you are testing.

@Rbiessy (Collaborator) commented Nov 29, 2024

@NeoZhangJianyu you mentioned testing with Meta-Llama-3-8B.Q8_0.gguf while we are using Meta-Llama-3.1-8B-Instruct-Q8_0.gguf. Could that explain why you are not seeing the same performance drop?

@NeoZhangJianyu (Collaborator):
My test ignored the impact on "pp512".

@NeoZhangJianyu NeoZhangJianyu merged commit 0f77aae into ggml-org:master Nov 29, 2024
54 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024