-
Notifications
You must be signed in to change notification settings - Fork 13.4k
add FP8 support to gguf/llama: #10055
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Well... I have many things to correct! |
0723ac5
to
08b6344
Compare
60ece70
to
3faf670
Compare
OK. now build is in good shape. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The native FP8
types are interesting since AFAIK there is hardware support for these.
What are the advantages of the FP8_Q
types, for example compared to the existing Q8_0
type?
Makefile
Outdated
ggml/src/ggml-fp8.cpp \ | ||
ggml/src/ggml-fp8.h \ | ||
ggml/src/ggml-common.h | ||
$(CXX) $(CXXFLAGS) -std=c++17 -c $< -o $@ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should target -std=c++11
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use "static constexpr" in this file that is not allowed with c++11 (and may be more).
And because it is the default with gcc, I develop/test the FP8 template with it.
I see you have it use for GGML_SYCL build to.
So I can try to rewrite it with c++11 before merge request, but is it really needed? Do you have target that do not support it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is not a good reason to change the project standard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have it for GGML_SYCL build, is there a reason to keep that old standart that was the 1er draft for modern C++?
I'll have a try to rewrite it with only c++11 and see how it look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backends are for the most part self-contained and don't affect the rest of the project. What you are trying to merge here would be part of the core ggml and would require changing the standard for the entire project. I would also prefer to use C++17, but that is a discussion for another time.
ggml/src/ggml-common.h
Outdated
// - fp8 simple type | ||
typedef struct { uint8_t bits; } ggml_e5m2_t; | ||
typedef struct { uint8_t bits; } ggml_e4m3_t; | ||
typedef struct { uint8_t bits; } ggml_e3m4_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes my bad. will remove it for next commit.
ggml/src/ggml-fp8.cpp
Outdated
template<int N> constexpr float EXP2() { | ||
if constexpr (N==0) return 1; | ||
if constexpr (N>0) return EXP2<N-1>()*2; | ||
if constexpr (N<0) return EXP2<N+1>()/2; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C++11:
template<int N> constexpr float EXP2() { | |
if constexpr (N==0) return 1; | |
if constexpr (N>0) return EXP2<N-1>()*2; | |
if constexpr (N<0) return EXP2<N+1>()/2; | |
} | |
constexpr float exp2(int n) { | |
return n < 0 ? 1.0f / (1 << -n) : 1 << n; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not work for n > 32 or 64... we have it up to 127 ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
constexpr float exp2(int n) {
return n == 0 ? 1 : n > 0 ? 2 * exp2(n - 1) : 0.5 * exp2(n + 1);
}
ggml/src/ggml-fp8.cpp
Outdated
using type = FP8<_E>; | ||
static constexpr int E=_E; | ||
static constexpr int M=7-_E; | ||
static constexpr int E_BIAS=EXP2<_E-1>()-1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The not c++11 support is this 'static constexpr"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static const int
would do the same here.
Some possible advantages:
some advantage over FP8 is:
Note: In fact I have some good accuracy with bloc of 512 or 1024, but I prefer to keep the same bloc size that is use for other Qn_K quant. |
58863d0
to
4e81ab0
Compare
OK have hard time with macOS compiler... thanks @slaren @ggerganov |
Got it. So the native |
- correct local CI. - correct perplexity log
E5M2 & E4M3: for use with FP8 distributed model E4M3_Q & E3M4_Q: for gguf quantized model. E5M2 and A4M3 type are use like FP16 / BF16 native. E4M3_Q and E3M4_Q are define like Q8_0 with bloc size of 256 (like QK_K)
sorry, i misunderstood what you mean before. now i get it: apologize for my nonsense |
e6c4791
to
1488d7d
Compare
OK rebase with later master and now it is possible made use of c++17. I refactor the fp8 so dot is now in cpu-backend. There is a few copie for now but may be remove later with optimized code.
at least the tg is in line with the Q8 on my CPU. I hope I managed to take into account all/most comments / reviews. |
As far as I can tell, this conversion is not implemented, so it is not actually possible to use FP8 models. |
I have taken a look at the available FP8 models, and they use a single F32 scale So as it is, this implementation cannot be used with FP8 models, and the FP8 Q types perform worse than Q8_0. And I don't think that having a different group size than Q8_0 is a good reason to add them either. I am not sure that there are any practical applications for this code as it is at the moment. I still think that it would be good to have support for FP8 models, but it needs more work to achieve that. |
For this case we don't need to add it in quantization schemes, but can add it in model load ("just" add mul op with it after the matmul, like with bias.)?
FP8_Q is perfectly usable and with good result.
(Yes as you see NVIDIA/INTEL/AMD was wrong not add E3M4 in hardware.) High speed FP8-CPU kernel is WIP but need this PR #10446 for clean/simple Do you want we wait for this kernel before add FP8 ? |
It would be ok for a proof of concept, but it would effectively require adding a special case to every matrix multiplication to support FP8. The best way would be be to add support to ggml for quantization types that use one group/scale per row.
That's not good enough, it's still worse quality than Q8_0, and much slower. As it is, I don't see any motivation to use these types. It needs to offer a meaningful improvement in some aspect to justify the maintenance cost of adding all this code, especially since it would be adding new file types. I suppose one use for the FP8_Q types would be to use them with FP8 models by using the same scale for every group in the row, until we are able to support quantization types with one group per row in ggml. That would still require adding support to perform this conversion in |
I agree with @slaren's evaluation. We have to see more gains from these data types before we merge them. |
I see your point. I have some kernel that use BF16 for compute with that speed:
But I need some time to have it clean for PR and port it to the "cpu-extra-buffer". Do you think this is enough if I can get that? Note: I know how to make it faster but need more test/time for that. |
Any update on this. deep seekv3 use Fp8 as training format so this might be useful |
With the advent of DS R1 also being FP8 this is looking even more important |
super necro post, but I was wondering whether we had any other data types that support block size of 1 smaller than f16? in that area something like fp8e5m2 is useful, because right now almost all quant types use minimum block size of 32, and quite a few new models have tensor dimensions that are not multiples of 32 making those tensors unquantizable. A blocksize 1 quant, even a lossy 8 bit one, would provide some level of compression for those cases. these are me bouncing my idle thoughts off the wall. |
most quant (QN_K ...) use bloc of 256 ;) |
Yeah, why isn't this merged to llama.cpp and ik_llama.cpp? FP8 support is important. Even FP4 models are out. |
also interested now that Vulkan exposes HW support for it via [VK_EXT_shader_float8] !(https://registry.khronos.org/vulkan/specs/latest/man/html/VK_EXT_shader_float8.html) |
After make many test with FP8 this is the first part for add FP8 on llama.cpp.
I made some choose:
I implement 4 FP8 support:
E5M2 & E4M3: for use with FP8 distributed model
it is not for create quantized model from FP16/BF16 model, but for FP8 distributed model like Llama-3.1-405B-FP8 so I only add element for run.
E4M3_Q & E3M4_Q: for gguf quantized model.
With this on we can quantize model I made test with Mistral-Nemo / Mistral-7B and Llama-3.1-8B .
the weight have simple quantization process like with Q8_0, but with size bloc of 256 (Q8_0 use bloc of 32)
For now all is in place for CPU backend as reference for any arch, so only use pure C++. I use openmp-simd for a little speed-up.
I add 2 files gguf-fp8.h gguf-fp8.cpp to a lower change on existing sources, and simple merge.
For now I update Makefile build (need to figure what to do with CMake)
I prefer to use C++ for template on FP8 so I can have the 4 formats with little code size.
So for now any comment is welcome.
Next step is to make a true merge request, before add faster compute.
For those curious I have a high speed version I use for simple test here for zen4. And we have more discussion here