Conversation

@J-Montgomery (Author)

The code implements a fast, vectorized approximation to exp(x). This trick should work on most backends, but is implemented only for AVX/AVX2/AVX512 to get some feedback on whether this is actually useful.

The constants used are optimized for explainability rather than minimizing average relative error. You can do a bit better with slightly different values over various input ranges, but the actual accuracy didn't seem particularly important in this context.

How it works

An unvectorized version showing how the implementation works:

#include <stdint.h>

float exp_approx(float x) {
    const float log2e = 12102203.2f;  // ≈ log2(e) * 2^23
    const float bias  = 0x3f800000;   // = 127 * 2^23 (the hex literal is used here as an integer value, 1065353216)

    int32_t i = (int32_t)(log2e * x + bias); // ideally compiled to a single FMA
    return *(float *)&i;                     // reinterpret the integer bits as a float
}
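
For a quick sanity check against libm expf() (illustrative only; this assumes the exp_approx above is in scope):

#include <math.h>
#include <stdio.h>

int main(void) {
    // print the approximation next to the reference for a few inputs
    for (float x = -4.0f; x <= 4.0f; x += 1.0f)
        printf("x = %+.1f   approx = %10.5f   expf = %10.5f\n",
               x, exp_approx(x), expf(x));
    return 0;
}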

The implementation works by taking advantage of the standard IEEE-754 float representation:

struct ieee754_float {      // bit layout of a 32-bit float
    uint32_t sign     : 1;
    uint32_t exponent : 8;  // stored with a bias of +127
    uint32_t mantissa : 23;
};
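
For a concrete decode (a small standalone check rather than code from the PR), note that 1.0f is stored as sign 0, exponent 127, mantissa 0, i.e. the bit pattern 0x3F800000, which is numerically equal to the bias constant 127 * 2^23:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float f = 1.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  // 1.0f -> 0x3F800000
    printf("sign=%u exponent=%u mantissa=%u\n",
           bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF);
    // prints: sign=0 exponent=127 mantissa=0
    return 0;
}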

Because IEEE-754 floats essentially store a log2 representation of their value, reinterpreting an integer bit pattern as a float behaves much like 2^x, and reinterpreting a float's bits as an integer behaves much like log2(x). The implementation builds an integer whose bit pattern, read back as a float, approximates 2^x, which gives exp(x) because:
e^x = 2^(x * log2(e))

However, just multiplying by a factor of log2(e) isn't quite enough. Everything before the reinterpretation happens in this log-space representation, so the factor has to be scaled by 2^23 to shift it up past the 23-bit mantissa into the exponent field. The exponent also has to be re-biased, which we do by adding 127, again scaled by 2^23. The code folds the scale and the bias into a single FMA before the cast.
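
As a rough hand-worked check (my own arithmetic, not taken from the code), for x = 1:

i = (int32_t)(12102203.2f * 1.0f + 1065353216.0f) ≈ 1077455360 = 0x4038AA00
0x4038AA00 reinterpreted as a float ≈ 2.885

versus e ≈ 2.71828, i.e. roughly 6% high, close to the worst case of the underlying piecewise-linear 2^x approximation.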

The resulting approximation averages less than 10% relative error over the ±88.0f domain of the original exp() function, but differs slightly in out-of-domain behavior. To replicate the behavior of the existing implementation, this version clamps inputs outside that domain. With clamping it passes CI tests on my system, but it runs somewhat slower than an unclamped version, which does not pass CI.
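
A minimal scalar sketch of the clamped variant (the function name, the use of memcpy, and the exact ±88.0f clamp bounds are illustrative assumptions; the actual PR applies the same idea with AVX intrinsics):

#include <stdint.h>
#include <string.h>

float exp_approx_clamped(float x) {
    // clamp inputs to the ±88.0f domain mentioned above (bounds assumed here)
    if (x >  88.0f) x =  88.0f;
    if (x < -88.0f) x = -88.0f;

    int32_t i = (int32_t)(12102203.2f * x + 1065353216.0f);
    float r;
    memcpy(&r, &i, sizeof r);  // reinterpret the integer bits as a float
    return r;
}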

Testing notes

I've tested this against both the CI tests and interactive use. Nothing I've tested seems broken and I observed a small speed boost over the existing implementation in the interactive case.

Help from someone with more domain expertise in evaluating whether this breaks anything would be greatly appreciated.

The code implements a fast, vectorized approximation to exp(x). The
approximation trick should work on most systems, but is implemented
only for AVX/AVX2/AVX512 to start.

Constants used are optimized for explainability rather than minimizing
average relative error.

The unvectorized implementation looks like:

float exp_fast(float x) {
    // 12102203.2f ≈ log2(e) * 2^23, 0x3f800000 = 127 * 2^23
    int32_t i = (int32_t)(12102203.2f * x + 0x3f800000);
    return *(float *)&i; // reinterpret the integer bits as a float
}

The result matches std::exp() to within 10% average relative error.

Explanation:

We know that
e^x = 2^(x * log2(e))

The code works by taking advantage of the approximate 2^x that falls out of
reinterpreting an integer bit pattern as a float. Doing this correctly
requires re-adding the exponent bias of 127 and multiplying by a log2(e)
factor so that e^x is computed instead of 2^x. Both adjustments fold into
a single FMA.

log2e constant ≈ 2^23 * log2(e)
bias constant  = 127.0 * 2^23

The resulting approximation is valid over the ±88 domain of the exp()
function. In order to replicate the behavior of the existing
implementation, the commit clamps values outside the domain. This allows
the commit to pass CI tests, but it runs somewhat slower than the
unclamped implementation.
github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Oct 28, 2024
@ggerganov (Member)

> The resulting approximation averages less than 10% relative error over the ±88.0f domain of the original exp() function

A 10% relative error sounds a bit high to me. Not sure it is worth the speed-up, which I suspect is very tiny for the overall inference?

@J-Montgomery (Author) commented Oct 29, 2024

I used the term "average relative error" in an extremely confusing way.
To deal with my linguistic shortcomings, it might be easier to look at a graph of what's going on. Here's the implementation that's currently in this PR:
[figure: original_figure (std::exp(x) vs. the approximation, with relative error below)]
The blue line is std::exp(x), the red dash is the approximation; you can see it is essentially a piecewise-linear approximation of the exponential function. The lower plot charts the relative error: the blue line is per-point and the red dash is the computed average, about 4% for these parameters. Both charts hold over the vast majority of the function domain. Very close to the extremes the floating-point representation gets a bit sparse and quantization error pushes the relative error up, but the cases I've seen are still quite small in absolute terms, e.g. 3e-39 vs 7e-40.
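
For anyone who wants to reproduce numbers like these, here is a minimal sketch of the measurement (the sampling range, step size, and plain averaging are my own choices, not necessarily what produced the plots); the constants are parameters so the alternative pairs below can be plugged in as well:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float exp_approx(float x, float log2e, float bias) {
    int32_t i = (int32_t)(log2e * x + bias);
    float r;
    memcpy(&r, &i, sizeof r);  // reinterpret the integer bits as a float
    return r;
}

int main(void) {
    const float log2e = 12102203.2f;    // constants currently in the PR
    const float bias  = 1065353216.0f;  // 0x3f800000 used as a value
    double sum = 0.0;
    long n = 0;
    for (float x = -80.0f; x <= 80.0f; x += 0.001f) {  // step is arbitrary
        double ref = exp((double)x);
        sum += fabs((double)exp_approx(x, log2e, bias) - ref) / ref;
        n++;
    }
    printf("average relative error: %.4f\n", sum / n);
    return 0;
}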

Parameter tuning improves the relative error situation if that's still an issue. For example, here's 1.5% average relative error with log2e = 12102204.0 and bias = 0x3F77B2E0:
[figure: optimized_minimum_average_error]

Another interesting pair of parameters (12102204, 0x3F7A6760) bounds the worst-case error (with the same caveats near the extremes) to under 3% relative, at the cost of ~1.8% average relative error:
[figure: optimized_bounded_error]

It's not entirely clear to me how numerical precision interacts with inference though, so it's a bit difficult to recommend a particular parameter set as opposed to illustrating the general concept.

You can improve on these with additional instructions, but that quickly gets to the point where using @jart's implementation is probably better.

@slaren (Member) commented Oct 29, 2024

The error is high enough that it causes some ops to fail test-backend-ops. Ultimately though, any amount of error is going to be too high unless there is a significant difference in end-to-end performance.

Backend 1/2: CUDA0
  Device description: NVIDIA GeForce RTX 3090 Ti
  Device memory: 24563 MB (23287 MB free)

  CROSS_ENTROPY_LOSS(type=f32,ne=[10,5,4,3]): [SOFT_MAX] NMSE = 0.000006175 > 0.000000100 [CROSS_ENTROPY_LOSS] NMSE = 0.000000210 > 0.000000100 FAIL
  1776/1777 tests passed
  Backend CUDA0: FAIL

@J-Montgomery (Author)

Thanks, I had to enable the CUDA backend to get that test failing properly. It looks like another order of magnitude of accuracy (~0.1%) is enough to get it passing.

Seems like a good stopping point, so I'll close this out unless anyone thinks it's useful to take it further.

@jart (Contributor) commented Oct 29, 2024

Nice approximation. It's pretty far from expf() by IEEE standards, although I'd imagine it'd be good for many use cases. If you include it, I'd put it behind a flag; for example, in llamafile we have a --fast flag that's intended for things like this.
