Conversation

@J-Montgomery (Author)

The code implements a fast, vectorized approximation to exp(x). This trick should work on most backends, but is implemented only for AVX/AVX2/AVX512 to get some feedback on whether this is actually useful.

The constants used are optimized for explainability rather than minimizing average relative error. You can do a bit better with slightly different values over various input ranges, but the actual accuracy didn't seem particularly important in this context.

How it works

An unvectorized version showing how the implementation works:

#include <stdint.h>

float exp_approx(float x) {
    const float log2e = 12102203.2f;  // ≈ log2(e) * 2^23
    const float bias  = 0x3f800000;   // = 127 * 2^23 (the hex literal is used here as an integer value, 1065353216)

    int32_t i = (int32_t)(log2e * x + bias); // ideally compiled to a single FMA
    return *(float *)&i;                     // reinterpret the integer bits as a float
}
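
For a quick sanity check against libm expf() (illustrative only; this assumes the exp_approx above is in scope):

#include <math.h>
#include <stdio.h>

int main(void) {
    // print the approximation next to the reference for a few inputs
    for (float x = -4.0f; x <= 4.0f; x += 1.0f)
        printf("x = %+.1f   approx = %10.5f   expf = %10.5f\n",
               x, exp_approx(x), expf(x));
    return 0;
}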

The implementation works by taking advantage of the standard IEEE-754 float representation:

struct ieee754_float {      // bit layout of a 32-bit float
    uint32_t sign     : 1;
    uint32_t exponent : 8;  // stored with a bias of +127
    uint32_t mantissa : 23;
};
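
For a concrete decode (a small standalone check rather than code from the PR), note that 1.0f is stored as sign 0, exponent 127, mantissa 0, i.e. the bit pattern 0x3F800000, which is numerically equal to the bias constant 127 * 2^23:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float f = 1.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  // 1.0f -> 0x3F800000
    printf("sign=%u exponent=%u mantissa=%u\n",
           bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF);
    // prints: sign=0 exponent=127 mantissa=0
    return 0;
}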

Because IEEE-754 floats essentially store a log2 representation of their value, reinterpreting an integer bit pattern as a float behaves much like 2^x, and reinterpreting a float's bits as an integer behaves much like log2(x). The implementation builds an integer whose bit pattern, read back as a float, approximates 2^x, which gives exp(x) because:
e^x = 2^(x * log2(e))

However, just multiplying by a factor of log2(e) isn't quite enough. Everything before the reinterpretation happens in this log-space representation, so the factor has to be scaled by 2^23 to shift it up past the 23-bit mantissa into the exponent field. The exponent also has to be re-biased, which we do by adding 127, again scaled by 2^23. The code folds the scale and the bias into a single FMA before the cast.
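
As a rough hand-worked check (my own arithmetic, not taken from the code), for x = 1:

i = (int32_t)(12102203.2f * 1.0f + 1065353216.0f) ≈ 1077455360 = 0x4038AA00
0x4038AA00 reinterpreted as a float ≈ 2.885

versus e ≈ 2.71828, i.e. roughly 6% high, close to the worst case of the underlying piecewise-linear 2^x approximation.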

The resulting approximation averages less than 10% relative error over the ±88.0f domain of the original exp() function, but differs slightly in out-of-domain behavior. To replicate the behavior of the existing implementation, this version clamps inputs outside that domain. With clamping it passes CI tests on my system, but it runs somewhat slower than an unclamped version, which does not pass CI.
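
A minimal scalar sketch of the clamped variant (the function name, the use of memcpy, and the exact ±88.0f clamp bounds are illustrative assumptions; the actual PR applies the same idea with AVX intrinsics):

#include <stdint.h>
#include <string.h>

float exp_approx_clamped(float x) {
    // clamp inputs to the ±88.0f domain mentioned above (bounds assumed here)
    if (x >  88.0f) x =  88.0f;
    if (x < -88.0f) x = -88.0f;

    int32_t i = (int32_t)(12102203.2f * x + 1065353216.0f);
    float r;
    memcpy(&r, &i, sizeof r);  // reinterpret the integer bits as a float
    return r;
}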

Testing notes

I've tested this against both the CI tests and interactive use. Nothing I've tested seems broken and I observed a small speed boost over the existing implementation in the interactive case.

Help from someone with more domain expertise in evaluating whether this breaks anything would be greatly appreciated.

The code implements a fast, vectorized approximation to exp(x). The
approximation trick should work on most systems, but is implemented
only for AVX/AVX2/AVX512 to start.

Constants used are optimized for explainability rather than minimizing
average relative error.

The unvectorized implementation looks like:

float exp_fast(float x) {
    // 12102203.2f ≈ log2(e) * 2^23, 0x3f800000 = 127 * 2^23
    int32_t i = (int32_t)(12102203.2f * x + 0x3f800000);
    return *(float *)&i; // reinterpret the integer bits as a float
}

The result matches std::exp() to within 10% average relative error.

Explanation:

We know that
e^x = 2^(x * log2(e))

The code works by taking advantage of the approximate 2^x that falls out of
reinterpreting an integer bit pattern as a float. Doing this correctly
requires re-adding the exponent bias of 127 and multiplying by a log2(e)
factor so that e^x is computed instead of 2^x. Both adjustments fold into
a single FMA.

log2e constant ≈ 2^23 * log2(e)
bias constant  = 127.0 * 2^23

The resulting approximation is valid over the ±88 domain of the exp()
function. In order to replicate the behavior of the existing
implementation, the commit clamps values outside the domain. This allows
the commit to pass CI tests, but it runs somewhat slower than the
unclamped implementation.
github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Oct 28, 2024
@ggerganov (Member)

> The resulting approximation averages less than 10% relative error over the ±88.0f domain of the original exp() function

A 10% relative error sounds a bit high to me. Not sure it is worth the speed-up, which I suspect is very tiny for the overall inference?

@J-Montgomery (Author) commented Oct 29, 2024

I used the term "average relative error" in an extremely confusing way.
To deal with my linguistic shortcomings, it might be easier to look at a graph of what's going on. Here's the implementation that's currently in this PR:
[figure: original_figure (std::exp(x) vs. the approximation, with relative error below)]
The blue line is std::exp(x), the red dash is the approximation; you can see it is essentially a piecewise-linear approximation of the exponential function. The lower plot charts the relative error: the blue line is per-point and the red dash is the computed average, about 4% for these parameters. Both charts hold over the vast majority of the function domain. Very close to the extremes the floating-point representation gets a bit sparse and quantization error pushes the relative error up, but the cases I've seen are still quite small in absolute terms, e.g. 3e-39 vs 7e-40.
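
For anyone who wants to reproduce numbers like these, here is a minimal sketch of the measurement (the sampling range, step size, and plain averaging are my own choices, not necessarily what produced the plots); the constants are parameters so the alternative pairs below can be plugged in as well:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float exp_approx(float x, float log2e, float bias) {
    int32_t i = (int32_t)(log2e * x + bias);
    float r;
    memcpy(&r, &i, sizeof r);  // reinterpret the integer bits as a float
    return r;
}

int main(void) {
    const float log2e = 12102203.2f;    // constants currently in the PR
    const float bias  = 1065353216.0f;  // 0x3f800000 used as a value
    double sum = 0.0;
    long n = 0;
    for (float x = -80.0f; x <= 80.0f; x += 0.001f) {  // step is arbitrary
        double ref = exp((double)x);
        sum += fabs((double)exp_approx(x, log2e, bias) - ref) / ref;
        n++;
    }
    printf("average relative error: %.4f\n", sum / n);
    return 0;
}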

Parameter tuning improves the relative error situation if that's still an issue. For example, here's 1.5% average relative error with log2e = 12102204.0 and bias = 0x3F77B2E0:
[figure: optimized_minimum_average_error]

Another interesting pair of parameters (12102204, 0x3F7A6760) bounds the worst-case error (with the same caveats near the extremes) to under 3% relative, at the cost of ~1.8% average relative error:
[figure: optimized_bounded_error]

It's not entirely clear to me how numerical precision interacts with inference though, so it's a bit difficult to recommend a particular parameter set as opposed to illustrating the general concept.

You can improve on these with additional instructions, but that quickly gets to the point where using @jart's implementation is probably better.

@slaren (Member) commented Oct 29, 2024

The error is high enough that it causes some ops to fail test-backend-ops. Ultimately though, any amount of error is going to be too high unless there is a significant difference in end-to-end performance.

Backend 1/2: CUDA0
  Device description: NVIDIA GeForce RTX 3090 Ti
  Device memory: 24563 MB (23287 MB free)

  CROSS_ENTROPY_LOSS(type=f32,ne=[10,5,4,3]): [SOFT_MAX] NMSE = 0.000006175 > 0.000000100 [CROSS_ENTROPY_LOSS] NMSE = 0.000000210 > 0.000000100 FAIL
  1776/1777 tests passed
  Backend CUDA0: FAIL

@J-Montgomery (Author)

Thanks, I had to enable the CUDA backend to get that test failing properly. It looks like another order of magnitude of accuracy (~0.1%) is enough to get it passing.

Seems like a good stopping point, so I'll close this out unless anyone thinks it's useful to take it further.

@jart (Contributor) commented Oct 29, 2024

Nice approximation. It's pretty far from expf() by IEEE standards, although I'd imagine it'd be good for many use cases. If you include it, I'd put it behind a flag; for example, in llamafile we have a --fast flag that's intended for things like this.
