
Conversation

@rhettinger (Contributor) commented Sep 26, 2024

The kde() quartic and triweight kernels employ the Newton-Raphson method to compute their inverse-CDF functions. For a starting point, they need a reasonably good estimate of the inverse function. This PR improves those estimates, giving both a lower maximum error and a lower mean absolute deviation.
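For context, the refinement step looks roughly like this. This is a sketch of the general technique, not the actual statistics-module internals; the names `quartic_pdf`, `quartic_cdf`, and `newton_invcdf` are illustrative:

```python
def quartic_pdf(t):
    # Quartic (biweight) kernel density on [-1, 1]: 15/16 * (1 - t**2)**2
    return 15/16 * (1.0 - t*t) ** 2

def quartic_cdf(t):
    # Integral of the kernel, shifted so cdf(-1) = 0 and cdf(1) = 1:
    # 3/16*t**5 - 5/8*t**3 + 15/16*t + 1/2, written in Horner form
    return ((3/16 * t*t - 5/8) * t*t + 15/16) * t + 1/2

def newton_invcdf(p, estimate, iterations=8):
    # Refine an initial guess x0 = estimate(p) with Newton-Raphson:
    #     x_{n+1} = x_n - (cdf(x_n) - p) / pdf(x_n)
    # The pdf vanishes at the endpoints, so the iteration can behave
    # poorly near p = 0 or p = 1 -- one reason the quality of the
    # starting estimate matters.
    x = estimate(p)
    for _ in range(iterations):
        x -= (quartic_cdf(x) - p) / quartic_pdf(x)
    return x
```

Even a crude seed such as `lambda p: 2*p - 1` converges for interior values of `p`; a better seed lets a fixed, small iteration count reach full precision.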

Measurement tooling:

from math import sumprod, sin

def cdf(t):
    return sumprod((3/16, -5/8, 15/16, 1/2), (t**5, t**3, t, 1.0))

def invcdf_exact(p):
    'Use bisection to locate best inverse and then run forward'
    xlo, xhi = -1.0, 1.0
    plo, phi = 0.0, 1.0

    if p == plo: return (p, xlo)
    if p == phi: return (p, xhi)

    for _ in range(55):
        xmid = (xlo + xhi) / 2
        pmid = cdf(xmid)
        if p < pmid:
            phi = pmid
            xhi = xmid
        else:
            plo = pmid
            xlo = xmid
    x = (xlo + xhi) / 2
    p = cdf(x)
    return p, x

def frange(lo, hi, n):
    width = (hi - lo) / n
    return [lo + width * i for i in range(n+1)]

def max_error(invcdf_estimated):
    err = 0.0
    for p in frange(0.0, 1.0, 2**20):
        p, xtgt = invcdf_exact(p)
        xhat = invcdf_estimated(p)
        delta = abs(xhat - xtgt)
        err = max(err, delta)
    return err

def baseline_invcdf_est(p):
    sign, p = (1.0, p) if p <= 1/2 else (-1.0, 1.0 - p)
    x = (2.0 * p) ** 0.4258865685331 - 1.0
    if 0.004 < p < 0.499:
        x += 0.026818732 * sin(7.101753784 * p + 2.73230839482953)
    return x * sign

def new_invcdf_est(p):
    sign, p = (1.0, p) if p <= 1/2 else (-1.0, 1.0 - p)
    if p < 0.0106:
        return ((2.0 * p) ** 0.3838 - 1.0) * sign
    x = (2.0 * p) ** 0.4258865685331 - 1.0
    if p < 0.499:
        x += 0.026818732 * sin(7.101753784 * p + 2.73230839482953)
    return x * sign

if __name__ == '__main__':

    for invcdf_est in baseline_invcdf_est, new_invcdf_est:
        name = invcdf_est.__name__
        error = max_error(invcdf_est)
        print(f'{error=:.6f} {name}')

@rhettinger changed the title from Improve accuracy of quartic_invcdf_estimate to Improve accuracy of kde() invcdf estimates on Sep 27, 2024
@eendebakpt (Contributor) commented
For the _quartic_invcdf_estimate I optimized the parameters using scipy.minimize. This function

    def _quartic_invcdf_estimate(p):
        sign, p = (1.0, p) if p <= 1/2 else (-1.0, 1.0 - p)
        if p < 0.0106:
            return ((1.9996239723034928 * p) ** 0.38546654940335057 - 1.0022674907227662) * sign
        x = (1.9999998194849533 * p) ** 0.4206905729144752 - 0.9999999839065042
        x += 0.029512700764239685 * sin(7.100744642607112 * p + 2.7328131666178157)
        return x * sign

has a maximum error of 0.0084 (for comparison: main has maximum error 0.0256, this PR has maximum error 0.0131). The function has the same structure, but omits the check p < 0.499. The value at p=0.5 is 2.7812396306266406e-17 (i.e., almost zero). If an exact zero is preferred, then adding a special case for p=0.5 is possible. Below is a plot of current main, the method from this PR, and the optimized method.

[Plot comparing the errors of current main, the PR method, and the optimized method]

@rhettinger If you are interested, I can make the optimization code available and also optimize the other inverse CDF. Aside from these optimization considerations, the PR looks good to me.
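The optimization code itself was not posted; a minimal sketch of how such a fit could be set up follows. The grid resolution, parameterization, and starting values here are illustrative assumptions, not eendebakpt's actual setup:

```python
from math import sin

def cdf(t):
    # Quartic kernel CDF (same polynomial as in the measurement tooling)
    return ((3/16 * t*t - 5/8) * t*t + 15/16) * t + 1/2

def invcdf_exact(p, n=55):
    # Bisection on the monotone CDF over [-1, 1]
    xlo, xhi = -1.0, 1.0
    for _ in range(n):
        xmid = (xlo + xhi) / 2
        if p < cdf(xmid):
            xhi = xmid
        else:
            xlo = xmid
    return (xlo + xhi) / 2

GRID = [i / 4096 for i in range(4097)]

def objective(theta):
    # Maximum absolute error of a parameterized estimate over the grid;
    # this is the quantity a minimizer would drive down.
    a, b, c, d, e, f = theta
    def est(p):
        sign, p = (1.0, p) if p <= 1/2 else (-1.0, 1.0 - p)
        x = (a * p) ** b - c
        x += d * sin(e * p + f)
        return x * sign
    return max(abs(est(p) - invcdf_exact(p)) for p in GRID)

# Hypothetical optimization call (requires SciPy).  Nelder-Mead avoids
# needing gradients of the non-smooth max-error objective:
#   from scipy.optimize import minimize
#   theta0 = [2.0, 0.4258865685331, 1.0,
#             0.026818732, 7.101753784, 2.73230839482953]
#   result = minimize(objective, theta0, method='Nelder-Mead')
```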

@rhettinger (Contributor, Author) commented
Thanks for putting some work into this. I'm going to stick with the PR version though. In addition to tamping down the maximum error, we also want to keep down the mean absolute deviation and standard deviation. Other ideals include being exact and continuous at 0.0, 0.5, and 1.0 and never returning a value outside of -1.0 <= x <= 1.0. The PR version is mostly good enough, though I would be happier if the estimate and its first derivative were continuous across the input range. If this gets tweaked further, it likely needs a piecewise approximation with exact value and first derivative agreement at the piece connection points.
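A quick numeric probe (a sketch, not part of the PR) illustrates the kind of discontinuity being discussed: the PR estimate switches formulas at p = 0.0106, and the two branches do not meet there:

```python
from math import sin

def new_invcdf_est(p):
    # PR version of the quartic inverse-CDF estimate
    sign, p = (1.0, p) if p <= 1/2 else (-1.0, 1.0 - p)
    if p < 0.0106:
        return ((2.0 * p) ** 0.3838 - 1.0) * sign
    x = (2.0 * p) ** 0.4258865685331 - 1.0
    if p < 0.499:
        x += 0.026818732 * sin(7.101753784 * p + 2.73230839482953)
    return x * sign

# Evaluate each branch just at and just below the piece boundary
eps = 1e-12
jump = new_invcdf_est(0.0106) - new_invcdf_est(0.0106 - eps)
print(f"{jump=:.6f}")
```

The jump in x is roughly 0.025; the boundary appears to sit where the two branches err by similar amounts in opposite directions, so each branch individually stays within the quoted 0.0131 maximum error despite the discontinuity.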

@rhettinger merged commit 4b89c5e into python:main on Sep 27, 2024
32 checks passed
@rhettinger (Contributor, Author) commented
FWIW, this is the extra evaluation code I used:

from statistics import mean, quantiles, stdev

def compute_errors(p_inputs, invcdf_estimated):
    xarr = []
    parr = []
    errors = []
    for p in p_inputs:
        p, xtgt = invcdf_exact(p)
        xhat = invcdf_estimated(p)
        xarr.append(xtgt)
        parr.append(p)
        errors.append(xhat - xtgt)
    return parr, xarr, errors

parr, xarr, errors = compute_errors(frange(0, 1, 2**20), baseline_invcdf_est)
assert min(xarr) >= -1.0 and max(xarr) <= 1.0
print(f'{mean(map(abs, errors))=}\t\t{stdev(errors)=}')
print(min(errors, key=abs), quantiles(map(abs, errors)), max(errors, key=abs))

Here are the results for the baseline, PR, and optimized variants:

baseline:
mean(map(abs, errors))=0.002040011700427382		stdev(errors)=0.0037462609013369052
0.0 [0.0004600527610310601, 0.0013789189103015909, 0.002204903435665151] 0.025565143408801783

pr:
mean(map(abs, errors))=0.0017825921273483375		stdev(errors)=0.00274805011129947
0.0 [0.00045672881942306065, 0.0013714790099657037, 0.002169513807214085] -0.013152195170197833

optimized:
mean(map(abs, errors))=0.002457283836763061		stdev(errors)=0.0031467983483116236
9.279260693162428e-11 [0.0008678833129890928, 0.0020688450312686957, 0.0038715193678539594] 0.008426259817138737

@rhettinger deleted the kde_quartic branch on October 1, 2024

Labels

performance (Performance or resource usage), skip issue, skip news
