
Conversation

@aabbdev commented Aug 26, 2025

  • Introduce global btoa() and atob() functions
  • Encoder: fast 12-bit pair-LUT, ~3.6 GB/s
  • Decoder: branchless streaming form, ~0.65 GB/s scalar
  • Tolerant to whitespace, validates padding and invalid input
  • Minimal allocations: only one malloc if input is wide-char
  • Fully compliant with DOMException

QJS (without SIMD): [screenshot, 2025-08-27 01:54:01]
Node 24.3.0 (with SIMD): [screenshot, 2025-08-27 01:54:44]
Deno 2.4.5 (with SIMD): [screenshot, 2025-08-27 01:55:09]

@aabbdev force-pushed the feat/base64 branch 2 times, most recently from b8bd46d to 690d752 on August 27, 2025 at 00:58
@saghul (Contributor) commented Aug 27, 2025

Nice! Quick q: don't these throw DOMException on error?

@aabbdev (Author) commented Aug 27, 2025

True, I don't currently throw DOMException. Bun doesn't either, and neither does Deno (it only returns the message). Node.js uses it occasionally, while Cloudflare Workers relies on it extensively.
I don't think it's mandatory right now; we can move forward without it and consider implementing DOMException more consistently across the runtime later.
[screenshots, 2025-08-27 11:45:56 and 11:46:14]

@aabbdev force-pushed the feat/base64 branch 2 times, most recently from a7b66a2 to 96c34a2 on August 27, 2025 at 10:11
@aabbdev (Author) commented Aug 27, 2025

@saghul I pushed some fixes: less code, and no more cross-platform build errors from the LUTs. I didn't add SIMD acceleration yet since I'm not sure whether QuickJS would accept it.

We can easily replace my scalar implementation with this one: https://github.com/powturbo/Turbo-Base64

@bptato (Contributor) commented Aug 27, 2025

> I don’t think it’s mandatory right now, we can move forward without it and consider implementing DOMException more consistently across the runtime later

Counterpoint: for users who do need the web compat, a btoa/atob without DOMException is worse than no atob/btoa at all, since now I have to undo the incomplete implementation in favor of my complete one. Bun/Deno get away with it because they are runtimes, not engines; here, quickjs.c is the engine and quickjs-libc the runtime.

Feel free to reuse my PR (#1040), I'll just close it then.

@aabbdev (Author) commented Aug 27, 2025

@bptato I'd like to reuse your PR for the error exceptions, is that okay with you?

@bptato (Contributor) commented Aug 27, 2025

Yes, this is what I made it for in the first place.

@aabbdev (Author) commented Aug 27, 2025

@bptato have you pulled master and fixed the conflicts?

Implemented separately from the other errors because it is defined in
terms of WebIDL, where members of an interface are getters on their
prototype.

See the difference between
`JSON.stringify(Object.getOwnPropertyDescriptors(new TypeError()))` vs
`JSON.stringify(Object.getOwnPropertyDescriptors(new DOMException()))`.

Note: the standard doesn't specify where to put "stack".  We follow
existing practice which imitates node instead of browsers.
@bptato (Contributor) commented Aug 27, 2025

OK, I've rebased it.
(For the record: I've also cleaned up a pointless goto from JS_ThrowDOMException, otherwise it's the same as before.)

@aabbdev (Author) commented Aug 27, 2025

@bptato Pushed. I rebased my branch on yours, fixed the conflicts, changed the error messages, and changed the "throw" test to use the new DOMException object.

@aabbdev force-pushed the feat/base64 branch 3 times, most recently from 7bdce32 to bb9cbed on August 27, 2025 at 12:52
@aabbdev (Author) commented Aug 27, 2025

@bptato how would you like to handle the merge? Should this go into the engine or the runtime? In practice, 99% of runtimes expose btoa/atob. Even though they're not in the official spec, almost every JS library that deals with base64, browser- or server-side, relies on them. The same goes for performance intrinsics: they're not part of ECMAScript either, yet QuickJS ships them by default. By the same logic, btoa/atob should be included as well.

Some features come from a formal spec, others become standards simply because everyone uses them. btoa/atob are in that second category. We can ship a correct, efficient, clean implementation so people don’t have to reinvent it. Power users (like Bun) can still swap in their own for performance, but for 99% of cases a solid default is what people want.

@bptato (Contributor) commented Aug 27, 2025

Nice, thanks.

You could also unify JS_AddIntrinsicDOMException and JS_AddIntrinsicBase64, they make little sense separately. (I don't know a good name, esp. if we want structuredClone in the same bracket eventually - I guess JS_AddIntrinsicWeb, but it feels a bit too broad... any ideas?)

> how would you like to handle the merge?

Whatever is more convenient for the maintainers (Saúl & Ben).

> Should this go into the engine or the runtime?

#16 says it should be in the engine, so I think you put it in the right place.

@aabbdev (Author) commented Aug 27, 2025

@bptato I'm a bit busy; I'll check in a few hours to come up with a better name. So you want to unify the two? I was thinking of separating them, as I did, so that developers can choose to activate APIs based on their needs.

@aabbdev (Author) commented Aug 27, 2025

@bptato JS_AddIntrinsicWinterTC(ctx); or JS_AddIntrinsicCommonAPI(ctx);
or granular:

JS_AddIntrinsicWinterTC_Fetch(ctx);
JS_AddIntrinsicWinterTC_URL(ctx);
JS_AddIntrinsicWinterTC_File(ctx);
JS_AddIntrinsicFetch(ctx);
JS_AddIntrinsicURL(ctx);
JS_AddIntrinsicCrypto(ctx);

If someone builds a browser on top of QuickJS they will want to follow different standards, but the majority of our developer base focuses on the server side.

@bptato (Contributor) commented Aug 27, 2025

> I was thinking of separating them like I did so the developers can choose to activate some APIs based on their needs

Chopping it up too granularly just makes the API harder to use.
Right now, you can mistakenly add base64 without DOMException (and then error handling doesn't work). I also can't imagine a realistic scenario where you'd want DOMException by itself. Hence I'm proposing they should be a single package.

> JS_AddIntrinsicWinterTC(ctx); or JS_AddIntrinsicCommonAPI(ctx);

But it doesn't add the entire spec, so these still feel wrong.

How about we just keep calling the unified function JS_AddIntrinsicBase64? Then if we ever get structuredClone, it can be renamed to JS_AddIntrinsicSerialization or something.

@saghul (Contributor) commented Aug 28, 2025

Note that Uint8Array recently got base64 support added to it, so I guess we could piggyback on this implementation for that, and as such, make it builtin?

Thoughts @bnoordhuis ?

@aabbdev (Author) commented Aug 28, 2025

@saghul It's something I can implement; first, I need to find that beautiful double free.

@aabbdev (Author) commented Aug 30, 2025

Edit: I had an endianness problem and pushed a fix; let's see what the tests say, since I don't have the hardware to try it on my side.

@chqrlie (Collaborator) commented Aug 30, 2025

> Encoder: fast 12-bit pair-LUT, ~3.6 GB/s

How much extra performance does this buy? It adds 8 KB of data, makes the code less readable, and is prone to endianness issues. My own benchmarks (M2) actually show 2 to 5% slower times compared to the naive code:

#include <stddef.h>
#include <stdint.h>

/* standard RFC 4648 alphabet (assumed definition of B64_ENC) */
static const char B64_ENC[64] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

static inline size_t b64_encode_naive(const uint8_t *src, size_t len, char *dst) {
    size_t i = 0, j = 0;

    for (; i + 2 < len; i += 3, j += 4) {
        uint32_t v = ((uint32_t)src[i] << 16) | ((uint32_t)src[i+1] << 8) | (uint32_t)src[i+2];
        dst[j+0] = B64_ENC[(v >> 18)];
        dst[j+1] = B64_ENC[(v >> 12) & 63];
        dst[j+2] = B64_ENC[(v >> 6)  & 63];
        dst[j+3] = B64_ENC[(v >> 0)  & 63];
    }

    size_t rem = len - i;
    if (rem == 1) {
        uint32_t v = ((uint32_t)src[i] << 16);
        dst[j++] = B64_ENC[(v >> 18)];
        dst[j++] = B64_ENC[(v >> 12) & 63];
        dst[j++] = '=';
        dst[j++] = '=';
    } else if (rem == 2) {
        uint32_t v = ((uint32_t)src[i] << 16) | ((uint32_t)src[i+1] << 8);
        dst[j++] = B64_ENC[(v >> 18)];
        dst[j++] = B64_ENC[(v >> 12) & 63];
        dst[j++] = B64_ENC[(v >> 6)  & 63];
        dst[j++] = '=';
    }
    return j;
}

@chqrlie (Collaborator) commented Aug 30, 2025

Regarding decoding, here is a faster function (20-25% faster) that uses a single 256-byte table:


// LUT values for valid chars; entries not listed default to 0 = invalid
static const int8_t B64_CODE[256] = {
    // whitespace
    [' ']=-1, ['\t']=-1, ['\r']=-1, ['\n']=-1,
    // padding
    ['=']=-2,
    // valid chars
    ['A']=1, ['B']=2, ['C']=3, ['D']=4, ['E']=5, ['F']=6, ['G']=7, ['H']=8,
    ['I']=9, ['J']=10,['K']=11,['L']=12,['M']=13,['N']=14,['O']=15,['P']=16,
    ['Q']=17,['R']=18,['S']=19,['T']=20,['U']=21,['V']=22,['W']=23,['X']=24,
    ['Y']=25,['Z']=26,
    ['a']=27,['b']=28,['c']=29,['d']=30,['e']=31,['f']=32,['g']=33,['h']=34,
    ['i']=35,['j']=36,['k']=37,['l']=38,['m']=39,['n']=40,['o']=41,['p']=42,
    ['q']=43,['r']=44,['s']=45,['t']=46,['u']=47,['v']=48,['w']=49,['x']=50,
    ['y']=51,['z']=52,
    ['0']=53,['1']=54,['2']=55,['3']=56,['4']=57,['5']=58,['6']=59,['7']=60,
    ['8']=61,['9']=62,
    ['-']=63, ['_']=64, // base64url; swap to '+'/'/' if using standard base64
};

/* assume the usual branch-prediction hint macro */
#ifndef unlikely
#define unlikely(x) __builtin_expect(!!(x), 0)
#endif

static inline size_t
b64_decode_faster(const char *src, size_t len, uint8_t *dst, int *err)
{
    uint32_t acc = 0;
    int bits = 0;
    size_t j = 0;

    if (unlikely(err)) *err = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned ch = (unsigned char)src[i];
        int flag = B64_CODE[ch];

        if (flag > 0) {
            // normal sextet
            acc = (acc << 6) + (flag - 1);
            bits += 6;
            if (bits == 24) {
                bits = 0;
                dst[j++] = (uint8_t)(acc >> 16);
                dst[j++] = (uint8_t)((acc >> 8) & 0xFF);
                dst[j++] = (uint8_t)(acc & 0xFF);
            }
        } else {
            if (flag == -1) {
                // whitespace -> skip
                continue;
            }
            if (flag == -2) {
                // '=' padding
                // flush pending bytes
                while (bits >= 8) {
                    bits -= 8;
                    dst[j++] = (uint8_t)((acc >> bits) & 0xFF);
                }
                // After '=', only ws or '=' is valid
                // Validate remaining input
                for (size_t k = i + 1; k < len; k++) {
                    unsigned ch2 = (unsigned char)src[k];
                    int f2 = B64_CODE[ch2];
                    if (f2 == -1) continue; // ws
                    if (ch2 != '=') goto fail;
                }
                break;
            } else {
                // invalid
                goto fail;
            }
        }
    }

    // Leftover bits are only valid if 0–2 '=' pads handled it
    if (unlikely(bits >= 6))
        goto fail;

    return j;
fail:
    if (err) *err = 1;
    return 0;
}

@aabbdev (Author) commented Aug 30, 2025

@chqrlie I’ve also force-pushed/rebased and fixed my implementation, so the encoder should be in better shape now. ;)

I'll look at the decoder, and I'm open to proposals.

From what I see, the main bottleneck isn't the encoder/decoder itself but the btoa/atob wrappers. If you have ideas for optimizing atob, I'd love to hear them. Targeting at least 1 GB/s seems realistic; later we can swap in a faster base64 implementation without touching the wrappers.

@chqrlie (Collaborator) commented Aug 30, 2025

> @chqrlie I’ve also force-pushed/rebased and fixed my implementation, so the encoder should be in better shape now. ;)

The b64_encode and b64_decode functions seem identical to what I tested this morning.

> I'll look at the decoder, and I'm open to proposals.

My version is faster because it handles 3 bytes at a time in the main case and flushes the pending bytes in the = case.

I have not looked at the wrappers, but the bulk of the conversion is performed in the lower-level functions. I just wanted to underscore that the 12-bit LUT tables are actually slower than the simpler naive implementation on my CPU, so I was wondering if you could test on your own CPU (presumably Intel-based).

@saghul (Contributor) commented Sep 1, 2025

@aabbdev DOMException landed, can you please rebase your PR on top of master? Thanks!
