Optimize binary encoding by directly emitting the null byte inside the
generated file. The null byte 00h is a valid UTF-8 character:
https://datatracker.ietf.org/doc/html/rfc3629
Given that we still have the `-sSINGLE_FILE_BINARY_ENCODE=0` setting to
opt out of binary encoding, I propose we take the encoding to its
maximum potential and see if we can get away with emitting the null
byte as-is.
The benefits of this are two-fold:
a) assuming a uniform distribution of encoded bytes, not emitting nulls
takes 0.39% more space (or, put the other way, output with nulls is 0.26% smaller).
b) by not offsetting the bytes, any strings in the emitted binary data
will be directly human-readable, e.g.:
<img width="2464" height="1451" alt="image"
src="https://github.com/user-attachments/assets/e85edc36-da52-4274-8a43-092405a45850"
/>
So C strings will be directly parseable/searchable in the output. That
is appealing.
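
For reference, the size figures in a) can be reproduced with a short sketch. It assumes one occurrence of each byte value 0..255 and ignores escaped characters, which add the same cost under both schemes. The helper names are hypothetical and are not part of the actual patch.

```python
def utf8_len(value: int) -> int:
    """Number of bytes UTF-8 needs for the code point `value`."""
    return len(chr(value).encode("utf-8"))


def total_encoded_size(offset: int) -> int:
    """Encoded size of one occurrence of each byte value 0..255, shifted by `offset`."""
    return sum(utf8_len(b + offset) for b in range(256))


with_offset = total_encoded_size(1)     # previous scheme: bytes map to U+0001..U+0100
without_offset = total_encoded_size(0)  # proposed scheme: bytes map to U+0000..U+00FF

extra = with_offset - without_offset
print(with_offset, without_offset)                          # 385 384
print(f"{100 * extra / 256:.2f}% of raw size")              # 0.39% of raw size
print(f"{100 * extra / with_offset:.2f}% of encoded size")  # 0.26% of encoded size
```

Under these assumptions the offset costs one extra output byte per 256 input bytes, which appears to be where the two percentages come from: 1/256 relative to the raw size, and 1/385 relative to the encoded size.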
I do not currently know of any dealbreaking reasons to avoid nulls,
other than the generic FUD that "editors/toolchains might be buggy when
handling the null."
But those are bugs in the editors, and we do have the
`-sSINGLE_FILE_BINARY_ENCODE=0` fallback to avoid them. Emitting nulls
will let us find out whether there are insurmountable issues with null
bytes in the output. We can always revert to the previous form if a
difficult blocker arises (and as a plus, we will then have learned about
that blocker, which tells us concretely why the approach is not
feasible).
d += 1  # Offset all bytes up by +1 to make zero (a very common value) be encoded with only one byte as 0x01. This is possible since we can encode 255 as 0x100 in UTF-8.
if d == ord('"'):
  # Escape double quote " character with a backslash since we are writing the binary string inside double quotes.
  # Also closure optimizer will turn the string into being delimited with double quotes, even if it were single quotes to start with. (" -> 2 bytes)
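
To illustrate the effect of dropping the offset, here is a minimal sketch of such an encoder. It is not the actual implementation: `binary_encode_sketch` and its `offset` parameter are hypothetical, and the escape set (double quote and backslash only) is an assumption that is likely smaller than the real one.

```python
def binary_encode_sketch(data: bytes, offset: int = 0) -> bytes:
    """Encode raw bytes as UTF-8 text meant for a double-quoted string.

    Hypothetical helper: each byte (plus `offset`) becomes one code point.
    Double quote and backslash are escaped with a backslash; the real
    escape set may be larger.
    """
    out = bytearray()
    for b in data:
        d = b + offset
        if d in (ord('"'), ord("\\")):
            out += bytes([ord("\\"), d])   # escaped character: 2 bytes
        else:
            out += chr(d).encode("utf-8")  # 1 byte below U+0080, else 2 bytes
    return bytes(out)


payload = b'hello\x00\xff"'
plain = binary_encode_sketch(payload, offset=0)    # proposed: null emitted as-is
shifted = binary_encode_sketch(payload, offset=1)  # previous: every byte shifted by +1

print(b"hello" in plain)    # True
print(b"hello" in shifted)  # False
```

With `offset=0` the ASCII text and the null byte pass through unchanged, so embedded C strings stay searchable in the output. With `offset=1` the same text is shifted and no longer matches.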