This repository was archived by the owner on Dec 28, 2025. It is now read-only.

Conversation

@ChALkeR ChALkeR commented Dec 16, 2025

An experiment

Fixes: #22
Fixes: #23

  • Fixes BOM handling for utf-8
  • Fixes replacement support for utf-16
  • Fixes utf-8 mistakes on non-Node.js
    (iconv-lite Buffer usage maps to https://npmjs.com/buffer which has discrepancies)
  • Fixes windows-1252 missing chars (and other legacy single-byte)
  • Adds support for all missing encodings
  • Closes all other known incompatibilities with the encoding spec
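For reference, the BOM behavior that the encoding spec expects for utf-8 can be observed with the platform's built-in TextDecoder. This is only a sketch of the expected behavior, not code from this PR; the variable names are illustrative.

```javascript
// Reference behavior from the built-in TextDecoder (Node.js or a browser).
const bom = [0xef, 0xbb, 0xbf]; // U+FEFF encoded as UTF-8
const bytes = Uint8Array.from([...bom, ...bom, 0x61]); // BOM, U+FEFF, "a"

// Default: only the single leading BOM is stripped; the second U+FEFF is data.
const stripped = new TextDecoder('utf-8').decode(bytes);

// ignoreBOM: true keeps the leading BOM in the output as well.
const kept = new TextDecoder('utf-8', { ignoreBOM: true }).decode(bytes);

console.log(stripped.length, kept.length); // 2 3
```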

This is also faster and ~2x smaller than iconv-lite.

I kept the API compatible (including label names)

This brings in some differences though which might be a blocker for now or at least a reason for a semver-major:

  • The dependency is ESM. That raises the minimum Node.js version requirement to ^20.19.0 || >=22.13.0 (from 18 currently supported by this module). 18 is EOL per Node.js release schedule
  • This also adds support for replacement, as it is expected to be supported in the hook for standards.
    Anything using this module to make a TextDecoder polyfill could get support for replacement unexpectedly.
    Users should check that they are not using replacement manually.
    An alternative would be to specifically block it from being used via this API.

I suggest waiting until a 1.0.0 release before landing (hence a draft), but I wanted to file this as a place to get comments

Member

domenic commented Dec 17, 2025

Thanks for working on this! However, since it'd be semver-major anyway, I think it's best to just deprecate this package, and use exodus/bytes directly in the jsdom ecosystem.

Author

ChALkeR commented Dec 17, 2025

@domenic Yeah, that's ok
This is also a demo of the API differences / completeness

@ChALkeR

This comment was marked as resolved.

@ChALkeR ChALkeR closed this Dec 17, 2025
@ChALkeR ChALkeR force-pushed the chalker/exodus-bytes branch from ea2a8cb to 00690a4 Compare December 17, 2025 09:31
@ChALkeR ChALkeR reopened this Dec 17, 2025
Author

ChALkeR commented Dec 17, 2025

Update:

  1. engines are now compatible with jsdom
  2. normalizeEncoding no longer throws; it returns null on an invalid encoding, like labelToName here
    That removes the try-catch
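A minimal sketch of the "return null instead of throwing" lookup style, assuming a tiny hand-written table. `labelToNameSketch` and `LABELS` are hypothetical names for illustration; this is not the implementation in this package or in @exodus/bytes.

```javascript
// Hypothetical miniature of a label -> name lookup. Only a few labels from the
// Encoding Standard are listed; a real table is much larger.
const LABELS = new Map([
  ['utf-8', 'UTF-8'], ['utf8', 'UTF-8'], ['unicode-1-1-utf-8', 'UTF-8'],
  ['windows-1252', 'windows-1252'], ['latin1', 'windows-1252'], ['ascii', 'windows-1252'],
]);

function labelToNameSketch(label) {
  const key = String(label)
    // Strip leading/trailing ASCII whitespace (tab, LF, FF, CR, space).
    .replace(/^[\t\n\f\r ]+|[\t\n\f\r ]+$/g, '')
    // ASCII-lowercase only, so non-ASCII characters are left untouched.
    .replace(/[A-Z]/g, (c) => c.toLowerCase());
  return LABELS.get(key) ?? null; // null signals an unknown label
}

console.log(labelToNameSketch(' UTF8 ')); // UTF-8
console.log(labelToNameSketch('nope')); // null
```

With a null return, callers can branch on the result directly instead of wrapping the call in try-catch.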

@ChALkeR ChALkeR force-pushed the chalker/exodus-bytes branch from 00690a4 to 15cd210 Compare December 17, 2025 09:43
Author

ChALkeR commented Dec 17, 2025

@domenic I checked usage in jsdom. It also happens through https://npmjs.com/html-encoding-sniffer, which also needs labelToName and expects cased names.

It would be easier to keep lowercase -> cased mapping in a single place, at least while migrating

And then perhaps switch to all-lowercase names directly?
I don't think that cased encoding identifiers are a part of the spec and don't want to maintain them in @exodus/bytes for simplicity and bundle size

While semver-major, this is still a drop-in replacement
It will also be a semver-major for https://npmjs.com/html-encoding-sniffer, due to engines
It won't be a semver-major for jsdom

Also there might be other usage in the ecosystem that would benefit from a switch to a fixed implementation, and it would be easier to do that without having to switch APIs

Member

domenic commented Dec 17, 2025

> I don't think that cased encoding identifiers are a part of the spec

They are? The encoding name concept is defined here https://encoding.spec.whatwg.org/#name and https://encoding.spec.whatwg.org/#names-and-labels is pretty clear that the names are cased, e.g. when it says

> For each encoding, ASCII-lowercasing its name yields one of its labels.

But anyway, I'm happy to move to lowercased names throughout the jsdom ecosystem. Although I prefer jsdom's style as it matches the standard better, it doesn't affect any important user-facing behavior, and the benefit of removing an abstraction layer is high.

So, we can do a semver-major rev of html-encoding-sniffer to return the lowercased names instead of the canonical names, and to bump the engines requirements. The jsdom ecosystem treats semver major bumps as cheap so I'm not really worried about migration costs.
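As a rough illustration of the name/label relationship quoted above, the built-in TextDecoder accepts the ASCII-lowercased form of a spec name as a label and reports the lowercased name through its `encoding` property. Only utf-8 and utf-16le are used here, on the assumption that every runtime supports them; `asciiLower` is an illustrative helper.

```javascript
// Spec names as written in the Encoding Standard's table.
const names = ['UTF-8', 'UTF-16LE'];

// ASCII-lowercase only, matching the spec's wording.
const asciiLower = (s) => s.replace(/[A-Z]/g, (c) => c.toLowerCase());

// The lowercased name is accepted as a label, and `encoding` exposes the lowercased name.
const exposed = names.map((name) => new TextDecoder(asciiLower(name)).encoding);

console.log(exposed); // [ 'utf-8', 'utf-16le' ]
```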

Author

ChALkeR commented Dec 17, 2025

> For each encoding, ASCII-lowercasing its name yields one of its labels.

Hm, true, that implies that name is a string!

Other than that note, those are not exposed anywhere though and could be treated as enums.
The table doesn't list them as strings but as identifiers.

Also:

> If these protocols and formats need to expose the encoding’s name or label, they must expose it as "utf-8".

Author

ChALkeR commented Dec 18, 2025

@domenic I'm adding some more tests to cover all known browser discrepancies (some of them are already fixed in Chrome and WebKit) and then planning to release a v1.0.0 of @exodus/bytes, after which breaking API changes (e.g. in exported helper methods) would be semver-major

Are there any changes you want me to land before then?
The helper methods usage / compat is demonstrated in this PR

Member

domenic commented Dec 19, 2025

Overall it seems great! I guess maybe adding some documentation for those exports would be helpful, especially around your custom concept of "canonicalized encoding label" that your library is based around (i.e., what it uses instead of the spec's encoding names). But that's not a breaking change. I'm excited to use this to fix such long-standing issues in the jsdom ecosystem!

Author

ChALkeR commented Dec 20, 2025

@domenic I just published v1.0.0

Added docs on hooks: https://github.com/exodusoss/bytes#exodusbytesencodingjs
Also added more tests and fixed an instance of Error -> TypeError.
Otherwise, no significant changes.

Member

domenic commented Dec 28, 2025

Closing all issues and PRs as this package is now deprecated.

@domenic domenic closed this Dec 28, 2025

Development

Successfully merging this pull request may close these issues.

feff should not be stripped for legacy multi-byte encodings and utf-8
Document discrepancies with TextDecoder

2 participants