
perf(runtime): rewrite case-change in C++ to skip UTF-8 round-trip #26773

Open
cirospaciari wants to merge 2 commits into main from ciro/case-change-cpp

Conversation

@cirospaciari
Member

@cirospaciari cirospaciari commented Feb 6, 2026

Summary

Follow-up to #26772, which introduced the 11 case-changing utility methods in Zig.

  • Rewrites the implementation from Zig to C++, eliminating two unnecessary allocations and transcoding steps per call
  • The Zig implementation converted every JS string to UTF-8 via bunstr.toUTF8(allocator), processed codepoints, then converted back via bun.String.cloneUTF8(result_bytes). The new C++ implementation works directly with JSC's native string encoding (Latin1 or UTF-16) using StringView, StringBuilder, and ICU — same pattern as stripANSI.cpp
  • New CaseChange.cpp + CaseChange.h with the algorithm templated on Latin1Character/UChar; 11 functions registered directly in bunObjectTable as C++ host functions; deleted string_case.zig and removed the icu_toUpper/icu_toLower C-extern bridge wrappers
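
To illustrate what "templated on Latin1Character/UChar" can look like, here is a minimal, self-contained sketch. It is not the code from CaseChange.cpp: `toSnakeCaseAscii` is a hypothetical name, `char` and `char16_t` stand in for `Latin1Character` and `UChar`, `std::basic_string` stands in for `StringBuilder`, and ASCII-only checks stand in for ICU. The point is only that one template body can serve both encodings without transcoding to UTF-8 first.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical sketch: one algorithm body instantiated for both string
// encodings. CharT = char models a Latin-1 code unit, CharT = char16_t models
// a UTF-16 code unit. Only ASCII letters and digits are classified here;
// every other code unit is treated as a separator (an ICU-backed version
// would classify the full Unicode range instead).
template <typename CharT>
std::basic_string<CharT> toSnakeCaseAscii(std::basic_string_view<CharT> input)
{
    auto isLower = [](CharT c) { return c >= CharT('a') && c <= CharT('z'); };
    auto isUpper = [](CharT c) { return c >= CharT('A') && c <= CharT('Z'); };
    auto isDigit = [](CharT c) { return c >= CharT('0') && c <= CharT('9'); };

    std::basic_string<CharT> out;
    out.reserve(input.size() + 4);

    bool prevLowerOrDigit = false; // was the previous word character lower/digit?
    bool pendingSeparator = false; // emit '_' before the next word character?

    for (CharT c : input) {
        if (!(isLower(c) || isUpper(c) || isDigit(c))) {
            // A run of separators collapses into one '_', and none is
            // emitted at the very start of the output.
            pendingSeparator = !out.empty();
            prevLowerOrDigit = false;
            continue;
        }
        // A lower/digit -> upper transition starts a new word ("fooBar").
        if (isUpper(c) && prevLowerOrDigit)
            pendingSeparator = true;
        if (pendingSeparator) {
            out.push_back(CharT('_'));
            pendingSeparator = false;
        }
        // Lowercase ASCII letters in place; no intermediate UTF-8 buffer.
        out.push_back(isUpper(c) ? CharT(c + (CharT('a') - CharT('A'))) : c);
        prevLowerOrDigit = isLower(c) || isDigit(c);
    }
    return out;
}
```

Calling `toSnakeCaseAscii(std::string_view("helloWorld"))` and `toSnakeCaseAscii(std::u16string_view(u"helloWorld"))` runs the same logic on 8-bit and 16-bit code units respectively, which is the property the PR relies on to avoid the round-trip.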

Test plan

  • All 2098 existing case-change tests pass (bun bd test test/js/bun/util/case-change.test.ts)
  • No regressions in other tests

Changelog

cirospaciari and others added 2 commits February 6, 2026 00:31
…5087)

Add Bun.camelCase, pascalCase, snakeCase, kebabCase, constantCase,
dotCase, capitalCase, trainCase, pathCase, sentenceCase, and noCase
matching the change-case npm package. Uses ICU for full Unicode support
and bun.strings.UnsignedCodepointIterator for codepoint iteration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-trip

Rewrites the 11 case-changing utility methods (camelCase, pascalCase,
snakeCase, etc.) from Zig to C++, eliminating two unnecessary
allocations and transcoding steps.

The Zig implementation converted every JS string to UTF-8 via
bunstr.toUTF8(allocator), processed codepoints, then converted back
via bun.String.cloneUTF8(result_bytes) — two unnecessary allocations
+ transcoding for every call.

Move to C++ and work directly with the JSC string's native encoding
(Latin1 or UTF-16) using StringView, StringBuilder, and ICU — same
pattern as stripANSI.cpp.

- New: CaseChange.cpp + CaseChange.h with the case-change algorithm
  templated on Latin1Character/UChar
- Wiring: 11 functions registered directly in bunObjectTable as C++
  host functions
- Cleanup: Deleted string_case.zig and all Zig/C++ bridge wiring
  (icu_toUpper/icu_toLower wrappers)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cirospaciari cirospaciari requested a review from alii as a code owner February 6, 2026 08:57
@robobun
Collaborator

robobun commented Feb 6, 2026

Updated 1:46 AM PT - Feb 6th, 2026

@cirospaciari, your commit a669587 has 3 failures in Build #36644 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 26773

That installs a local version of the PR into your bun-26773 executable, so you can run:

bun-26773 --bun

@github-actions github-actions bot added the claude label Feb 6, 2026
@coderabbitai
Contributor

coderabbitai bot commented Feb 6, 2026

Walkthrough

Adds eleven string case-transformation functions (camelCase, pascalCase, snakeCase, kebabCase, constantCase, dotCase, capitalCase, trainCase, pathCase, sentenceCase, noCase) to Bun. Includes TypeScript declarations, native C++ implementations with character classification and separator logic, registration in the Bun object, and comprehensive test suite validating cross-compatibility.

Changes

| Cohort / File(s) | Summary |
|---|---|
| TypeScript Type Declarations: packages/bun-types/bun.d.ts | Added type signatures for 11 string case-transformation functions to the Bun module, each accepting a string input and returning a string output. |
| C++ Implementation: src/bun.js/bindings/CaseChange.h, src/bun.js/bindings/CaseChange.cpp | Implemented case conversion logic with character classification, word boundary detection, and per-style separators. Created 11 host functions wrapping a templated convertCase function supporting both Latin1 and UTF-16 string inputs. |
| Bun Object Integration: src/bun.js/bindings/BunObject.cpp | Registered 11 new case-transformation functions in the BunObject's export lookup table and included the CaseChange header. |
| Test Suite: test/js/bun/util/case-change.test.ts | Added comprehensive test coverage including compatibility matrix against the change-case library, round-trip conversions, idempotency checks, edge cases (empty strings, unicode, acronyms), and error handling validation. |
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: rewriting case-change functions from Zig to C++ to eliminate UTF-8 round-trips, which aligns with the substantial refactoring across multiple files (CaseChange.cpp/h, BunObject.cpp, and test additions). |
| Description check | ✅ Passed | PR description includes both required sections with clear explanation of changes and test plan, though changelog placeholder is empty. |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/bun.js/bindings/CaseChange.cpp`:
- Around line 236-266: The per-codepoint mapping using u_toupper/u_tolower
inside the loop (see getTransform, WordTransform, u_toupper, u_tolower,
builder.append, CharType/Latin1Character) must be replaced with ICU full-string
case mappings to handle multi-codepoint expansions and context/locale rules;
extract the word slice from input, call u_strToUpper/u_strToLower or use
UCaseMap to transform the whole word (or transform first character + remainder
for Capitalize) into a temporary UTF-16/UTF-32 buffer, then append that
transformed string to builder (handling length changes and encoding conversion)
instead of appending per-codepoint results. Ensure Capitalize semantics use
full-mapping for the first grapheme/character and full-mapping lowercase for the
rest, and thread locale/flags through the UCaseMap calls if applicable.
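
The concern above can be shown with a small, self-contained sketch that does not use ICU. The names `simpleUpper`, `perCodeUnitUpper`, and `fullUpper` are hypothetical, and the single hard-coded special case (U+00DF, ß, whose full uppercase mapping is "SS") stands in for the full Unicode special-casing data that `u_strToUpper` would consult. A one-to-one mapping such as `u_toupper` leaves ß unchanged, so a per-code-point loop can never produce the longer result.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Simple (1:1) mapping: always returns exactly one code unit. ASCII only in
// this sketch. U+00DF has no single-code-point uppercase form, so it passes
// through unchanged, as it would with a simple per-code-point mapping.
char16_t simpleUpper(char16_t c)
{
    if (c >= u'a' && c <= u'z')
        return char16_t(c - (u'a' - u'A'));
    return c;
}

// Per-code-unit loop: output length always equals input length.
std::u16string perCodeUnitUpper(std::u16string_view word)
{
    std::u16string out;
    for (char16_t c : word)
        out.push_back(simpleUpper(c));
    return out;
}

// Full mapping over the whole word: the output may be longer than the input.
// Only the U+00DF -> "SS" expansion is modeled here (an assumption made to
// keep the sketch short; a real implementation would defer to ICU).
std::u16string fullUpper(std::u16string_view word)
{
    std::u16string out;
    for (char16_t c : word) {
        if (c == u'\u00DF')
            out += u"SS";
        else
            out.push_back(simpleUpper(c));
    }
    return out;
}
```

For the input `u"stra\u00DFe"` ("straße"), `perCodeUnitUpper` keeps the ß and returns six code units, while `fullUpper` returns the seven-unit `u"STRASSE"`. Any builder that pre-sizes its output from the input length has to account for that growth.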
