Use UTF-8 in item metadata and JSON serialization #17010

Open
ryvnf wants to merge 2 commits into luanti-org:master from ryvnf:serialize-with-utf8

Conversation

@ryvnf
Contributor

@ryvnf ryvnf commented Mar 9, 2026

Goal of the PR

Reduce the size of itemstrings that contain non-ASCII Unicode characters. On master, for example, "☺" is encoded as "\u00e2\u0098\u00ba" in itemstrings. That is 18 bytes instead of the 3 bytes of its UTF-8 encoding, so 6 times larger.
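
To make the size difference concrete, here is a minimal C++ sketch of the per-byte escaping described above. `escape_bytes_legacy` is a hypothetical helper written for illustration, not the engine's actual function. It only assumes that every byte at or above 0x80 is written as a six-character \u00XX sequence.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not engine code): mimics the escaping described for
// master, where each byte >= 0x80 becomes a six-character \u00XX sequence.
std::string escape_bytes_legacy(const std::string &in)
{
	std::string out;
	for (unsigned char c : in) {
		if (c < 0x80) {
			// ASCII bytes are copied unchanged in this sketch.
			out += static_cast<char>(c);
		} else {
			char buf[7]; // six characters plus the terminating NUL
			std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
			out += buf;
		}
	}
	return out;
}
```

Passing the UTF-8 encoding of "☺" (`"\xe2\x98\xba"`, 3 bytes) through this helper yields an 18-character string, which matches the factor of 6 mentioned above.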

How does the PR work

  1. Makes core.write_json emit UTF-8 by enabling the emitUTF8 setting
  2. Removes the old code that transformed non-ASCII characters into \u00XX for item and node metadata
  3. core.serialize already preserves UTF-8 encoded text, so no change was necessary there
  4. ASCII control characters are still escaped in metadata using \u00XX, so metadata separators like \x01 and \x02 can be nested
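
As a rough illustration of point 4, the sketch below escapes only ASCII control characters and passes every other byte through untouched. `escape_control_chars` is a hypothetical name, and treating "control character" as "byte below 0x20" is an assumption made for this example. The real engine code may draw the line differently or escape additional characters.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not engine code): escapes bytes below 0x20 as \u00XX
// and copies everything else, including UTF-8 multi-byte sequences, as-is.
std::string escape_control_chars(const std::string &in)
{
	std::string out;
	for (unsigned char c : in) {
		if (c < 0x20) {
			char buf[7]; // six characters plus the terminating NUL
			std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
			out += buf;
		} else {
			out += static_cast<char>(c);
		}
	}
	return out;
}
```

Under these assumptions a value containing the separator byte \x01 no longer contains that raw byte after escaping, so it can be embedded in an outer metadata string, while "☺" stays at 3 bytes.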

Does it resolve any reported issue?

Implements what was suggested in #17007

Does this relate to a goal in the roadmap?

Probably not.

If not a bug fix, why is this PR needed? What usecases does it solve?

  1. Makes it more difficult for users to spam Unicode characters in item metadata, such as the text of books, to cause lag and disrupt servers.
  2. Makes the storage size of metadata truncated with string.len consistent. For example, in Minetest Game the amount of text a user can enter into a book is limited to 10 kB using string.len. If a user decides to spam the book with ☺, they can create a book itemstring which contains 60 kB instead of the 10 kB you would expect.
  3. Generally reduces the amount of data that needs to be sent over the network.
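
The book example in point 2 can be worked through as plain arithmetic. The sketch below assumes the limit is exactly 10000 bytes and that the book contains nothing but "☺". The function names are made up for this illustration.

```cpp
#include <cstddef>

// "☺" (U+263A) occupies three bytes in UTF-8.
constexpr std::size_t kUtf8BytesPerSmiley = 3;
// On master each of those bytes is written as a six-character escape.
constexpr std::size_t kEscapedCharsPerByte = 6;

// How many "☺" fit under a byte limit enforced with string.len.
constexpr std::size_t max_smileys(std::size_t byte_limit)
{
	return byte_limit / kUtf8BytesPerSmiley;
}

// Serialized size of n "☺" when every byte is escaped (behaviour on master).
constexpr std::size_t escaped_size(std::size_t n)
{
	return n * kUtf8BytesPerSmiley * kEscapedCharsPerByte;
}

// Serialized size of n "☺" when UTF-8 is emitted as-is (this PR).
constexpr std::size_t utf8_size(std::size_t n)
{
	return n * kUtf8BytesPerSmiley;
}
```

With a 10000-byte limit, 3333 "☺" fit into the book. Escaped per byte they serialize to 59994 bytes (roughly 60 kB), while emitting UTF-8 directly keeps them at 9999 bytes.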

If you have used an LLM/AI to help with code or assets, you must disclose this.

I have not.

Todo

  • Fix failing unit test
  • Verify itemstrings with metadata do not get corrupted when nesting (tested using Mineclonia enchanted items inside shulker boxes)

How to test

Do the following in devtest

  • Fetch the "Item Meta Editor" from the bag of everything
  • Use it with an item placed next to it, and add metadata with a key and value containing Unicode characters like ☺♥★
  • Verify it works
  • Use the new dump_itemstring command and verify the output contains "☺" instead of "\u00e2\u0098\u00ba"
  • Use the "Node Meta Editor" on a node and store Unicode characters
  • Quit the game, restart, and verify that all data is still there

There may be other things that need testing that I have not considered.

@ryvnf ryvnf force-pushed the serialize-with-utf8 branch from 310a693 to 098db59 Compare March 9, 2026 22:33
@ryvnf ryvnf changed the title Use utf8 in metadata and JSON serialization Use UTF-8 in metadata and JSON serialization Mar 9, 2026
@sfan5 sfan5 added the "Action / change needed" label Mar 9, 2026
@ryvnf ryvnf changed the title Use UTF-8 in metadata and JSON serialization WIP: Use UTF-8 in metadata and JSON serialization Mar 9, 2026
@ryvnf ryvnf marked this pull request as draft March 9, 2026 22:51
@Zughy Zughy added the "Feature ✨", "Roadmap: Needs approval" and "@ Script API" labels Mar 10, 2026
@ryvnf ryvnf force-pushed the serialize-with-utf8 branch from 098db59 to 2bbf30d Compare March 10, 2026 18:21
@sfan5 sfan5 added the "Roadmap: supported by core dev" label and removed the "Action / change needed" and "Roadmap: Needs approval" labels Mar 10, 2026
@ryvnf ryvnf force-pushed the serialize-with-utf8 branch from 2bbf30d to 25cfddc Compare March 10, 2026 18:52
@ryvnf
Contributor Author

ryvnf commented Mar 10, 2026

I also added a dump_itemstring command to devtest. I am not 100% sure whether it belongs in master. I added it because without it you cannot check how the string is encoded. It could be removed after testing but before merging if that is preferable.

@ryvnf ryvnf marked this pull request as ready for review March 10, 2026 19:30
@ryvnf ryvnf changed the title WIP: Use UTF-8 in metadata and JSON serialization Use UTF-8 in metadata and JSON serialization Mar 10, 2026
@ryvnf
Contributor Author

ryvnf commented Mar 10, 2026

The test failed because of a 500 Internal Server Error from github.com, which is unrelated to the changes. I cannot restart the test run myself.

Edit: appears to have fixed itself

Member

@sfan5 sfan5 left a comment

LGTM

@ryvnf ryvnf force-pushed the serialize-with-utf8 branch from 25cfddc to 803da3c Compare March 11, 2026 16:32
@sfan5 sfan5 changed the title Use UTF-8 in metadata and JSON serialization Use UTF-8 in item metadata and JSON serialization Mar 11, 2026
Member

@SmallJoker SmallJoker left a comment

Works, tested with #12167 (comment) using the sample string:

local str = "baum\xe4\x01\xF5\xA3birne"

and libjsoncpp26, version 1.9.6-3

Result

it's the same
str:        	98	97	117	109	228	1	245	163	98	105	114	110	101
str_loaded: 	98	97	117	109	228	1	245	163	98	105	114	110	101

Works.

Labels

Feature ✨, Roadmap: supported by core dev, @ Script API, >= Two approvals ✅ ✅

4 participants