Specify all presentation sequences #114

Open
wants to merge 9 commits into
base: main

Conversation

@MDLC01
Collaborator

MDLC01 commented Jul 16, 2025

This PR fixes #21, fixes #23, and closes #25, by adding the appropriate variation selectors to symbols that exist in both sym and emoji. I verified that all the variation sequences are defined by Unicode.[1][2]

I have marked this PR as a draft because the next step is to add variation selectors to all symbols that allow it, whether present in both sym and emoji or not, to prevent ambiguity.
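To illustrate what adding a variation selector means at the code point level, here is a minimal sketch. The constant and function names are hypothetical and are not taken from this repository; only the two selector code points (U+FE0E for text presentation, U+FE0F for emoji presentation) come from Unicode.

```rust
/// VS15 (U+FE0E) requests text presentation of the preceding character.
const TEXT_PRESENTATION: char = '\u{FE0E}';
/// VS16 (U+FE0F) requests emoji presentation of the preceding character.
const EMOJI_PRESENTATION: char = '\u{FE0F}';

/// Builds a two-code-point sequence: the base character followed by the
/// given presentation selector. (Hypothetical helper for illustration.)
fn with_selector(base: char, selector: char) -> String {
    let mut s = String::with_capacity(base.len_utf8() + selector.len_utf8());
    s.push(base);
    s.push(selector);
    s
}
```

For example, `with_selector('\u{2764}', TEXT_PRESENTATION)` would give the text-style heart and `with_selector('\u{2764}', EMOJI_PRESENTATION)` the emoji-style one, which is the kind of disambiguation a symbol present in both sym and emoji needs.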

This made me realize that some emojis are poorly named, but improving this is a task for a separate PR.

Related: typst/typst#6489 (comment).

Footnotes

  1. https://www.unicode.org/reports/tr51/#Emoji_Variation_Sequences

  2. https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-variation-sequences.txt

@Enivex
Collaborator

Enivex commented Jul 17, 2025

> This made me realize that some emojis are poorly named,

It's a lot of them. My recent-ish PR only covered a small part.

@T0mstone
Collaborator

I'll be the first to bring up automation here:
Whether we want to automate this in the future or not, I think we should definitely have all of these hard-coded in the source code, since such automation would probably take a lot of time that we shouldn't have to pay on each build.
The automation would then only be an optional verification step like in #9. (And maybe we always enable it in CI)

That means any concerns about such automation should not block this PR.

@MDLC01
Collaborator Author

MDLC01 commented Jul 24, 2025

I had to find a way to be confident in the content of this PR anyway, so I figured I could make tests so that the automation part is already done. I left a commit with the tests failing, so that you can see the output in CI. Notably, there was one mistake introduced in ae98eb0: I added the text presentation selector to sym.dash.wave instead of sym.dash.wave.double.

For now, I query the list of presentation sequences defined by Unicode from the internet every time the tests are run. Feel free to suggest a better way.
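As a rough sketch of how such a test could consume the Unicode list, the function below parses text in the layout used by emoji-variation-sequences.txt (space-separated hex code points, then `;`-separated fields, with `#` starting a comment). The function name is hypothetical and this is not necessarily how the tests in this PR are written.

```rust
use std::collections::HashSet;

/// Parses data lines such as
/// `0023 FE0E  ; text style;  # (1.1) NUMBER SIGN`
/// into the set of sequences they describe. Comment-only lines, blank lines,
/// and lines with invalid code points are skipped.
fn parse_variation_sequences(data: &str) -> HashSet<String> {
    data.lines()
        .filter_map(|line| {
            // Everything after '#' is a comment.
            let content = line.split('#').next()?.trim();
            if content.is_empty() {
                return None;
            }
            // The first ';'-separated field holds the hex code points.
            let field = content.split(';').next()?;
            field
                .split_whitespace()
                .map(|hex| u32::from_str_radix(hex, 16).ok().and_then(char::from_u32))
                .collect::<Option<String>>()
                .filter(|s| !s.is_empty())
        })
        .collect()
}
```

A test could then check that every variant string in the symbol tables that ends in a variation selector is a member of this set, which is the kind of check that would flag a selector attached to the wrong symbol.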

MDLC01 marked this pull request as ready for review July 24, 2025 21:40
MDLC01 changed the title from "Add variation selectors to symbols that exist in both sym and emoji" to "Specify all presentation sequences" Jul 24, 2025
@MDLC01
Collaborator Author

MDLC01 commented Jul 24, 2025

This PR now contains meta changes.

MDLC01 added the meta label ("Discussion about the structure of this repo") Jul 24, 2025
@T0mstone
Collaborator

T0mstone commented Aug 3, 2025

> For now, I query the list of presentation sequences defined by Unicode from the internet every time the tests are run. Feel free to suggest a better way.

How about caching it in a file cache/presentation-sequences.txt and adding /cache/presentation-sequences.txt to the .gitignore?

Also, I'd like the web-request part of this test to be opt-in to begin with (the test can still run without it if the cache file already exists). IMO running cargo test shouldn't perform any web requests without the user's consent. A crate feature for this would also have the advantage of making the reqwest dependency optional.
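The cache-first, opt-in-download idea could look roughly like the sketch below. The function name and the `fetch` parameter are hypothetical; the actual download is left to the caller so that the sketch needs no HTTP dependency, and passing `None` stands in for "network access not enabled".

```rust
use std::{fs, io, path::Path};

/// Returns the cached file contents if the cache exists. Otherwise, calls
/// `fetch` (if one was provided), stores the result in the cache, and
/// returns it. Without a `fetch` callback, a missing cache is an error.
fn load_cached_or_fetch<F>(cache: &Path, fetch: Option<F>) -> io::Result<String>
where
    F: FnOnce() -> io::Result<String>,
{
    match fs::read_to_string(cache) {
        Ok(text) => Ok(text),
        Err(err) if err.kind() == io::ErrorKind::NotFound => {
            let fetch = fetch.ok_or_else(|| {
                io::Error::new(
                    io::ErrorKind::NotFound,
                    "cache file missing and network access was not enabled",
                )
            })?;
            let text = fetch()?;
            // Create the cache directory on first use.
            if let Some(dir) = cache.parent() {
                fs::create_dir_all(dir)?;
            }
            fs::write(cache, &text)?;
            Ok(text)
        }
        Err(err) => Err(err),
    }
}
```

With this shape, a plain `cargo test` run would pass `None` and never touch the network, while an opt-in configuration would pass a closure that performs the request.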

@MDLC01
Collaborator Author

MDLC01 commented Aug 3, 2025

Initially I thought it may be better to download the file as part of a build script for tests only (which would cache it for as long as the source code is not modified), but I'm not sure how to run something only for building tests? Also, maybe this is a bad idea for some other reason. I think your solution is probably better anyway. Maybe the file could even be part of the repo so that it doesn't need to be downloaded every time, but somehow that doesn't feel right to me. Also, that might have some licensing issues (we would probably need to include https://www.unicode.org/license.txt as well).

Additionally, I want to clarify that the dependency on reqwest is for tests only.

@T0mstone
Collaborator

T0mstone commented Aug 3, 2025

> Additionally, I want to clarify that the dependency on reqwest is for tests only.

Yeah I got that, but it's still a huge dep tree that not everyone may want to have to download before running [the rest of] the tests.

@laurmaedje
Member

If we're gonna go with the downloading thing, then ureq would already be a lot smaller than reqwest.

@MDLC01
Collaborator Author

MDLC01 commented Aug 5, 2025

I switched to ureq. Additionally, the file is now pinned to Unicode 16.0.0 so as to prevent sudden breakage when a new Unicode version is released. The file is now downloaded in build.rs. To prevent always having ureq as a build dependency, I hid the tests that require it behind a non-default _test-unicode-conformance feature and added it to CI.
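A minimal sketch of the two pieces a build script would need for this: a version-pinned URL and a check for whether the feature is enabled. All helper names here are hypothetical, and the versioned URL layout is an assumption modeled on the "latest" URL cited earlier in this thread. Cargo exposes enabled features to build scripts as `CARGO_FEATURE_<NAME>` environment variables, with the name uppercased and `-` replaced by `_`.

```rust
/// Unicode version the data file is pinned to (from the comment above).
const UNICODE_VERSION: &str = "16.0.0";

/// Builds the URL of the pinned data file.
/// Assumption: versioned files follow the same layout as the "latest" URL.
fn pinned_url(version: &str) -> String {
    format!("https://www.unicode.org/Public/{version}/ucd/emoji/emoji-variation-sequences.txt")
}

/// Name of the environment variable Cargo sets for an enabled feature
/// while running a build script.
fn feature_env_var(feature: &str) -> String {
    format!("CARGO_FEATURE_{}", feature.to_uppercase().replace('-', "_"))
}

/// Reports whether the conformance feature is enabled, given a way to look
/// up environment variables (injected so the logic can be tested).
fn conformance_enabled(lookup: impl Fn(&str) -> Option<String>) -> bool {
    lookup(&feature_env_var("_test-unicode-conformance")).is_some()
}
```

In a real build.rs, the lookup would be something like `conformance_enabled(|key| std::env::var(key).ok())`, and the download would only run when it returns true.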

@MDLC01
Collaborator Author

MDLC01 commented Aug 5, 2025

According to The Cargo Book, it is not possible to have a build dependency for tests only, so this feature trick is necessary:

> The same applies to cfg(debug_assertions), cfg(test) and cfg(proc_macro). [...] There is currently no way to add dependencies based on these configuration values.

src/lib.rs Outdated
@@ -285,6 +287,8 @@ mod test {
.collect::<HashSet<_>>();
assert!(
are_all_variants_valid(EMOJI, |c| {
// All text presentations are exactly 2 codepoints long as of
Collaborator

should this say "emoji" instead of "text"?

Collaborator Author

Good catch!

Collaborator Author

I changed to "emoji variation sequences" everywhere, as this is how text/emoji presentation sequences seem to be referred to by Unicode.
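For reference, the property the reviewed code comment describes (an emoji-style sequence being exactly two code points long) could be checked with a predicate like the sketch below. The function name is hypothetical and this is not the closure used in src/lib.rs.

```rust
/// Returns true if `s` consists of exactly two code points and the second
/// one is the emoji presentation selector U+FE0F.
fn is_emoji_presentation_sequence(s: &str) -> bool {
    let mut chars = s.chars();
    matches!(
        (chars.next(), chars.next(), chars.next()),
        (Some(_), Some('\u{FE0F}'), None)
    )
}
```

Note that this only checks the shape of the string; whether the base character actually has a defined variation sequence still has to be looked up in the Unicode data file.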

@Enivex
Collaborator

Enivex commented Aug 12, 2025

Would it be possible to have some shorthand that specifies text vs emoji form? \u{FE0E} isn't particularly readable

Labels
meta Discussion about the structure of this repo
4 participants