ICU-11443 WIP: Link detection according to UTS#58. #3878

Open
arnt wants to merge 1 commit into unicode-org:main from arnt:uts58-link-detection

@arnt arnt commented Feb 27, 2026

This is a UTS58 link detector in Java. I haven't done the C++ side yet.

The new code is in .../LinkDetector.java and supporting files.

I also haven't done the PR submission chores or looked for a JIRA issue. But it's Friday, 17:25, I need to be somewhere at 18:30, and I want to go out feeling the joy of closure.

My rough plan is to look for a JIRA ticket and do other chores on Tuesday, update the Java implementation according to comments, and when the Java implementation looks good to merge I'll write a corresponding ICU4C implementation. At that point I'll also extend the Ruby implementation on which this is based.

For now this is just to give you a look at the code.

CLAassistant commented Feb 27, 2026

CLA assistant check
All committers have signed the CLA.

@arnt arnt changed the title from "WIP: Link detection according to UTS#58." to "ICU-11143 WIP: Link detection according to UTS#58." Mar 3, 2026
@arnt arnt changed the title from "ICU-11143 WIP: Link detection according to UTS#58." to "ICU-11443 WIP: Link detection according to UTS#58." Mar 3, 2026
@markusicu markusicu self-assigned this Mar 5, 2026
@markusicu markusicu requested review from macchiati and markusicu March 5, 2026 17:55
@@ -0,0 +1,71 @@
// © 2025 and later: Unicode, Inc. and others.
Member

Most Unicode properties are supported more directly in ICU, so that additional files and parsing code are not necessary. Need to check with @markusicu as to whether the UTS58 properties are or will be.

Author

Right. If they are, I assume most or all of this can be dropped.

@@ -0,0 +1,210 @@
// © 2025 and later: Unicode, Inc. and others.
Member

This might be better architected as a plain text file matching the table at https://www.iana.org/domains/root/db, with this code changed to just construct a HashSet statically from that file (and maybe no .py file). That way updating the list is a simple drop-in.
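To make the suggestion concrete, here is a minimal sketch of building the set from a plain text file. The class name `TldSet`, the one-entry-per-line layout, the `#` comment convention, and the optional leading dot are all assumptions for illustration; none of them come from the PR or from ICU.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of the PR: loads a TLD list into a set. */
final class TldSet {
    private TldSet() {}

    /**
     * Reads one TLD per line. Blank lines and lines starting with '#' are
     * skipped, and a leading dot (".com") is dropped. This file layout is an
     * assumption, not a published format.
     */
    static Set<String> parse(Reader source) throws IOException {
        Set<String> tlds = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String entry = line.trim();
                if (entry.isEmpty() || entry.startsWith("#")) {
                    continue;
                }
                if (entry.startsWith(".")) {
                    entry = entry.substring(1);
                }
                // Store entries in lower case so lookups can be normalized the same way.
                tlds.add(entry.toLowerCase(Locale.ROOT));
            }
        }
        return Collections.unmodifiableSet(tlds);
    }

    public static void main(String[] args) throws IOException {
        // Inline sample data stands in for the real file.
        Set<String> tlds = parse(new StringReader("# sample\n.com\n.NO\n\n.org\n"));
        System.out.println(tlds.contains("no"));
        System.out.println(tlds.contains("example"));
    }
}
```

In a real class the set would presumably be filled once in a static initializer from a bundled resource, so that updating the list means replacing the text file and nothing else.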

Author

Yes. I optimised for runtime performance. There's much to say for a simple drop-in.

@@ -0,0 +1,444 @@
// © 2016 and later: Unicode, Inc. and others.
Member

Tests like these are better structured as a plain text file that can be deployed across different implementations, including ones in different programming languages.

The file can have a simple structure: a series of inputs and expected outputs, either as a semicolon-delimited file or in a simple format like JSON.
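As a rough illustration of the semicolon-delimited option, the sketch below parses lines of the form `input ; expected link ; expected link ...`. The field layout, the `#` comment convention, and the class name `LinkTestCase` are assumptions made up for this example. Inputs that themselves contain semicolons would need an escaping rule, which is not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical test-data record, not part of the PR. */
final class LinkTestCase {
    final String input;
    final List<String> expectedLinks;

    LinkTestCase(String input, List<String> expectedLinks) {
        this.input = input;
        this.expectedLinks = expectedLinks;
    }

    /**
     * Parses lines of the assumed form "input ; link ; link ...".
     * Blank lines and lines starting with '#' are skipped. A line with only
     * an input field means that no links are expected.
     */
    static List<LinkTestCase> parse(List<String> lines) {
        List<LinkTestCase> cases = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Limit -1 keeps trailing empty fields, so fields[0] always exists.
            String[] fields = trimmed.split(";", -1);
            List<String> expected = new ArrayList<>();
            for (int i = 1; i < fields.length; i++) {
                String link = fields[i].trim();
                if (!link.isEmpty()) {
                    expected.add(link);
                }
            }
            cases.add(new LinkTestCase(fields[0].trim(), expected));
        }
        return cases;
    }

    public static void main(String[] args) {
        List<LinkTestCase> cases = parse(Arrays.asList(
                "# input ; expected links",
                "Visit example.com today ; example.com",
                "no links here"));
        for (LinkTestCase c : cases) {
            System.out.println(c.input + " -> " + c.expectedLinks);
        }
    }
}
```

A test in each implementation would then loop over the parsed cases, run its own detector on `input`, and compare the result with `expectedLinks`.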

@macchiati
Member

BTW, just taking a quick look; will need to dig into the guts of the code later.

4 participants