feat: improve CJK emphasis handling in markdown processing #331
Conversation
Pull Request Overview
This PR improves CJK (Chinese, Japanese, Korean) emphasis handling in markdown processing by inserting zero-width space HTML entities around emphasis markers when they are adjacent to CJK characters without spaces.
- Adds preprocessing to detect CJK characters and insert HTML entities around emphasis markers
- Implements post-processing to remove these entities from final HTML output
- Includes comprehensive handling of code blocks and inline code spans to avoid affecting literal content
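As a rough sketch of the idea (not the PR's actual code; the helper names, the character ranges, and the exact insertion positions are assumptions, and code spans are ignored here), the pre-pass could look roughly like this:

    /// Hypothetical sketch of the preprocessing step: put a zero-width-space
    /// entity between a CJK character and a directly adjacent emphasis marker,
    /// so CommonMark's flanking rules see punctuation instead of CJK text.
    const ZWSP_ENTITY: &str = "&#8203;";

    fn is_cjk(c: char) -> bool {
        matches!(c,
            '\u{3000}'..='\u{303F}'       // CJK symbols and punctuation
                | '\u{3040}'..='\u{30FF}' // Hiragana and Katakana
                | '\u{4E00}'..='\u{9FFF}' // CJK Unified Ideographs
                | '\u{FF00}'..='\u{FFEF}' // halfwidth and fullwidth forms
        )
    }

    fn insert_zwsp(line: &str) -> String {
        let chars: Vec<char> = line.chars().collect();
        let mut out = String::with_capacity(line.len());
        for (i, &c) in chars.iter().enumerate() {
            out.push(c);
            let next = chars.get(i + 1).copied();
            let is_marker = |ch: char| ch == '*' || ch == '_';
            // CJK followed by a marker, or a marker followed by CJK:
            // break the adjacency with an entity that is stripped again later.
            if (is_cjk(c) && next.map_or(false, is_marker))
                || (is_marker(c) && next.map_or(false, is_cjk))
            {
                out.push_str(ZWSP_ENTITY);
            }
        }
        out
    }

The entities resolve to U+200B characters in the rendered HTML, which is what the removal code quoted in the first review comment below is responsible for stripping out again.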
if s.contains("&#8203;") {
    *s = s.replace("&#8203;", "");
}
if s.contains("\u{200B}") {
    *s = s.replace("\u{200B}", "");
}
if s.contains('\u{200B}') {
    *s = s.replace('\u{200B}', "");
}
Copilot AI · Oct 16, 2025
Multiple string replacements create new String instances unnecessarily. Consider using replace_range or a single-pass approach to avoid multiple allocations.
Suggested replacement:
// Patterns to remove
const AMP_ZWSP: &str = "&#8203;";
const ZWSP: &str = "\u{200B}";
const ZWSP_CHAR: char = '\u{200B}';
let mut out = String::with_capacity(s.len());
let mut i = 0;
while i < s.len() {
    if s[i..].starts_with(AMP_ZWSP) {
        i += AMP_ZWSP.len();
    } else if s[i..].starts_with(ZWSP) {
        i += ZWSP.len();
    } else if s[i..].starts_with(ZWSP_CHAR) {
        i += ZWSP_CHAR.len_utf8();
    } else {
        // Get the next char and push it
        let ch = s[i..].chars().next().unwrap();
        out.push(ch);
        i += ch.len_utf8();
    }
}
*s = out;
    continue;
}

let chars: Vec<char> = line.chars().collect();
Copilot AI · Oct 16, 2025
Collecting all characters into a Vec for each line could be memory-intensive for large documents. Consider using the char_indices() iterator to avoid the allocation while still supporting indexing.
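A minimal sketch of that suggestion (using the same hypothetical helper shape as the earlier sketch, not code from this PR): peeking at the following character through a Peekable iterator over char_indices() avoids building a Vec<char> per line.

    fn is_cjk(c: char) -> bool {
        // Same rough ranges as in the earlier sketch (an assumption, not the PR's list).
        matches!(c, '\u{3000}'..='\u{30FF}' | '\u{4E00}'..='\u{9FFF}' | '\u{FF00}'..='\u{FFEF}')
    }

    fn insert_zwsp_no_alloc(line: &str) -> String {
        let mut out = String::with_capacity(line.len());
        let mut it = line.char_indices().peekable();
        while let Some((_, c)) = it.next() {
            out.push(c);
            // Look at the following character without collecting into a Vec.
            if let Some(&(_, next)) = it.peek() {
                let is_marker = |ch: char| ch == '*' || ch == '_';
                if (is_cjk(c) && is_marker(next)) || (is_marker(c) && is_cjk(next)) {
                    out.push_str("&#8203;");
                }
            }
        }
        out
    }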
for line in input.lines() {
    // detect fenced code block boundaries (```...)
    if !in_fence && line.starts_with("```") {
        in_fence = true;
        fence_ticks = line.chars().take_while(|&c| c == '`').count();
        out.push_str(line);
        out.push('\n');
        continue;
    } else if in_fence && line.starts_with(&"`".repeat(fence_ticks)) {
Copilot AI · Oct 16, 2025
The "`".repeat(fence_ticks) call creates a new String allocation on each line check inside fenced blocks. Consider pre-computing this string or using a more efficient comparison method.
Suggested replacement:
let mut fence_str = String::new();
for line in input.lines() {
    // detect fenced code block boundaries (```...)
    if !in_fence && line.starts_with("```") {
        in_fence = true;
        fence_ticks = line.chars().take_while(|&c| c == '`').count();
        fence_str = "`".repeat(fence_ticks);
        out.push_str(line);
        out.push('\n');
        continue;
    } else if in_fence && line.starts_with(&fence_str) {
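An allocation-free alternative to the comparison itself (a sketch under the assumption that fence_ticks holds the length of the opening backtick run) is to count the leading backticks of each line directly; CommonMark closes a fence when the closing run is at least as long as the opening run.

    /// Returns true when `line` closes a fence opened with `fence_ticks` backticks.
    /// (Sketch only; trailing spaces and info strings are ignored here.)
    fn closes_fence(line: &str, fence_ticks: usize) -> bool {
        line.chars().take_while(|&c| c == '`').count() >= fence_ticks
    }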
What do you think about the idea in this PR? Could you please review it?
Hmm, I don't think I'm capable of reviewing this PR. I haven't written any parser before, so I can't offer better advice than an LLM. I have indeed been tortured by this markdown parsing problem, but my usual solution is changing the parser (or installing an extension to the parser) rather than patching it on my own…
Thank you for your comment. I also believe that improving the parser logic is the smart way to address this. Since this is mainly a workaround, I don't feel confident enough to propose it upstream. At the very least, if we can confirm that it doesn't break the documentation, it would be best to first run this experimentally as a workaround within the Japanese community.
I understood the situation and was able to verify the behavior.
However, I’m not as good as Copilot at reviewing my own source code.
I measured the time using the following command:
time mise run generate
The build time hasn’t increased noticeably compared to before.
In typst-docs, we use pulldown-cmark to parse Markdown. This library faithfully implements the CommonMark specification along with several dialects.
However, since the CommonMark spec is designed around languages that separate words with spaces, its emphasis rules behave oddly when applied to CJK text.
Please add the following markup to an appropriate page:
In such cases, to apply emphasis, you need workarounds like inserting spaces or replacing with HTML tags.
To address this, this PR inserts zero-width spaces before Markdown parsing and removes them again afterwards, so emphasis can be applied naturally without such workarounds.
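As a rough illustration (a sketch, not part of this PR; the sample strings are assumptions, and the exact output depends on the pulldown-cmark version and how it applies the flanking rules around entities), the first call below may leave the asterisks unparsed because the closing ** touches the CJK full stop, while the second, with zero-width-space entities inserted next to the delimiters, is expected to produce real <strong> markup:

    use pulldown_cmark::{html, Parser};

    fn render(md: &str) -> String {
        let mut out = String::new();
        html::push_html(&mut out, Parser::new(md));
        out
    }

    fn main() {
        // Closing ** preceded by CJK punctuation (。) and followed by text:
        // CommonMark's flanking rules reject it, so the ** may stay literal.
        println!("{}", render("この**テスト。**終わり"));

        // With zero-width-space entities next to the delimiters, the flanking
        // rules see punctuation (& and ;) instead, so the emphasis is recognised.
        // The PR's post-processing then strips the resulting U+200B characters.
        println!("{}", render("この**テスト。&#8203;**&#8203;終わり"));
    }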
References
** がたまに認識されないので緩和用プラグイン作った #JavaScript - Qiita ("** is sometimes not recognized, so I made a mitigation plugin")