feat: improve CJK emphasis handling in markdown processing #331
Conversation
Pull Request Overview
This PR improves CJK (Chinese, Japanese, Korean) emphasis handling in markdown processing by inserting zero-width space HTML entities around emphasis markers when they are adjacent to CJK characters without spaces.
- Adds preprocessing to detect CJK characters and insert HTML entities around emphasis markers
- Implements post-processing to remove these entities from final HTML output
- Includes comprehensive handling of code blocks and inline code spans to avoid affecting literal content
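As a rough sketch of the idea (not the PR's actual code; the helper names, the character ranges, and the exact insertion positions are assumptions, and code spans are ignored here), the pre-pass could look roughly like this:

    /// Hypothetical sketch of the preprocessing step: put a zero-width-space
    /// entity between a CJK character and a directly adjacent emphasis marker,
    /// so CommonMark's flanking rules see punctuation instead of CJK text.
    const ZWSP_ENTITY: &str = "&#8203;";

    fn is_cjk(c: char) -> bool {
        matches!(c,
            '\u{3000}'..='\u{303F}'       // CJK symbols and punctuation
                | '\u{3040}'..='\u{30FF}' // Hiragana and Katakana
                | '\u{4E00}'..='\u{9FFF}' // CJK Unified Ideographs
                | '\u{FF00}'..='\u{FFEF}' // halfwidth and fullwidth forms
        )
    }

    fn insert_zwsp(line: &str) -> String {
        let chars: Vec<char> = line.chars().collect();
        let mut out = String::with_capacity(line.len());
        for (i, &c) in chars.iter().enumerate() {
            out.push(c);
            let next = chars.get(i + 1).copied();
            let is_marker = |ch: char| ch == '*' || ch == '_';
            // CJK followed by a marker, or a marker followed by CJK:
            // break the adjacency with an entity that is stripped again later.
            if (is_cjk(c) && next.map_or(false, is_marker))
                || (is_marker(c) && next.map_or(false, is_cjk))
            {
                out.push_str(ZWSP_ENTITY);
            }
        }
        out
    }

The entities resolve to U+200B characters in the rendered HTML, which is what the removal code quoted in the first review comment below is responsible for stripping out again.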
if s.contains("&#8203;") {
    *s = s.replace("&#8203;", "");
}
if s.contains("\u{200B}") {
    *s = s.replace("\u{200B}", "");
}
if s.contains('\u{200B}') {
    *s = s.replace('\u{200B}', "");
}
Copilot AI · Oct 16, 2025
Multiple string replacements create new String instances unnecessarily. Consider using replace_range or a single-pass approach to avoid multiple allocations.
Suggested replacement:
// Patterns to remove
const AMP_ZWSP: &str = "&#8203;";
const ZWSP: &str = "\u{200B}";
const ZWSP_CHAR: char = '\u{200B}';
let mut out = String::with_capacity(s.len());
let mut i = 0;
while i < s.len() {
    if s[i..].starts_with(AMP_ZWSP) {
        i += AMP_ZWSP.len();
    } else if s[i..].starts_with(ZWSP) {
        i += ZWSP.len();
    } else if s[i..].starts_with(ZWSP_CHAR) {
        i += ZWSP_CHAR.len_utf8();
    } else {
        // Get the next char and push it
        let ch = s[i..].chars().next().unwrap();
        out.push(ch);
        i += ch.len_utf8();
    }
}
*s = out;
    continue;
}

let chars: Vec<char> = line.chars().collect();
Copilot AI · Oct 16, 2025
Collecting all characters into a Vec for each line could be memory-intensive for large documents. Consider using the char_indices() iterator to avoid the allocation while still supporting indexing.
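A minimal sketch of that suggestion (using the same hypothetical helper shape as the earlier sketch, not code from this PR): peeking at the following character through a Peekable iterator over char_indices() avoids building a Vec<char> per line.

    fn is_cjk(c: char) -> bool {
        // Same rough ranges as in the earlier sketch (an assumption, not the PR's list).
        matches!(c, '\u{3000}'..='\u{30FF}' | '\u{4E00}'..='\u{9FFF}' | '\u{FF00}'..='\u{FFEF}')
    }

    fn insert_zwsp_no_alloc(line: &str) -> String {
        let mut out = String::with_capacity(line.len());
        let mut it = line.char_indices().peekable();
        while let Some((_, c)) = it.next() {
            out.push(c);
            // Look at the following character without collecting into a Vec.
            if let Some(&(_, next)) = it.peek() {
                let is_marker = |ch: char| ch == '*' || ch == '_';
                if (is_cjk(c) && is_marker(next)) || (is_marker(c) && is_cjk(next)) {
                    out.push_str("&#8203;");
                }
            }
        }
        out
    }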
for line in input.lines() {
    // detect fenced code block boundaries (```...)
    if !in_fence && line.starts_with("```") {
        in_fence = true;
        fence_ticks = line.chars().take_while(|&c| c == '`').count();
        out.push_str(line);
        out.push('\n');
        continue;
    } else if in_fence && line.starts_with(&"`".repeat(fence_ticks)) {
Copilot AI · Oct 16, 2025
The "`".repeat(fence_ticks) call creates a new String allocation on each line check inside fenced blocks. Consider pre-computing this string or using a more efficient comparison method.
Suggested replacement:
let mut fence_str = String::new();
for line in input.lines() {
    // detect fenced code block boundaries (```...)
    if !in_fence && line.starts_with("```") {
        in_fence = true;
        fence_ticks = line.chars().take_while(|&c| c == '`').count();
        fence_str = "`".repeat(fence_ticks);
        out.push_str(line);
        out.push('\n');
        continue;
    } else if in_fence && line.starts_with(&fence_str) {
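An allocation-free alternative to the comparison itself (a sketch under the assumption that fence_ticks holds the length of the opening backtick run) is to count the leading backticks of each line directly; CommonMark closes a fence when the closing run is at least as long as the opening run.

    /// Returns true when `line` closes a fence opened with `fence_ticks` backticks.
    /// (Sketch only; trailing spaces and info strings are ignored here.)
    fn closes_fence(line: &str, fence_ticks: usize) -> bool {
        line.chars().take_while(|&c| c == '`').count() >= fence_ticks
    }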
What do you think about the idea in this PR? Could you please review it?
Hmm, I don't think I'm capable of reviewing this PR. I haven't written any parser before, so I can't offer better advice than an LLM. I have indeed been tortured by this markdown parsing problem, but my usual solution is changing the parser (or installing an extension to the parser) rather than patching it on my own…
Thank you for your comment. I also believe that improving the parser logic is the smart way to address this. Since this is mainly a workaround, I don't feel confident enough to propose it upstream. At the very least, if we can confirm that it doesn't break the documentation, it would be best to first run this experimentally as a workaround within the Japanese community.
I understood the situation and was able to verify the behavior.
However, I’m not as good as Copilot at reviewing my own source code.
I measured the time using the following command:
time mise run generate
The build time hasn’t increased noticeably compared to before.
In typst-docs, we use pulldown-cmark to parse Markdown. This library faithfully implements the CommonMark specification along with several dialects.
However, since the CommonMark spec is designed around languages that separate words with spaces, its emphasis rules behave oddly when applied to CJK text.
Please add the following markup to an appropriate page:
In such cases, to apply emphasis, you need workarounds like inserting spaces or replacing with HTML tags.
To address this, this PR inserts zero-width spaces before Markdown parsing and removes them again afterwards, so emphasis can be applied naturally without such workarounds.
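As a rough illustration (a sketch, not part of this PR; the sample strings are assumptions, and the exact output depends on the pulldown-cmark version and how it applies the flanking rules around entities), the first call below may leave the asterisks unparsed because the closing ** touches the CJK full stop, while the second, with zero-width-space entities inserted next to the delimiters, is expected to produce real <strong> markup:

    use pulldown_cmark::{html, Parser};

    fn render(md: &str) -> String {
        let mut out = String::new();
        html::push_html(&mut out, Parser::new(md));
        out
    }

    fn main() {
        // Closing ** preceded by CJK punctuation (。) and followed by text:
        // CommonMark's flanking rules reject it, so the ** may stay literal.
        println!("{}", render("この**テスト。**終わり"));

        // With zero-width-space entities next to the delimiters, the flanking
        // rules see punctuation (& and ;) instead, so the emphasis is recognised.
        // The PR's post-processing then strips the resulting U+200B characters.
        println!("{}", render("この**テスト。&#8203;**&#8203;終わり"));
    }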
References
** がたまに認識されないので緩和用プラグイン作った #JavaScript - Qiita ("** is sometimes not recognized, so I made a mitigation plugin")