Fix dom.Parse: short internal buffer error for korean text by cjkangme · Pull Request #4 · go-shiori/dom

cjkangme · 2026-01-25T08:45:58Z

Hi, thank you for creating and maintaining this great library.

I recently encountered a transform: short internal buffer error while parsing Korean web pages using markusmobius/go-trafilatura and go-shiori/go-readability, which depend on this repository.

While investigating the error with Claude Code, I found that it originates from the dom.Parse function, so I made a fix for it.

I'm not very familiar with Go, so I was cautious about making changes. However, after testing independently and cross-verifying with multiple AIs, I believe this fix should work correctly. I'd appreciate your review.

Problem

When using dom.Parse() on HTML documents containing large amounts of non-ASCII text (e.g., Korean), the following error occurs:

transform: short internal buffer

Root Cause

transform.Chain stages data between its transformers in a fixed-size 4 KB internal buffer. When NFD normalization expands text significantly (common with scripts such as Hangul, where a single precomposed character decomposes into several code points), the intermediate data can outgrow that buffer.

For example, Korean text like "삼성전자" expands significantly during NFD decomposition:

Original: 4 code points (12 bytes in UTF-8)
After NFD: 11 code points (33 bytes in UTF-8). Each Hangul syllable decomposes into 2-3 conjoining jamo ('삼' -> U+1109, U+1161, U+11B7, which render as ㅅ, ㅏ, ㅁ).

Since transform.Chain holds the intermediate result (after NFD, before the next transformer) in that fixed-size buffer, this jamo decomposition can push the intermediate data past the 4 KB limit.
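To make the expansion concrete, here is a minimal standard-library sketch of the Hangul syllable decomposition arithmetic defined by the Unicode Standard. It is only an illustration: decomposeHangul is a hypothetical helper, not part of go-shiori/dom or golang.org/x/text, and it covers precomposed Hangul syllables only rather than full NFD.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Constants from the Unicode Standard's Hangul syllable decomposition algorithm.
const (
	sBase  = 0xAC00 // first precomposed Hangul syllable
	lBase  = 0x1100 // first leading consonant (choseong)
	vBase  = 0x1161 // first vowel (jungseong)
	tBase  = 0x11A7 // one before the first trailing consonant (jongseong)
	vCount = 21
	tCount = 28
	nCount = vCount * tCount // syllables per leading consonant
	sCount = 19 * nCount     // total precomposed syllables
)

// decomposeHangul (hypothetical helper) replaces every precomposed Hangul
// syllable in s with its conjoining jamo and copies all other runes unchanged.
func decomposeHangul(s string) []rune {
	out := make([]rune, 0, len(s))
	for _, r := range s {
		idx := r - sBase
		if idx < 0 || idx >= sCount {
			out = append(out, r) // not a precomposed Hangul syllable
			continue
		}
		out = append(out, lBase+idx/nCount, vBase+(idx%nCount)/tCount)
		if t := idx % tCount; t != 0 {
			out = append(out, tBase+t) // trailing consonant, if present
		}
	}
	return out
}

func main() {
	in := "삼성전자"
	out := decomposeHangul(in)
	fmt.Printf("before: %d code points, %d bytes\n", utf8.RuneCountInString(in), len(in))
	fmt.Printf("after:  %d code points, %d bytes\n", len(out), len(string(out)))
}
```

For "삼성전자" this reports 4 code points (12 bytes) before and 11 code points (33 bytes) after, so the UTF-8 size grows by nearly a factor of three.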

Solution

Use nested NewReader instead of transform.Chain:

// Before (shared 4KB buffer)
transformer := transform.Chain(norm.NFD, runes.Remove(softHyphenSet), norm.NFC)
return transform.NewReader(r, transformer)

// After (independent buffers per transformer)
r = transform.NewReader(r, norm.NFD)
r = transform.NewReader(r, runes.Remove(softHyphenSet))
r = transform.NewReader(r, norm.NFC)
return r

Each transformer now has its own independent buffer, enabling proper streaming processing without buffer overflow.

Replace transform.Chain with nested NewReader in normalizeTextEncoding to prevent buffer overflow when processing documents with many non-ASCII characters (e.g., Korean, Japanese, Vietnamese).

transform.Chain shares a single 4KB internal buffer between all transformers. When NFD normalization significantly expands text (common with Asian languages), this shared buffer can overflow, causing "transform: short internal buffer" error.

By using nested NewReader, each transformer gets its own independent buffer, allowing proper streaming processing without buffer overflow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

cjkangme · 2026-01-25T09:43:21Z

If you'd like me to add unit tests for this fix, please let me know and I'll be happy to write them.

cjkangme wants to merge 1 commit into go-shiori:master from cjkangme:fix/short-internal-buffer