Fix dom.Parse: short internal buffer error for Korean text #4

Open
cjkangme wants to merge 1 commit into go-shiori:master from cjkangme:fix/short-internal-buffer

@cjkangme

Hi, thank you for creating and maintaining this great library.

I recently encountered a transform: short internal buffer error while parsing Korean web pages using markusmobius/go-trafilatura and go-shiori/go-readability, which depend on this repository.

While investigating the error with Claude Code, I found that it originates from the dom.Parse function, so I made a fix for it.

I'm not very familiar with Go, so I was cautious about making changes. However, after testing independently and cross-verifying with multiple AIs, I believe this fix should work correctly. I'd appreciate your review.

Problem

When using dom.Parse() on HTML documents containing large amounts of non-ASCII text (e.g., Korean), the following error occurs:

transform: short internal buffer

Root Cause

transform.Chain shares a single 4KB internal buffer between all transformers. When NFD normalization expands text significantly (common with Asian languages where a single character can decompose into multiple code points), this shared buffer overflows.

For example, the Korean text "삼성전자" grows as follows under NFD decomposition:

  • Original: 4 code points (12 bytes in UTF-8)
  • After NFD: 11 code points (33 bytes in UTF-8). Each precomposed Korean syllable decomposes into 2-3 conjoining jamo ('삼' -> 'ᄉ', 'ᅡ', 'ᆷ')

Since transform.Chain stores intermediate results (after NFD, before the next transformer) in a shared buffer, this jamo decomposition can cause the intermediate data to exceed the 4KB buffer limit.
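To make the expansion concrete, here is a minimal standard-library sketch of the arithmetic Hangul syllable decomposition defined by the Unicode Standard. It is not the code path used by golang.org/x/text/unicode/norm, and the names decomposeHangul and decomposeString are hypothetical helpers written only for this illustration.

```go
package main

import "fmt"

// Constants from the Unicode Standard's Hangul syllable decomposition algorithm.
const (
	sBase  = 0xAC00 // first precomposed syllable
	lBase  = 0x1100 // first leading-consonant jamo
	vBase  = 0x1161 // first vowel jamo
	tBase  = 0x11A7 // one before the first trailing-consonant jamo
	vCount = 21
	tCount = 28
	nCount = vCount * tCount // 588 syllables per leading consonant
	sCount = 19 * nCount     // 11172 precomposed syllables in total
)

// decomposeHangul (hypothetical helper) returns the conjoining jamo for one
// precomposed syllable, or the rune unchanged if it is not one.
func decomposeHangul(r rune) []rune {
	s := r - sBase
	if s < 0 || s >= sCount {
		return []rune{r}
	}
	l := lBase + s/nCount
	v := vBase + (s%nCount)/tCount
	t := tBase + s%tCount
	if t == tBase {
		// No trailing consonant: the syllable yields two jamo.
		return []rune{l, v}
	}
	return []rune{l, v, t}
}

// decomposeString (hypothetical helper) applies decomposeHangul to every rune.
func decomposeString(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, decomposeHangul(r)...)
	}
	return out
}

func main() {
	in := "삼성전자"
	out := decomposeString(in)
	// Prints: 4 code points, 12 bytes -> 11 code points, 33 bytes
	fmt.Printf("%d code points, %d bytes -> %d code points, %d bytes\n",
		len([]rune(in)), len(in), len(out), len(string(out)))
}
```

For text that is mostly Hangul, the intermediate NFD output is therefore close to three times the size of the input, both in code points and in UTF-8 bytes.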

Solution

Use nested NewReader instead of transform.Chain:

// Before (shared 4KB buffer)
transformer := transform.Chain(norm.NFD, runes.Remove(softHyphenSet), norm.NFC)
return transform.NewReader(r, transformer)

// After (independent buffers per transformer)
r = transform.NewReader(r, norm.NFD)
r = transform.NewReader(r, runes.Remove(softHyphenSet))
r = transform.NewReader(r, norm.NFC)
return r

Each transformer now has its own independent buffer, enabling proper streaming processing without buffer overflow.

Replace transform.Chain with nested NewReader in normalizeTextEncoding
to prevent buffer overflow when processing documents with many non-ASCII
characters (e.g., Korean, Japanese, Vietnamese).

transform.Chain shares a single 4KB internal buffer between all
transformers. When NFD normalization significantly expands text
(common with Asian languages), this shared buffer can overflow,
causing "transform: short internal buffer" error.

By using nested NewReader, each transformer gets its own independent
buffer, allowing proper streaming processing without buffer overflow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cjkangme
Author

If you'd like me to add unit tests for this fix, please let me know and I'll be happy to write them.
