Fix dom.Parse: short internal buffer error for Korean text #4
Open
cjkangme wants to merge 1 commit into go-shiori:master
Replace transform.Chain with nested NewReader in normalizeTextEncoding to prevent buffer overflow when processing documents with many non-ASCII characters (e.g., Korean, Japanese, Vietnamese).

transform.Chain shares a single 4KB internal buffer between all transformers. When NFD normalization significantly expands text (common with Asian languages), this shared buffer can overflow, causing a "transform: short internal buffer" error.

By using nested NewReader, each transformer gets its own independent buffer, allowing proper streaming processing without buffer overflow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
If you'd like me to add unit tests for this fix, please let me know and I'll be happy to write them.
Hi, thank you for creating and maintaining this great library.
I recently encountered a "transform: short internal buffer" error while parsing Korean web pages using markusmobius/go-trafilatura and go-shiori/go-readability, which depend on this repository.
While investigating the error with Claude Code, I found that it originates from the dom.Parse function, so I made a fix for it.
I'm not very familiar with Go, so I was cautious about making changes. However, after testing independently and cross-verifying with multiple AIs, I believe this fix should work correctly. I'd appreciate your review.
Problem
When using dom.Parse() on HTML documents containing large amounts of non-ASCII text (e.g., Korean), the following error occurs:
transform: short internal buffer
Root Cause
transform.Chain shares a single 4KB internal buffer between all transformers. When NFD normalization expands text significantly (common with Asian languages where a single character can decompose into multiple code points), this shared buffer overflows.
For example, NFD decomposes each Hangul syllable in Korean text like "삼성전자" into its constituent jamo, which expands the text significantly.
Since transform.Chain stores intermediate results (after NFD, before the next transformer) in a shared buffer, this jamo decomposition can cause the intermediate data to exceed the 4KB buffer limit.
Solution
Use nested NewReader calls instead of transform.Chain. Each transformer now has its own independent buffer, enabling proper streaming processing without buffer overflow.