-
Notifications
You must be signed in to change notification settings - Fork 34
feat: Add PDF utility package with functionalities for HTML conversio… #225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…n, text extraction, merging, and splitting, along with corresponding tests and documentation.
Signed-off-by: Abin Antu <abincantug@gmail.com>
WalkthroughIntroduces a new PDF utilities package with four public functions for HTML-to-PDF conversion, text extraction, merging, and splitting PDFs. Includes comprehensive documentation, examples, tests, and updates Go toolchain to 1.24.1 with corresponding dependency updates. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@go.mod`:
- Around line 15-28: The go.mod lists github.com/go-pdf/fpdf,
github.com/ledongthuc/pdf, and github.com/pdfcpu/pdfcpu as indirect even though
they are directly imported in pdf/pdf.go and pdf/pdf_test.go; fix by running "go
mod tidy" to regenerate module requirements so those packages are recorded as
direct requirements (or manually remove the "// indirect" comments and update
the require block), ensuring the imports in pdf/pdf.go and pdf/pdf_test.go match
the go.mod entries.
In `@pdf/pdf_test.go`:
- Line 20: The test uses string(rune('0'+i)) inside the p.CellFormat call which
only produces correct single digits for i<10; replace that expression with a
formatted decimal using fmt.Sprintf (e.g., build the page text with
fmt.Sprintf("Page content for page %d", i)) and add "fmt" to the import block so
the test prints correct page numbers for i >= 10; update the p.CellFormat
invocation in pdf_test.go accordingly.
In `@pdf/pdf.go`:
- Around line 64-127: renderNode currently resets fonts to a hard-coded
Arial/""/12 which breaks nested inline styles; fix by adding a font state stack
(e.g., type fontState { family, style string; size float64 }) and modify
renderNode to push the current fontState when entering an element (use for
inline tags like "b","strong","i","em","a") and pop/restore it after processing
children, updating calls that set fonts (SetFont) to update the top of the stack
rather than directly resetting to Arial/""/12; ensure renderImage and
block-level resets still set an explicit base state when appropriate.
- Around line 194-212: MergePDFs currently mutates the caller's input slice by
writing cleaned paths back into inputFiles[i]; instead, create and use a new
local slice (e.g., cleanedInputs or inputsCopy) to store filepath.Clean results
and file existence checks, leaving the original inputFiles untouched, and then
pass that new slice into api.MergeCreateFile; refer to the MergePDFs function
and the inputFiles variable when applying this change.
🧹 Nitpick comments (3)
pdf/pdf_test.go (1)
275-376: Good coverage forSplitPDF.Covers single range, multiple ranges, single page, empty/invalid/reversed ranges, and nonexistent file handling. Consider adding a test for a page range that exceeds the PDF's actual page count to verify how
pdfcpu.TrimFilebehaves in that scenario.pdf/pdf.go (1)
130-152:renderImagesilently swallows errors — consider logging or returning an error indicator.When
srcis empty (line 139) or the file doesn't exist (line 146), the function silently returns without any indication to the caller. For a utility library, silently skipping broken image references may surprise users. Consider at minimum documenting this behavior, or returning a boolean/error thatrenderNodecan act on.Also, note that
filepath.Cleanalone does not restrict the path to a safe directory — if untrusted HTML is ever processed, this could read arbitrary local files. Worth documenting thatConvertHTMLToPDFshould only be used with trusted HTML content.pdf/README.md (1)
1-12: LGTM — consider adding a brief "Limitations" section.The README accurately documents the four functions. You may want to add a short note about limitations (no CSS/JS support, basic HTML only) to set expectations, since
ConvertHTMLToPDF's godoc mentions these constraints but the README doesn't.
| require ( | ||
| github.com/clipperhouse/uax29/v2 v2.2.0 // indirect | ||
| github.com/go-pdf/fpdf v0.9.0 // indirect | ||
| github.com/hhrutter/lzw v1.0.0 // indirect | ||
| github.com/hhrutter/pkcs7 v0.2.0 // indirect | ||
| github.com/hhrutter/tiff v1.0.2 // indirect | ||
| github.com/ledongthuc/pdf v0.0.0-20250511090121-5959a4027728 // indirect | ||
| github.com/mattn/go-runewidth v0.0.19 // indirect | ||
| github.com/pdfcpu/pdfcpu v0.11.1 // indirect | ||
| github.com/pkg/errors v0.9.1 // indirect | ||
| github.com/rivo/uniseg v0.4.7 // indirect | ||
| golang.org/x/crypto v0.46.0 // indirect | ||
| gopkg.in/yaml.v2 v2.4.0 // indirect | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify which packages are directly imported by non-test Go files in the pdf/ directory
echo "=== Go files in pdf/ directory ==="
find pdf -name "*.go" -type f | head -20
echo ""
echo "=== Direct imports in pdf/pdf.go ==="
if [ -f "pdf/pdf.go" ]; then
rg -n '^\s+"github\.com/' pdf/pdf.go | head -30
else
echo "pdf/pdf.go not found, checking for any Go files in pdf/"
find pdf -name "*.go" -type f -exec head -50 {} \;
fi
echo ""
echo "=== Current go.mod require block ==="
sed -n '/^require (/,/^)/p' go.modRepository: kashifkhan0771/utils
Length of output: 1130
🏁 Script executed:
#!/bin/bash
echo "=== Direct imports in pdf/helpers.go ==="
if [ -f "pdf/helpers.go" ]; then
rg -n '^\s+"github\.com/' pdf/helpers.go
fi
echo ""
echo "=== Direct imports in pdf/pdf_test.go ==="
if [ -f "pdf/pdf_test.go" ]; then
rg -n '^\s+"github\.com/' pdf/pdf_test.go
fiRepository: kashifkhan0771/utils
Length of output: 175
Direct dependencies incorrectly marked as // indirect.
github.com/go-pdf/fpdf, github.com/ledongthuc/pdf, and github.com/pdfcpu/pdfcpu are directly imported in pdf/pdf.go (lines 17–20) and pdf/pdf_test.go (line 9), yet they are listed with // indirect comments in go.mod. Direct imports should not be marked as indirect dependencies.
Running go mod tidy will automatically correct these classifications and move them to the main require block.
🤖 Prompt for AI Agents
In `@go.mod` around lines 15 - 28, The go.mod lists github.com/go-pdf/fpdf,
github.com/ledongthuc/pdf, and github.com/pdfcpu/pdfcpu as indirect even though
they are directly imported in pdf/pdf.go and pdf/pdf_test.go; fix by running "go
mod tidy" to regenerate module requirements so those packages are recorded as
direct requirements (or manually remove the "// indirect" comments and update
the require block), ensuring the imports in pdf/pdf.go and pdf/pdf_test.go match
the go.mod entries.
| for i := 1; i <= pages; i++ { | ||
| p.AddPage() | ||
| p.SetFont("Arial", "", 12) | ||
| p.CellFormat(0, 10, "Page content for page "+string(rune('0'+i)), "", 1, "", false, 0, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string(rune('0'+i)) breaks for page counts > 9.
This produces the correct digit character only for i in 1–9. For i >= 10, it yields non-digit characters (e.g., ':' for 10). Use fmt.Sprintf instead.
Proposed fix
- p.CellFormat(0, 10, "Page content for page "+string(rune('0'+i)), "", 1, "", false, 0, "")
+ p.CellFormat(0, 10, fmt.Sprintf("Page content for page %d", i), "", 1, "", false, 0, "")You'll also need to add "fmt" to the import block.
🤖 Prompt for AI Agents
In `@pdf/pdf_test.go` at line 20, The test uses string(rune('0'+i)) inside the
p.CellFormat call which only produces correct single digits for i<10; replace
that expression with a formatted decimal using fmt.Sprintf (e.g., build the page
text with fmt.Sprintf("Page content for page %d", i)) and add "fmt" to the
import block so the test prints correct page numbers for i >= 10; update the
p.CellFormat invocation in pdf_test.go accordingly.
| func renderNode(p *fpdf.Fpdf, n *html.Node) { | ||
| if n.Type == html.TextNode { | ||
| text := strings.TrimSpace(n.Data) | ||
| if text != "" { | ||
| p.Write(6, text+" ") | ||
| } | ||
| return | ||
| } | ||
|
|
||
| if n.Type == html.ElementNode { | ||
| switch n.Data { | ||
| case "h1": | ||
| p.Ln(4) | ||
| p.SetFont("Arial", "B", 24) | ||
| case "h2": | ||
| p.Ln(4) | ||
| p.SetFont("Arial", "B", 20) | ||
| case "h3": | ||
| p.Ln(3) | ||
| p.SetFont("Arial", "B", 16) | ||
| case "h4": | ||
| p.Ln(3) | ||
| p.SetFont("Arial", "B", 14) | ||
| case "h5": | ||
| p.Ln(2) | ||
| p.SetFont("Arial", "B", 12) | ||
| case "h6": | ||
| p.Ln(2) | ||
| p.SetFont("Arial", "B", 10) | ||
| case "p": | ||
| p.Ln(4) | ||
| p.SetFont("Arial", "", 12) | ||
| case "b", "strong": | ||
| p.SetFont("Arial", "B", 12) | ||
| case "i", "em": | ||
| p.SetFont("Arial", "I", 12) | ||
| case "br": | ||
| p.Ln(6) | ||
| case "a": | ||
| p.SetFont("Arial", "U", 12) | ||
| case "img": | ||
| renderImage(p, n) | ||
| return | ||
| } | ||
| } | ||
|
|
||
| for c := n.FirstChild; c != nil; c = c.NextSibling { | ||
| renderNode(p, c) | ||
| } | ||
|
|
||
| // Reset font after block-level elements | ||
| if n.Type == html.ElementNode { | ||
| switch n.Data { | ||
| case "h1", "h2", "h3", "h4", "h5", "h6": | ||
| p.Ln(4) | ||
| p.SetFont("Arial", "", 12) | ||
| case "p": | ||
| p.Ln(4) | ||
| p.SetFont("Arial", "", 12) | ||
| case "b", "strong", "i", "em", "a": | ||
| p.SetFont("Arial", "", 12) | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nested inline elements break font state.
renderNode doesn't save/restore font state — it hard-codes resets to Arial/""/12. This means nested formatting like <p><b>bold <i>bold-italic</i> bold</b></p> won't render correctly: exiting <i> resets to regular (losing bold), and exiting <b> resets again.
A lightweight fix is to track the current font style in a struct and push/pop it on element entry/exit. Alternatively, document this as a known limitation if nested inline formatting isn't a priority.
Minimal sketch using a style stack
// At the top of pdf.go or in a separate helper:
type fontState struct {
family string
style string
size float64
}
// In renderNode, before processing children of inline elements,
// push the current state and pop it after children are processed.This is a non-trivial change, so an acceptable alternative is to document the limitation clearly in the function's godoc and README.
🤖 Prompt for AI Agents
In `@pdf/pdf.go` around lines 64 - 127, renderNode currently resets fonts to a
hard-coded Arial/""/12 which breaks nested inline styles; fix by adding a font
state stack (e.g., type fontState { family, style string; size float64 }) and
modify renderNode to push the current fontState when entering an element (use
for inline tags like "b","strong","i","em","a") and pop/restore it after
processing children, updating calls that set fonts (SetFont) to update the top
of the stack rather than directly resetting to Arial/""/12; ensure renderImage
and block-level resets still set an explicit base state when appropriate.
| func MergePDFs(inputFiles []string, outputFile string) error { | ||
| if len(inputFiles) < 2 { | ||
| return fmt.Errorf("at least two input files are required for merging") | ||
| } | ||
|
|
||
| outputFile = filepath.Clean(outputFile) | ||
|
|
||
| for i, f := range inputFiles { | ||
| inputFiles[i] = filepath.Clean(f) | ||
| if _, err := os.Stat(inputFiles[i]); err != nil { | ||
| return fmt.Errorf("input file %q does not exist: %w", inputFiles[i], err) | ||
| } | ||
| } | ||
|
|
||
| conf := model.NewDefaultConfiguration() | ||
| conf.ValidationMode = model.ValidationRelaxed | ||
|
|
||
| return api.MergeCreateFile(inputFiles, outputFile, false, conf) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MergePDFs mutates the caller's input slice.
Line 202 writes back into inputFiles[i], which modifies the caller's slice in-place. This is an unexpected side effect for a public API.
Proposed fix: work on a copy
func MergePDFs(inputFiles []string, outputFile string) error {
if len(inputFiles) < 2 {
return fmt.Errorf("at least two input files are required for merging")
}
outputFile = filepath.Clean(outputFile)
- for i, f := range inputFiles {
- inputFiles[i] = filepath.Clean(f)
- if _, err := os.Stat(inputFiles[i]); err != nil {
- return fmt.Errorf("input file %q does not exist: %w", inputFiles[i], err)
+ cleaned := make([]string, len(inputFiles))
+ for i, f := range inputFiles {
+ cleaned[i] = filepath.Clean(f)
+ if _, err := os.Stat(cleaned[i]); err != nil {
+ return fmt.Errorf("input file %q does not exist: %w", cleaned[i], err)
}
}
conf := model.NewDefaultConfiguration()
conf.ValidationMode = model.ValidationRelaxed
- return api.MergeCreateFile(inputFiles, outputFile, false, conf)
+ return api.MergeCreateFile(cleaned, outputFile, false, conf)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func MergePDFs(inputFiles []string, outputFile string) error { | |
| if len(inputFiles) < 2 { | |
| return fmt.Errorf("at least two input files are required for merging") | |
| } | |
| outputFile = filepath.Clean(outputFile) | |
| for i, f := range inputFiles { | |
| inputFiles[i] = filepath.Clean(f) | |
| if _, err := os.Stat(inputFiles[i]); err != nil { | |
| return fmt.Errorf("input file %q does not exist: %w", inputFiles[i], err) | |
| } | |
| } | |
| conf := model.NewDefaultConfiguration() | |
| conf.ValidationMode = model.ValidationRelaxed | |
| return api.MergeCreateFile(inputFiles, outputFile, false, conf) | |
| } | |
| func MergePDFs(inputFiles []string, outputFile string) error { | |
| if len(inputFiles) < 2 { | |
| return fmt.Errorf("at least two input files are required for merging") | |
| } | |
| outputFile = filepath.Clean(outputFile) | |
| cleaned := make([]string, len(inputFiles)) | |
| for i, f := range inputFiles { | |
| cleaned[i] = filepath.Clean(f) | |
| if _, err := os.Stat(cleaned[i]); err != nil { | |
| return fmt.Errorf("input file %q does not exist: %w", cleaned[i], err) | |
| } | |
| } | |
| conf := model.NewDefaultConfiguration() | |
| conf.ValidationMode = model.ValidationRelaxed | |
| return api.MergeCreateFile(cleaned, outputFile, false, conf) | |
| } |
🤖 Prompt for AI Agents
In `@pdf/pdf.go` around lines 194 - 212, MergePDFs currently mutates the caller's
input slice by writing cleaned paths back into inputFiles[i]; instead, create
and use a new local slice (e.g., cleanedInputs or inputsCopy) to store
filepath.Clean results and file existence checks, leaving the original
inputFiles untouched, and then pass that new slice into api.MergeCreateFile;
refer to the MergePDFs function and the inputFiles variable when applying this
change.
Description
Introduces a new
pdfpackage providing essential PDF manipulation utilities for the library:<h1>–<h6>), paragraphs, bold/italic, links, line breaks, and local images."1-3","5","7-10").New Dependencies
github.com/pdfcpu/pdfcpugithub.com/ledongthuc/pdfgithub.com/go-pdf/fpdfFiles Added
Files Modified
pdfrow to utility packages tableChecklist
make test✅make lintSummary by CodeRabbit
New Features
Documentation
Tests