Skip to content

Conversation

@Abin-Antu
Copy link
Contributor

@Abin-Antu Abin-Antu commented Feb 11, 2026

Description

Introduces a new pdf package providing essential PDF manipulation utilities for the library:

  • ConvertHTMLToPDF(html, outputPath) — Converts basic HTML content into a PDF document. Supports headings (<h1><h6>), paragraphs, bold/italic, links, line breaks, and local images.
  • ExtractTextFromPDF(inputPath) — Extracts plain text content from all pages of a PDF file.
  • MergePDFs(inputFiles, outputFile) — Combines multiple PDF files into a single document in sequence.
  • SplitPDF(inputFile, pageRanges, outputDir) — Splits a PDF into smaller files based on page ranges (e.g., "1-3", "5", "7-10").

New Dependencies

Package Purpose License
github.com/pdfcpu/pdfcpu PDF merge & split Apache 2.0
github.com/ledongthuc/pdf Text extraction BSD
github.com/go-pdf/fpdf HTML → PDF generation MIT

Files Added

  • pdf/pdf.go — Main implementation (4 exported functions)
  • pdf/helpers.go — Internal page range parser
  • pdf/pdf_test.go — 28 table-driven tests
  • pdf/README.md — Package documentation
  • pdf/EXAMPLES.md — Usage examples

Files Modified

  • go.mod / go.sum — Added new dependencies
  • README.md — Added pdf row to utility packages table

Checklist

  • Tests Passing: make test
  • Golint Passing: make lint
  • Updated the utility package EXAMPLES.md
  • Updated the utility package README.md
  • Updated the main README.md utility packages table

Summary by CodeRabbit

  • New Features

    • Added PDF utilities: convert HTML to PDF, extract text from PDFs, merge multiple PDFs, and split PDFs by page ranges.
  • Documentation

    • Added PDF package documentation with usage examples and feature descriptions.
  • Tests

    • Added comprehensive test suite for all PDF utility functions.

…n, text extraction, merging, and splitting, along with corresponding tests and documentation.
Signed-off-by: Abin Antu <abincantug@gmail.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 11, 2026

Walkthrough

Introduces a new PDF utilities package with four public functions for HTML-to-PDF conversion, text extraction, merging, and splitting PDFs. Includes comprehensive documentation, examples, tests, and updates Go toolchain to 1.24.1 with corresponding dependency updates.

Changes

Cohort / File(s) Summary
Module Updates
go.mod
Bumps Go toolchain from 1.24.0 to 1.24.1 and replaces indirect dependency entry with multiline require block introducing new PDF-related dependencies (fpdf, pdfcpu, ledongthuc/pdf, pkcs7, tiff, lzw, etc.).
Documentation
README.md, pdf/README.md, pdf/EXAMPLES.md
Adds PDF feature entry to main features table and creates dedicated PDF module documentation with API descriptions and usage examples for ConvertHTMLToPDF, ExtractTextFromPDF, MergePDFs, and SplitPDF functions.
PDF Utilities Implementation
pdf/pdf.go, pdf/helpers.go
Introduces new pdf package with four exported functions: ConvertHTMLToPDF (HTML-to-PDF with element rendering), ExtractTextFromPDF (text extraction from PDF pages), MergePDFs (multiple PDF concatenation), SplitPDF (page range-based splitting). Includes internal helpers for HTML node traversal, image embedding, and page range parsing.
Test Suite
pdf/pdf_test.go
Comprehensive test coverage for all PDF utilities including HTML-to-PDF conversion variants, text extraction scenarios, PDF merging with invalid inputs, page splitting with single and multiple ranges, and page range parsing edge cases.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • [FEATURE] PDF Utilities #96: This PR directly implements the PDF utilities package with all four core functions (ConvertHTMLToPDF, ExtractTextFromPDF, MergePDFs, SplitPDF) and supporting infrastructure that fulfill the objectives described in this related issue.

Suggested labels

documentation

Suggested reviewers

  • shahzadhaider1
  • kashifkhan0771
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 64.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding a new PDF utility package with HTML conversion functionality, which aligns with the changeset's core objective.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Abin-Antu
Copy link
Contributor Author

@kashifkhan0771

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@go.mod`:
- Around line 15-28: The go.mod lists github.com/go-pdf/fpdf,
github.com/ledongthuc/pdf, and github.com/pdfcpu/pdfcpu as indirect even though
they are directly imported in pdf/pdf.go and pdf/pdf_test.go; fix by running "go
mod tidy" to regenerate module requirements so those packages are recorded as
direct requirements (or manually remove the "// indirect" comments and update
the require block), ensuring the imports in pdf/pdf.go and pdf/pdf_test.go match
the go.mod entries.

In `@pdf/pdf_test.go`:
- Line 20: The test uses string(rune('0'+i)) inside the p.CellFormat call which
only produces correct single digits for i<10; replace that expression with a
formatted decimal using fmt.Sprintf (e.g., build the page text with
fmt.Sprintf("Page content for page %d", i)) and add "fmt" to the import block so
the test prints correct page numbers for i >= 10; update the p.CellFormat
invocation in pdf_test.go accordingly.

In `@pdf/pdf.go`:
- Around line 64-127: renderNode currently resets fonts to a hard-coded
Arial/""/12 which breaks nested inline styles; fix by adding a font state stack
(e.g., type fontState { family, style string; size float64 }) and modify
renderNode to push the current fontState when entering an element (use for
inline tags like "b","strong","i","em","a") and pop/restore it after processing
children, updating calls that set fonts (SetFont) to update the top of the stack
rather than directly resetting to Arial/""/12; ensure renderImage and
block-level resets still set an explicit base state when appropriate.
- Around line 194-212: MergePDFs currently mutates the caller's input slice by
writing cleaned paths back into inputFiles[i]; instead, create and use a new
local slice (e.g., cleanedInputs or inputsCopy) to store filepath.Clean results
and file existence checks, leaving the original inputFiles untouched, and then
pass that new slice into api.MergeCreateFile; refer to the MergePDFs function
and the inputFiles variable when applying this change.
🧹 Nitpick comments (3)
pdf/pdf_test.go (1)

275-376: Good coverage for SplitPDF.

Covers single range, multiple ranges, single page, empty/invalid/reversed ranges, and nonexistent file handling. Consider adding a test for a page range that exceeds the PDF's actual page count to verify how pdfcpu.TrimFile behaves in that scenario.

pdf/pdf.go (1)

130-152: renderImage silently swallows errors — consider logging or returning an error indicator.

When src is empty (line 139) or the file doesn't exist (line 146), the function silently returns without any indication to the caller. For a utility library, silently skipping broken image references may surprise users. Consider at minimum documenting this behavior, or returning a boolean/error that renderNode can act on.

Also, note that filepath.Clean alone does not restrict the path to a safe directory — if untrusted HTML is ever processed, this could read arbitrary local files. Worth documenting that ConvertHTMLToPDF should only be used with trusted HTML content.

pdf/README.md (1)

1-12: LGTM — consider adding a brief "Limitations" section.

The README accurately documents the four functions. You may want to add a short note about limitations (no CSS/JS support, basic HTML only) to set expectations, since ConvertHTMLToPDF's godoc mentions these constraints but the README doesn't.

Comment on lines +15 to +28
require (
github.com/clipperhouse/uax29/v2 v2.2.0 // indirect
github.com/go-pdf/fpdf v0.9.0 // indirect
github.com/hhrutter/lzw v1.0.0 // indirect
github.com/hhrutter/pkcs7 v0.2.0 // indirect
github.com/hhrutter/tiff v1.0.2 // indirect
github.com/ledongthuc/pdf v0.0.0-20250511090121-5959a4027728 // indirect
github.com/mattn/go-runewidth v0.0.19 // indirect
github.com/pdfcpu/pdfcpu v0.11.1 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/rivo/uniseg v0.4.7 // indirect
golang.org/x/crypto v0.46.0 // indirect
gopkg.in/yaml.v2 v2.4.0 // indirect
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify which packages are directly imported by non-test Go files in the pdf/ directory
echo "=== Go files in pdf/ directory ==="
find pdf -name "*.go" -type f | head -20

echo ""
echo "=== Direct imports in pdf/pdf.go ==="
if [ -f "pdf/pdf.go" ]; then
  rg -n '^\s+"github\.com/' pdf/pdf.go | head -30
else
  echo "pdf/pdf.go not found, checking for any Go files in pdf/"
  find pdf -name "*.go" -type f -exec head -50 {} \;
fi

echo ""
echo "=== Current go.mod require block ==="
sed -n '/^require (/,/^)/p' go.mod

Repository: kashifkhan0771/utils

Length of output: 1130


🏁 Script executed:

#!/bin/bash
echo "=== Direct imports in pdf/helpers.go ==="
if [ -f "pdf/helpers.go" ]; then
  rg -n '^\s+"github\.com/' pdf/helpers.go
fi

echo ""
echo "=== Direct imports in pdf/pdf_test.go ==="
if [ -f "pdf/pdf_test.go" ]; then
  rg -n '^\s+"github\.com/' pdf/pdf_test.go
fi

Repository: kashifkhan0771/utils

Length of output: 175


Direct dependencies incorrectly marked as // indirect.

github.com/go-pdf/fpdf, github.com/ledongthuc/pdf, and github.com/pdfcpu/pdfcpu are directly imported in pdf/pdf.go (lines 17–20) and pdf/pdf_test.go (line 9), yet they are listed with // indirect comments in go.mod. Direct imports should not be marked as indirect dependencies.

Running go mod tidy will automatically correct these classifications and move them to the main require block.

🤖 Prompt for AI Agents
In `@go.mod` around lines 15 - 28, The go.mod lists github.com/go-pdf/fpdf,
github.com/ledongthuc/pdf, and github.com/pdfcpu/pdfcpu as indirect even though
they are directly imported in pdf/pdf.go and pdf/pdf_test.go; fix by running "go
mod tidy" to regenerate module requirements so those packages are recorded as
direct requirements (or manually remove the "// indirect" comments and update
the require block), ensuring the imports in pdf/pdf.go and pdf/pdf_test.go match
the go.mod entries.

for i := 1; i <= pages; i++ {
p.AddPage()
p.SetFont("Arial", "", 12)
p.CellFormat(0, 10, "Page content for page "+string(rune('0'+i)), "", 1, "", false, 0, "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

string(rune('0'+i)) breaks for page counts > 9.

This produces the correct digit character only for i in 1–9. For i >= 10, it yields non-digit characters (e.g., ':' for 10). Use fmt.Sprintf instead.

Proposed fix
-		p.CellFormat(0, 10, "Page content for page "+string(rune('0'+i)), "", 1, "", false, 0, "")
+		p.CellFormat(0, 10, fmt.Sprintf("Page content for page %d", i), "", 1, "", false, 0, "")

You'll also need to add "fmt" to the import block.

🤖 Prompt for AI Agents
In `@pdf/pdf_test.go` at line 20, The test uses string(rune('0'+i)) inside the
p.CellFormat call which only produces correct single digits for i<10; replace
that expression with a formatted decimal using fmt.Sprintf (e.g., build the page
text with fmt.Sprintf("Page content for page %d", i)) and add "fmt" to the
import block so the test prints correct page numbers for i >= 10; update the
p.CellFormat invocation in pdf_test.go accordingly.

Comment on lines +64 to +127
func renderNode(p *fpdf.Fpdf, n *html.Node) {
if n.Type == html.TextNode {
text := strings.TrimSpace(n.Data)
if text != "" {
p.Write(6, text+" ")
}
return
}

if n.Type == html.ElementNode {
switch n.Data {
case "h1":
p.Ln(4)
p.SetFont("Arial", "B", 24)
case "h2":
p.Ln(4)
p.SetFont("Arial", "B", 20)
case "h3":
p.Ln(3)
p.SetFont("Arial", "B", 16)
case "h4":
p.Ln(3)
p.SetFont("Arial", "B", 14)
case "h5":
p.Ln(2)
p.SetFont("Arial", "B", 12)
case "h6":
p.Ln(2)
p.SetFont("Arial", "B", 10)
case "p":
p.Ln(4)
p.SetFont("Arial", "", 12)
case "b", "strong":
p.SetFont("Arial", "B", 12)
case "i", "em":
p.SetFont("Arial", "I", 12)
case "br":
p.Ln(6)
case "a":
p.SetFont("Arial", "U", 12)
case "img":
renderImage(p, n)
return
}
}

for c := n.FirstChild; c != nil; c = c.NextSibling {
renderNode(p, c)
}

// Reset font after block-level elements
if n.Type == html.ElementNode {
switch n.Data {
case "h1", "h2", "h3", "h4", "h5", "h6":
p.Ln(4)
p.SetFont("Arial", "", 12)
case "p":
p.Ln(4)
p.SetFont("Arial", "", 12)
case "b", "strong", "i", "em", "a":
p.SetFont("Arial", "", 12)
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Nested inline elements break font state.

renderNode doesn't save/restore font state — it hard-codes resets to Arial/""/12. This means nested formatting like <p><b>bold <i>bold-italic</i> bold</b></p> won't render correctly: exiting <i> resets to regular (losing bold), and exiting <b> resets again.

A lightweight fix is to track the current font style in a struct and push/pop it on element entry/exit. Alternatively, document this as a known limitation if nested inline formatting isn't a priority.

Minimal sketch using a style stack
// At the top of pdf.go or in a separate helper:
type fontState struct {
	family string
	style  string
	size   float64
}

// In renderNode, before processing children of inline elements,
// push the current state and pop it after children are processed.

This is a non-trivial change, so an acceptable alternative is to document the limitation clearly in the function's godoc and README.

🤖 Prompt for AI Agents
In `@pdf/pdf.go` around lines 64 - 127, renderNode currently resets fonts to a
hard-coded Arial/""/12 which breaks nested inline styles; fix by adding a font
state stack (e.g., type fontState { family, style string; size float64 }) and
modify renderNode to push the current fontState when entering an element (use
for inline tags like "b","strong","i","em","a") and pop/restore it after
processing children, updating calls that set fonts (SetFont) to update the top
of the stack rather than directly resetting to Arial/""/12; ensure renderImage
and block-level resets still set an explicit base state when appropriate.

Comment on lines +194 to +212
func MergePDFs(inputFiles []string, outputFile string) error {
if len(inputFiles) < 2 {
return fmt.Errorf("at least two input files are required for merging")
}

outputFile = filepath.Clean(outputFile)

for i, f := range inputFiles {
inputFiles[i] = filepath.Clean(f)
if _, err := os.Stat(inputFiles[i]); err != nil {
return fmt.Errorf("input file %q does not exist: %w", inputFiles[i], err)
}
}

conf := model.NewDefaultConfiguration()
conf.ValidationMode = model.ValidationRelaxed

return api.MergeCreateFile(inputFiles, outputFile, false, conf)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

MergePDFs mutates the caller's input slice.

Line 202 writes back into inputFiles[i], which modifies the caller's slice in-place. This is an unexpected side effect for a public API.

Proposed fix: work on a copy
 func MergePDFs(inputFiles []string, outputFile string) error {
 	if len(inputFiles) < 2 {
 		return fmt.Errorf("at least two input files are required for merging")
 	}
 
 	outputFile = filepath.Clean(outputFile)
 
-	for i, f := range inputFiles {
-		inputFiles[i] = filepath.Clean(f)
-		if _, err := os.Stat(inputFiles[i]); err != nil {
-			return fmt.Errorf("input file %q does not exist: %w", inputFiles[i], err)
+	cleaned := make([]string, len(inputFiles))
+	for i, f := range inputFiles {
+		cleaned[i] = filepath.Clean(f)
+		if _, err := os.Stat(cleaned[i]); err != nil {
+			return fmt.Errorf("input file %q does not exist: %w", cleaned[i], err)
 		}
 	}
 
 	conf := model.NewDefaultConfiguration()
 	conf.ValidationMode = model.ValidationRelaxed
 
-	return api.MergeCreateFile(inputFiles, outputFile, false, conf)
+	return api.MergeCreateFile(cleaned, outputFile, false, conf)
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func MergePDFs(inputFiles []string, outputFile string) error {
if len(inputFiles) < 2 {
return fmt.Errorf("at least two input files are required for merging")
}
outputFile = filepath.Clean(outputFile)
for i, f := range inputFiles {
inputFiles[i] = filepath.Clean(f)
if _, err := os.Stat(inputFiles[i]); err != nil {
return fmt.Errorf("input file %q does not exist: %w", inputFiles[i], err)
}
}
conf := model.NewDefaultConfiguration()
conf.ValidationMode = model.ValidationRelaxed
return api.MergeCreateFile(inputFiles, outputFile, false, conf)
}
func MergePDFs(inputFiles []string, outputFile string) error {
if len(inputFiles) < 2 {
return fmt.Errorf("at least two input files are required for merging")
}
outputFile = filepath.Clean(outputFile)
cleaned := make([]string, len(inputFiles))
for i, f := range inputFiles {
cleaned[i] = filepath.Clean(f)
if _, err := os.Stat(cleaned[i]); err != nil {
return fmt.Errorf("input file %q does not exist: %w", cleaned[i], err)
}
}
conf := model.NewDefaultConfiguration()
conf.ValidationMode = model.ValidationRelaxed
return api.MergeCreateFile(cleaned, outputFile, false, conf)
}
🤖 Prompt for AI Agents
In `@pdf/pdf.go` around lines 194 - 212, MergePDFs currently mutates the caller's
input slice by writing cleaned paths back into inputFiles[i]; instead, create
and use a new local slice (e.g., cleanedInputs or inputsCopy) to store
filepath.Clean results and file existence checks, leaving the original
inputFiles untouched, and then pass that new slice into api.MergeCreateFile;
refer to the MergePDFs function and the inputFiles variable when applying this
change.

@kashifkhan0771 kashifkhan0771 linked an issue Feb 11, 2026 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] PDF Utilities

1 participant