feat: Add PDF utility package with functionalities for HTML conversio… #225

Abin-Antu · 2026-02-11T01:38:22Z

Description

Introduces a new pdf package providing essential PDF manipulation utilities for the library:

- `ConvertHTMLToPDF(html, outputPath)`: Converts basic HTML content into a PDF document. Supports headings (<h1>–<h6>), paragraphs, bold/italic, links, line breaks, and local images.
- `ExtractTextFromPDF(inputPath)`: Extracts plain text content from all pages of a PDF file.
- `MergePDFs(inputFiles, outputFile)`: Combines multiple PDF files into a single document in sequence.
- `SplitPDF(inputFile, pageRanges, outputDir)`: Splits a PDF into smaller files based on page ranges (e.g., "1-3", "5", "7-10").
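The page-range strings accepted by SplitPDF ("1-3", "5", "7-10") could be parsed along the following lines. This is only a sketch under stated assumptions: the `pageRange` type and the `parsePageRange` name are hypothetical and are not taken from pdf/helpers.go, whose actual parser may differ in naming, validation, and error messages.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pageRange is a hypothetical inclusive, 1-based range of pages.
type pageRange struct {
	Start, End int
}

// parsePageRange parses "N" or "N-M" into a pageRange.
// It rejects empty input, non-numeric parts, pages below 1, and reversed ranges.
func parsePageRange(s string) (pageRange, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return pageRange{}, fmt.Errorf("empty page range")
	}

	// Split on the first "-" only; "1-2-3" leaves "2-3" as the end part,
	// which then fails numeric parsing.
	first, last, isRange := strings.Cut(s, "-")

	start, err := strconv.Atoi(strings.TrimSpace(first))
	if err != nil {
		return pageRange{}, fmt.Errorf("invalid start page in %q: %w", s, err)
	}

	end := start // a single page "5" is treated as the range 5-5
	if isRange {
		end, err = strconv.Atoi(strings.TrimSpace(last))
		if err != nil {
			return pageRange{}, fmt.Errorf("invalid end page in %q: %w", s, err)
		}
	}

	if start < 1 {
		return pageRange{}, fmt.Errorf("page numbers start at 1, got %d", start)
	}
	if end < start {
		return pageRange{}, fmt.Errorf("reversed page range %q", s)
	}
	return pageRange{Start: start, End: end}, nil
}

func main() {
	for _, in := range []string{"1-3", "5", "7-10", "3-1", "abc"} {
		r, err := parsePageRange(in)
		if err != nil {
			fmt.Printf("%q: error: %v\n", in, err)
			continue
		}
		fmt.Printf("%q: pages %d through %d\n", in, r.Start, r.End)
	}
}
```

The sketch does not check the range against the document's real page count; that check would need the page count of the input file.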

New Dependencies

| Package | Purpose | License |
| --- | --- | --- |
| `github.com/pdfcpu/pdfcpu` | PDF merge & split | Apache 2.0 |
| `github.com/ledongthuc/pdf` | Text extraction | BSD |
| `github.com/go-pdf/fpdf` | HTML → PDF generation | MIT |

Files Added

- `pdf/pdf.go`: Main implementation (4 exported functions)
- `pdf/helpers.go`: Internal page range parser
- `pdf/pdf_test.go`: 28 table-driven tests
- `pdf/README.md`: Package documentation
- `pdf/EXAMPLES.md`: Usage examples

Files Modified

- `go.mod` / `go.sum`: Added new dependencies
- `README.md`: Added pdf row to utility packages table

Checklist

Tests Passing: make test ✅
Golint Passing: make lint
Updated the utility package EXAMPLES.md
Updated the utility package README.md
Updated the main README.md utility packages table

Summary by CodeRabbit

New Features
- Added PDF utilities: convert HTML to PDF, extract text from PDFs, merge multiple PDFs, and split PDFs by page ranges.
Documentation
- Added PDF package documentation with usage examples and feature descriptions.
Tests
- Added comprehensive test suite for all PDF utility functions.

…n, text extraction, merging, and splitting, along with corresponding tests and documentation.

Signed-off-by: Abin Antu <abincantug@gmail.com>

coderabbitai · 2026-02-11T01:39:38Z

Walkthrough

Introduces a new PDF utilities package with four public functions for HTML-to-PDF conversion, text extraction, merging, and splitting PDFs. Includes comprehensive documentation, examples, tests, and updates Go toolchain to 1.24.1 with corresponding dependency updates.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Module Updates `go.mod` | Bumps Go toolchain from 1.24.0 to 1.24.1 and replaces indirect dependency entry with multiline require block introducing new PDF-related dependencies (fpdf, pdfcpu, ledongthuc/pdf, pkcs7, tiff, lzw, etc.). |
| Documentation `README.md`, `pdf/README.md`, `pdf/EXAMPLES.md` | Adds PDF feature entry to main features table and creates dedicated PDF module documentation with API descriptions and usage examples for ConvertHTMLToPDF, ExtractTextFromPDF, MergePDFs, and SplitPDF functions. |
| PDF Utilities Implementation `pdf/pdf.go`, `pdf/helpers.go` | Introduces new pdf package with four exported functions: ConvertHTMLToPDF (HTML-to-PDF with element rendering), ExtractTextFromPDF (text extraction from PDF pages), MergePDFs (multiple PDF concatenation), SplitPDF (page range-based splitting). Includes internal helpers for HTML node traversal, image embedding, and page range parsing. |
| Test Suite `pdf/pdf_test.go` | Comprehensive test coverage for all PDF utilities including HTML-to-PDF conversion variants, text extraction scenarios, PDF merging with invalid inputs, page splitting with single and multiple ranges, and page range parsing edge cases. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

[FEATURE] PDF Utilities #96: This PR directly implements the PDF utilities package with all four core functions (ConvertHTMLToPDF, ExtractTextFromPDF, MergePDFs, SplitPDF) and supporting infrastructure that fulfill the objectives described in this related issue.

Suggested labels

documentation

Suggested reviewers

shahzadhaider1
kashifkhan0771

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 64.29% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding a new PDF utility package with HTML conversion functionality, which aligns with the changeset's core objective. |

Abin-Antu · 2026-02-11T01:40:08Z

@kashifkhan0771

coderabbitai

Actionable comments posted: 4

🤖 Fix all issues with AI agents

In `@go.mod`:
- Around line 15-28: The go.mod lists github.com/go-pdf/fpdf,
github.com/ledongthuc/pdf, and github.com/pdfcpu/pdfcpu as indirect even though
they are directly imported in pdf/pdf.go and pdf/pdf_test.go; fix by running "go
mod tidy" to regenerate module requirements so those packages are recorded as
direct requirements (or manually remove the "// indirect" comments and update
the require block), ensuring the imports in pdf/pdf.go and pdf/pdf_test.go match
the go.mod entries.

In `@pdf/pdf_test.go`:
- Line 20: The test uses string(rune('0'+i)) inside the p.CellFormat call which
only produces correct single digits for i<10; replace that expression with a
formatted decimal using fmt.Sprintf (e.g., build the page text with
fmt.Sprintf("Page content for page %d", i)) and add "fmt" to the import block so
the test prints correct page numbers for i >= 10; update the p.CellFormat
invocation in pdf_test.go accordingly.

In `@pdf/pdf.go`:
- Around line 64-127: renderNode currently resets fonts to a hard-coded
Arial/""/12 which breaks nested inline styles; fix by adding a font state stack
(e.g., type fontState { family, style string; size float64 }) and modify
renderNode to push the current fontState when entering an element (use for
inline tags like "b","strong","i","em","a") and pop/restore it after processing
children, updating calls that set fonts (SetFont) to update the top of the stack
rather than directly resetting to Arial/""/12; ensure renderImage and
block-level resets still set an explicit base state when appropriate.
- Around line 194-212: MergePDFs currently mutates the caller's input slice by
writing cleaned paths back into inputFiles[i]; instead, create and use a new
local slice (e.g., cleanedInputs or inputsCopy) to store filepath.Clean results
and file existence checks, leaving the original inputFiles untouched, and then
pass that new slice into api.MergeCreateFile; refer to the MergePDFs function
and the inputFiles variable when applying this change.

🧹 Nitpick comments (3)

pdf/pdf_test.go (1)

275-376: Good coverage for SplitPDF.

Covers single range, multiple ranges, single page, empty/invalid/reversed ranges, and nonexistent file handling. Consider adding a test for a page range that exceeds the PDF's actual page count to verify how pdfcpu.TrimFile behaves in that scenario.

pdf/pdf.go (1)

130-152: renderImage silently swallows errors — consider logging or returning an error indicator.

When src is empty (line 139) or the file doesn't exist (line 146), the function silently returns without any indication to the caller. For a utility library, silently skipping broken image references may surprise users. Consider at minimum documenting this behavior, or returning a boolean/error that renderNode can act on.

Also, note that filepath.Clean alone does not restrict the path to a safe directory — if untrusted HTML is ever processed, this could read arbitrary local files. Worth documenting that ConvertHTMLToPDF should only be used with trusted HTML content.
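To illustrate the point about filepath.Clean, the sketch below shows one possible way to confine image paths to a base directory using only the standard library. The `resolveWithin` helper and the "assets" directory are hypothetical and not part of this PR; the approach relies on `filepath.IsLocal` (available since Go 1.20) and does not resolve symlinks, so it should be read as a starting point rather than a complete safeguard.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveWithin joins src onto baseDir, rejecting paths that are empty,
// absolute, or that climb out of baseDir via "..".
// It is a lexical check only and does not follow or resolve symlinks.
func resolveWithin(baseDir, src string) (string, error) {
	if !filepath.IsLocal(src) {
		return "", fmt.Errorf("image path %q is not confined to the base directory", src)
	}
	return filepath.Join(baseDir, src), nil
}

func main() {
	// filepath.Clean normalizes the path but keeps a leading "..",
	// so on its own it does not stop a path from leaving the directory.
	fmt.Println(filepath.Clean("assets/../../etc/passwd"))

	for _, src := range []string{"img/logo.png", "../../etc/passwd", "/etc/passwd"} {
		p, err := resolveWithin("assets", src)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", p)
	}
}
```

Whether ConvertHTMLToPDF should take a base directory at all is a design question for the maintainers; documenting the trusted-input requirement, as suggested above, is the lighter alternative.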

pdf/README.md (1)

1-12: LGTM — consider adding a brief "Limitations" section.

The README accurately documents the four functions. You may want to add a short note about limitations (no CSS/JS support, basic HTML only) to set expectations, since ConvertHTMLToPDF's godoc mentions these constraints but the README doesn't.

coderabbitai · 2026-02-11T01:43:18Z

go.mod

+require (
+	github.com/clipperhouse/uax29/v2 v2.2.0 // indirect
+	github.com/go-pdf/fpdf v0.9.0 // indirect
+	github.com/hhrutter/lzw v1.0.0 // indirect
+	github.com/hhrutter/pkcs7 v0.2.0 // indirect
+	github.com/hhrutter/tiff v1.0.2 // indirect
+	github.com/ledongthuc/pdf v0.0.0-20250511090121-5959a4027728 // indirect
+	github.com/mattn/go-runewidth v0.0.19 // indirect
+	github.com/pdfcpu/pdfcpu v0.11.1 // indirect
+	github.com/pkg/errors v0.9.1 // indirect
+	github.com/rivo/uniseg v0.4.7 // indirect
+	golang.org/x/crypto v0.46.0 // indirect
+	gopkg.in/yaml.v2 v2.4.0 // indirect
+)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify which packages are directly imported by non-test Go files in the pdf/ directory
echo "=== Go files in pdf/ directory ==="
find pdf -name "*.go" -type f | head -20
echo ""
echo "=== Direct imports in pdf/pdf.go ==="
if [ -f "pdf/pdf.go" ]; then
  rg -n '^\s+"github\.com/' pdf/pdf.go | head -30
else
  echo "pdf/pdf.go not found, checking for any Go files in pdf/"
  find pdf -name "*.go" -type f -exec head -50 {} \;
fi
echo ""
echo "=== Current go.mod require block ==="
sed -n '/^require (/,/^)/p' go.mod

Repository: kashifkhan0771/utils

Length of output: 1130

🏁 Script executed:

#!/bin/bash
echo "=== Direct imports in pdf/helpers.go ==="
if [ -f "pdf/helpers.go" ]; then
  rg -n '^\s+"github\.com/' pdf/helpers.go
fi
echo ""
echo "=== Direct imports in pdf/pdf_test.go ==="
if [ -f "pdf/pdf_test.go" ]; then
  rg -n '^\s+"github\.com/' pdf/pdf_test.go
fi

Repository: kashifkhan0771/utils

Length of output: 175

Direct dependencies incorrectly marked as // indirect.

github.com/go-pdf/fpdf, github.com/ledongthuc/pdf, and github.com/pdfcpu/pdfcpu are directly imported in pdf/pdf.go (lines 17–20) and pdf/pdf_test.go (line 9), yet they are listed with // indirect comments in go.mod. Direct imports should not be marked as indirect dependencies.

Running go mod tidy will automatically correct these classifications and move them to the main require block.

coderabbitai · 2026-02-11T01:43:18Z

pdf/pdf_test.go

+	for i := 1; i <= pages; i++ {
+		p.AddPage()
+		p.SetFont("Arial", "", 12)
+		p.CellFormat(0, 10, "Page content for page "+string(rune('0'+i)), "", 1, "", false, 0, "")


⚠️ Potential issue | 🟡 Minor

string(rune('0'+i)) breaks for page counts > 9.

This produces the correct digit character only for i in 1–9. For i >= 10, it yields non-digit characters (e.g., ':' for 10). Use fmt.Sprintf instead.

Proposed fix

-		p.CellFormat(0, 10, "Page content for page "+string(rune('0'+i)), "", 1, "", false, 0, "")
+		p.CellFormat(0, 10, fmt.Sprintf("Page content for page %d", i), "", 1, "", false, 0, "")

You'll also need to add "fmt" to the import block.
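The difference is easy to reproduce outside the test suite. In the sketch below, `naiveLabel` and `formattedLabel` are illustrative names that are not in the PR; the first mirrors the expression from the test helper and the second mirrors the proposed fix.

```go
package main

import "fmt"

// naiveLabel mirrors the expression from the test helper: it offsets
// the rune '0' by i, which only lands on a digit while i is 0 through 9.
func naiveLabel(i int) string {
	return "Page content for page " + string(rune('0'+i))
}

// formattedLabel formats i as a decimal number, so it works for any page count.
func formattedLabel(i int) string {
	return fmt.Sprintf("Page content for page %d", i)
}

func main() {
	// Print both labels side by side for pages around the single-digit boundary.
	for _, i := range []int{9, 10, 11} {
		fmt.Printf("%q vs %q\n", naiveLabel(i), formattedLabel(i))
	}
}
```

For i = 10 the naive version yields the character after '9' in the code table, which is ':', so the two labels diverge as soon as a test PDF has ten or more pages.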

coderabbitai · 2026-02-11T01:43:18Z

pdf/pdf.go

+func renderNode(p *fpdf.Fpdf, n *html.Node) {
+	if n.Type == html.TextNode {
+		text := strings.TrimSpace(n.Data)
+		if text != "" {
+			p.Write(6, text+" ")
+		}
+		return
+	}
+
+	if n.Type == html.ElementNode {
+		switch n.Data {
+		case "h1":
+			p.Ln(4)
+			p.SetFont("Arial", "B", 24)
+		case "h2":
+			p.Ln(4)
+			p.SetFont("Arial", "B", 20)
+		case "h3":
+			p.Ln(3)
+			p.SetFont("Arial", "B", 16)
+		case "h4":
+			p.Ln(3)
+			p.SetFont("Arial", "B", 14)
+		case "h5":
+			p.Ln(2)
+			p.SetFont("Arial", "B", 12)
+		case "h6":
+			p.Ln(2)
+			p.SetFont("Arial", "B", 10)
+		case "p":
+			p.Ln(4)
+			p.SetFont("Arial", "", 12)
+		case "b", "strong":
+			p.SetFont("Arial", "B", 12)
+		case "i", "em":
+			p.SetFont("Arial", "I", 12)
+		case "br":
+			p.Ln(6)
+		case "a":
+			p.SetFont("Arial", "U", 12)
+		case "img":
+			renderImage(p, n)
+			return
+		}
+	}
+
+	for c := n.FirstChild; c != nil; c = c.NextSibling {
+		renderNode(p, c)
+	}
+
+	// Reset font after block-level elements
+	if n.Type == html.ElementNode {
+		switch n.Data {
+		case "h1", "h2", "h3", "h4", "h5", "h6":
+			p.Ln(4)
+			p.SetFont("Arial", "", 12)
+		case "p":
+			p.Ln(4)
+			p.SetFont("Arial", "", 12)
+		case "b", "strong", "i", "em", "a":
+			p.SetFont("Arial", "", 12)
+		}
+	}
+}


⚠️ Potential issue | 🟠 Major

Nested inline elements break font state.

renderNode doesn't save/restore font state — it hard-codes resets to Arial/""/12. This means nested formatting like <p><b>bold <i>bold-italic</i> bold</b></p> won't render correctly: exiting <i> resets to regular (losing bold), and exiting <b> resets again.

A lightweight fix is to track the current font style in a struct and push/pop it on element entry/exit. Alternatively, document this as a known limitation if nested inline formatting isn't a priority.

Minimal sketch using a style stack

// At the top of pdf.go or in a separate helper:
type fontState struct {
	family string
	style  string
	size   float64
}

// In renderNode, before processing children of inline elements,
// push the current state and pop it after children are processed.

This is a non-trivial change, so an acceptable alternative is to document the limitation clearly in the function's godoc and README.
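A slightly fuller version of the push/pop idea can be sketched without any PDF dependency. Everything here beyond the `fontState` fields is hypothetical (`fontStack`, `newFontStack`, `push`, `pop`), and the sketch assumes that the PDF library accepts combined style flags such as "BI" in a single style string, which should be confirmed against the fpdf documentation before relying on it.

```go
package main

import (
	"fmt"
	"strings"
)

// fontState captures the font settings in effect at one nesting level.
type fontState struct {
	family string
	style  string // style flags such as "", "B", "I", or "BI" (assumed combinable)
	size   float64
}

// fontStack is a hypothetical helper that remembers the state of every
// enclosing element so that leaving an element restores its parent's font.
type fontStack struct {
	states []fontState
}

// newFontStack starts the stack with the document's base font.
func newFontStack(base fontState) *fontStack {
	return &fontStack{states: []fontState{base}}
}

// current returns the state in effect right now.
func (s *fontStack) current() fontState {
	return s.states[len(s.states)-1]
}

// push derives a new state by adding flag to the current style and returns it.
func (s *fontStack) push(flag string) fontState {
	next := s.current()
	if !strings.Contains(next.style, flag) {
		next.style += flag
	}
	s.states = append(s.states, next)
	return next
}

// pop drops the innermost state and returns the restored parent state.
// The base state is never removed.
func (s *fontStack) pop() fontState {
	if len(s.states) > 1 {
		s.states = s.states[:len(s.states)-1]
	}
	return s.current()
}

func main() {
	// Simulates <b>bold <i>bold-italic</i> bold</b>.
	stack := newFontStack(fontState{family: "Arial", style: "", size: 12})
	fmt.Printf("enter <b>: %q\n", stack.push("B").style)
	fmt.Printf("enter <i>: %q\n", stack.push("I").style)
	fmt.Printf("leave <i>: %q\n", stack.pop().style)
	fmt.Printf("leave <b>: %q\n", stack.pop().style)
}
```

In renderNode, the state returned by `push` or `pop` would be the one passed to SetFont, in place of the hard-coded Arial/""/12 reset.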

coderabbitai · 2026-02-11T01:43:18Z

pdf/pdf.go

+func MergePDFs(inputFiles []string, outputFile string) error {
+	if len(inputFiles) < 2 {
+		return fmt.Errorf("at least two input files are required for merging")
+	}
+
+	outputFile = filepath.Clean(outputFile)
+
+	for i, f := range inputFiles {
+		inputFiles[i] = filepath.Clean(f)
+		if _, err := os.Stat(inputFiles[i]); err != nil {
+			return fmt.Errorf("input file %q does not exist: %w", inputFiles[i], err)
+		}
+	}
+
+	conf := model.NewDefaultConfiguration()
+	conf.ValidationMode = model.ValidationRelaxed
+
+	return api.MergeCreateFile(inputFiles, outputFile, false, conf)
+}


⚠️ Potential issue | 🟡 Minor

MergePDFs mutates the caller's input slice.

Line 202 writes back into inputFiles[i], which modifies the caller's slice in-place. This is an unexpected side effect for a public API.

Proposed fix: work on a copy

 func MergePDFs(inputFiles []string, outputFile string) error {
 	if len(inputFiles) < 2 {
 		return fmt.Errorf("at least two input files are required for merging")
 	}
 	outputFile = filepath.Clean(outputFile)
-	for i, f := range inputFiles {
-		inputFiles[i] = filepath.Clean(f)
-		if _, err := os.Stat(inputFiles[i]); err != nil {
-			return fmt.Errorf("input file %q does not exist: %w", inputFiles[i], err)
+	cleaned := make([]string, len(inputFiles))
+	for i, f := range inputFiles {
+		cleaned[i] = filepath.Clean(f)
+		if _, err := os.Stat(cleaned[i]); err != nil {
+			return fmt.Errorf("input file %q does not exist: %w", cleaned[i], err)
 		}
 	}
 	conf := model.NewDefaultConfiguration()
 	conf.ValidationMode = model.ValidationRelaxed
-	return api.MergeCreateFile(inputFiles, outputFile, false, conf)
+	return api.MergeCreateFile(cleaned, outputFile, false, conf)
 }
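The side effect can be demonstrated in isolation with the standard library alone. The two helpers below, `cleanInPlace` and `cleanCopy`, are illustrative names that do not appear in the PR; they isolate the slice handling from the rest of MergePDFs.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// cleanInPlace mirrors the current loop: it writes the cleaned path back
// into the slice it was given, which shares its backing array with the caller.
func cleanInPlace(paths []string) {
	for i, p := range paths {
		paths[i] = filepath.Clean(p)
	}
}

// cleanCopy mirrors the proposed fix: it fills a fresh slice and leaves
// the caller's slice untouched.
func cleanCopy(paths []string) []string {
	cleaned := make([]string, len(paths))
	for i, p := range paths {
		cleaned[i] = filepath.Clean(p)
	}
	return cleaned
}

func main() {
	input := []string{"docs//a.pdf", "./b.pdf"}

	cleaned := cleanCopy(input)
	fmt.Println("after cleanCopy:   ", input, cleaned)

	cleanInPlace(input)
	fmt.Println("after cleanInPlace:", input)
}
```

After `cleanCopy` the caller's slice still holds the original strings, while after `cleanInPlace` it holds the normalized ones, which is the behavior a caller of a public function is unlikely to expect.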

Abin-Antu added 2 commits February 11, 2026 01:35

feat: Add PDF utility package with functionalities for HTML conversio…

e56ea43

…n, text extraction, merging, and splitting, along with corresponding tests and documentation.

Merge branch 'main' into pdf

d7307b8

Signed-off-by: Abin Antu <abincantug@gmail.com>

coderabbitai bot requested changes Feb 11, 2026

kashifkhan0771 linked an issue Feb 11, 2026 that may be closed by this pull request

[FEATURE] PDF Utilities #96

Open
