Goodies: The All-in-One Web Scraper & HTML-to-Markdown Converter

Golly & Gong are two peas in a pod, which is to say, I've been too lazy to merge them into one properly powerful, pure-Go web scraper that fetches over HTTP, parses HTML, and generates Markdown. Golly scrapes the link and returns HTML; Gong converts that raw HTML into clean Markdown. Surprisingly, Gong in particular has performed considerably better than every other HTML-to-Markdown solution I've tried. I didn't set out to build the most fuego, supreme Markdown generator (and it probably isn't, I know), but no shit, it works better than everything else I've tried.

Both project summaries are found in their respective XML files, aptly named.

Goodies is a powerful Go-based command-line tool that combines web scraping capabilities with HTML-to-Markdown conversion. It can scrape web pages with fine-grained control, extract specific content using CSS selectors, and convert HTML content (from web pages or local files) into clean Markdown format.

Features

🌐 Web Scraping: Fetch and extract content from websites
🎯 CSS Selector Targeting: Extract specific DOM elements using CSS selectors
📁 Local File Processing: Convert local HTML files to Markdown
📂 Batch Processing: Recursively convert entire directories of HTML files
🔄 Multiple Output Formats: JSON, plain text, raw content, HTML, complete HTML (with inlined resources), and Markdown
⚡ Concurrent Scraping: Parallel processing with configurable limits
🎨 Resource Inlining: Automatically inline CSS and JavaScript for complete HTML output
📝 Flexible Input: Accept URLs, local files, or stdin input
🛡️ Robust Error Handling: Graceful error recovery and logging

Installation

Prerequisites

Go 1.16 or higher

From Source

git clone <repository-url>
cd goodies
go build -o goodies

Using go install

go install github.com/your-username/goodies@latest

Usage

Basic Syntax

goodies [flags] <URL|FILE|DIR>

Flags

| Flag | Description | Default |
|------|-------------|---------|
| `-o` | Output file path | (stdout) |
| `-r` | Recursively process directories (Markdown conversion mode only) | false |
| `-s` | CSS selector to target (e.g., 'article', '#content') | "" |
| `-a` | User Agent string | "Mozilla/5.0 (Compatible; Goodies/1.0)" |
| `-f` | Output format: `complete`, `html`, `text`, `json`, `raw`, `md` | `complete` |
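
For readers who want to see how a flag set like this could be wired up, here is a minimal sketch using Go's standard `flag` package. It is not the Goodies source: the `options` struct, its field names, and `parseArgs` are hypothetical, and only the flag names and defaults come from the table above. One behavior worth knowing if the tool does use the standard package: parsing stops at the first non-flag argument, so flags have to come before the URL, file, or directory.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"os"
)

// options mirrors the documented flags. The struct and its field names are
// illustrative assumptions, not taken from the Goodies source.
type options struct {
	Output    string
	Recursive bool
	Selector  string
	UserAgent string
	Format    string
	Input     string // positional <URL|FILE|DIR>; empty means "read stdin"
}

// parseArgs parses args (without the program name) into options.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("goodies", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet; errors are returned instead
	fs.StringVar(&o.Output, "o", "", "output file path (default stdout)")
	fs.BoolVar(&o.Recursive, "r", false, "recursively process directories")
	fs.StringVar(&o.Selector, "s", "", "CSS selector to target")
	fs.StringVar(&o.UserAgent, "a", "Mozilla/5.0 (Compatible; Goodies/1.0)", "user agent string")
	fs.StringVar(&o.Format, "f", "complete", "output format")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	// The flag package stops at the first non-flag argument, so anything
	// after the input (including misplaced flags) shows up here.
	if fs.NArg() > 1 {
		return o, errors.New("expected at most one URL, file, or directory; put flags first")
	}
	o.Input = fs.Arg(0)
	return o, nil
}

func main() {
	o, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", o)
}
```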

Usage Examples

1. Basic Web Scraping

Scrape a webpage and output complete HTML (default):

goodies https://example.com

Scrape with a custom user agent:

goodies -a "MyBot/1.0" https://example.com

2. Targeted Content Extraction

Extract only the article content:

goodies -s "article" https://news-site.com/article-123

Extract content from a specific div:

goodies -s "#main-content" https://blog.example.com/post

3. Different Output Formats

JSON output (full structured data):

goodies -f json https://example.com

Plain text output:

goodies -f text https://example.com

Raw content only:

goodies -f raw https://example.com

Original HTML:

goodies -f html https://example.com

Complete HTML (with inlined resources):

goodies -f complete https://example.com

4. Convert to Markdown

Scrape and convert to Markdown:

goodies -f md https://example.com/blog/post

Scrape specific content and convert to Markdown:

goodies -s ".post-content" -f md https://example.com/blog/post

Save output to file:

goodies -f md -o article.md https://example.com

5. Local File Processing

Convert a local HTML file to Markdown:

goodies -f md file.html

Convert with selector targeting:

goodies -s "#content" -f md local-file.html

Save to specific output file:

goodies -f md -o converted.md file.html

6. Batch Processing

Recursively convert all HTML files in a directory:

goodies -r ./docs

This will find all .html and .htm files in the ./docs directory and create corresponding .md files.

7. Using STDIN

Pipe HTML content from another command:

curl -s https://example.com | goodies -f md

Process HTML piped from a local file:

cat input.html | goodies -f md -o output.md

8. Advanced Scraping Scenarios

Scrape multiple pages programmatically by editing the URLs list in the configuration. (Note: the configuration supports multiple URLs, but the CLI currently accepts a single input.)

Extract all links from a page:

goodies -f json https://example.com | jq '.Links'

Extract images from a page:

goodies -f json https://example.com | jq '.Images'
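
The same JSON can be consumed from Go instead of `jq`. The sketch below counts the entries under the `Links` and `Images` keys used in the commands above. Because this README does not say whether each entry is a plain string or an object, the elements are kept as raw JSON; `countItems` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// countItems returns the number of elements in the top-level JSON array
// stored under key. Elements stay as raw JSON because their exact shape
// (strings or objects) is not specified in the README.
func countItems(data []byte, key string) (int, error) {
	var doc map[string]json.RawMessage
	if err := json.Unmarshal(data, &doc); err != nil {
		return 0, err
	}
	raw, ok := doc[key]
	if !ok {
		return 0, fmt.Errorf("key %q not found", key)
	}
	var items []json.RawMessage
	if err := json.Unmarshal(raw, &items); err != nil {
		return 0, err
	}
	return len(items), nil
}

func main() {
	// Intended use: goodies -f json https://example.com | go run .
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		os.Exit(1)
	}
	for _, key := range []string{"Links", "Images"} {
		n, err := countItems(data, key)
		if err != nil {
			fmt.Fprintln(os.Stderr, key+":", err)
			continue
		}
		fmt.Printf("%s: %d\n", key, n)
	}
}
```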

Output Formats Explained

`json`

Returns a structured JSON object containing:

URL, title, status code
Extracted content
All links and images found
HTML structure (head, body, full)
CSS and JavaScript references
Timestamp

`text`

Human-readable formatted text with section headers and extracted content.

`raw`

Only the text content extracted from the targeted selector.

`html`

The original HTML of the targeted element or full page.

`complete`

A complete HTML document with external CSS and JavaScript resources inlined, making it self-contained.

`md`

Markdown converted from the HTML content using the html-to-markdown library with CommonMark extensions.

Advanced Configuration

While the CLI provides basic options, you can customize the scraper further by modifying the GollyArgs structure in code:

config := &GollyArgs{
    URLs:           []string{"https://example.com"},
    UserAgent:      "CustomBot/1.0",
    Headers:        map[string]string{"Authorization": "Bearer token"},
    Delay:          2 * time.Second,
    Parallelism:    5,
    TargetSelector: ".content",
    OutputFormat:   "md",
    EnableDebug:    true,
    AllowedDomains: []string{"example.com"},
}

Error Handling

Failed scrapes are logged but don't stop processing of other URLs
Missing selectors generate warnings but continue processing
Invalid URLs or network errors are reported with details
File system errors in batch mode are reported per file

Performance Tips

Use appropriate delays when scraping multiple pages to avoid overloading servers
Limit parallelism for sensitive targets
Use selectors to extract only needed content, reducing memory usage
For batch processing, ensure sufficient system resources for large directories

Limitations

JavaScript-rendered content cannot be scraped (static HTML only)
Rate limiting is basic; respect websites' robots.txt manually
Very large pages may cause memory issues
Complex CSS selectors might not work as expected with the HTML-to-Markdown conversion

Development

Dependencies

github.com/gocolly/colly/v2 - Web scraping framework
github.com/PuerkitoBio/goquery - jQuery-like HTML parsing
github.com/JohannesKaufmann/html-to-markdown/v2 - HTML to Markdown conversion

Building and Testing

# Build
go build -o goodies

# Run tests (if available)
go test ./...

# Cross-compile for different platforms
GOOS=linux GOARCH=amd64 go build -o goodies-linux
GOOS=windows GOARCH=amd64 go build -o goodies.exe

Contributing

Fork the repository
Create a feature branch
Make changes with appropriate tests
Submit a pull request

License

[Specify license here]

Support

For issues and feature requests, please use the issue tracker on the repository.

Goodies combines the power of Go's concurrency with robust scraping and conversion libraries, making it an ideal tool for content migration, archiving, and web data extraction tasks.
