Golly & Gong are two peas in a pod, which is to say, I've been too lazy to merge them into one properly powerful HTTP-fetching, HTML-formulating, Markdown-generating pure-Go web scraper. Golly scrapes the link and returns HTML; Gong converts that raw HTML into purified Markdown. Surprisingly, Gong in particular has performed considerably better than literally every other HTML-to-Markdown solution I've tried. I didn't set out to build the most fuego, supreme Markdown generator (it probably isn't, I know), but no shit, it works better than the other shit I've tried.
Both project summaries are found in their respective XML files, aptly named.
Goodies is a powerful Go-based command-line tool that combines web scraping capabilities with HTML-to-Markdown conversion. It can scrape web pages with fine-grained control, extract specific content using CSS selectors, and convert HTML content (from web pages or local files) into clean Markdown format.
- 🌐 Web Scraping: Fetch and extract content from websites
- 🎯 CSS Selector Targeting: Extract specific DOM elements using CSS selectors
- 📁 Local File Processing: Convert local HTML files to Markdown
- 📂 Batch Processing: Recursively convert entire directories of HTML files
- 🔄 Multiple Output Formats: JSON, plain text, raw content, HTML, complete HTML (with inlined resources), and Markdown
- ⚡ Concurrent Scraping: Parallel processing with configurable limits
- 🎨 Resource Inlining: Automatically inline CSS and JavaScript for complete HTML output
- 📝 Flexible Input: Accept URLs, local files, or stdin input
- 🛡️ Robust Error Handling: Graceful error recovery and logging
- Go 1.16 or higher
```
git clone <repository-url>
cd goodies
go build -o goodies
```

Or install directly:

```
go install github.com/your-username/goodies@latest
```

Usage:

```
goodies [flags] <URL|FILE|DIR>
```

| Flag | Description | Default |
|---|---|---|
| `-o` | Output file path | (stdout) |
| `-r` | Recursively process directories (Markdown conversion mode only) | `false` |
| `-s` | CSS selector to target (e.g., `article`, `#content`) | `""` |
| `-a` | User Agent string | `"Mozilla/5.0 (Compatible; Goodies/1.0)"` |
| `-f` | Output format: `complete`, `html`, `text`, `json`, `raw`, `md` | `complete` |
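For a feel of how the table above could map onto Go's standard `flag` package, here is a sketch; the `options` type, `parseArgs`, and the field names are made up for illustration and are not Goodies' actual code:

```go
package main

import (
	"flag"
	"fmt"
)

// options mirrors the flag table above (a sketch; these names are
// assumptions, not Goodies' real types).
type options struct {
	Out, Selector, Agent, Format string
	Recurse                      bool
	Target                       string // positional URL|FILE|DIR argument
}

// parseArgs declares the flags with their documented defaults and
// parses the given argument list.
func parseArgs(args []string) (options, error) {
	fs := flag.NewFlagSet("goodies", flag.ContinueOnError)
	o := options{}
	fs.StringVar(&o.Out, "o", "", "output file path (default: stdout)")
	fs.BoolVar(&o.Recurse, "r", false, "recursively process directories")
	fs.StringVar(&o.Selector, "s", "", "CSS selector to target")
	fs.StringVar(&o.Agent, "a", "Mozilla/5.0 (Compatible; Goodies/1.0)", "User Agent string")
	fs.StringVar(&o.Format, "f", "complete", "output format: complete, html, text, json, raw, md")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	o.Target = fs.Arg(0) // first positional argument
	return o, nil
}

func main() {
	o, err := parseArgs([]string{"-s", "article", "-f", "md", "https://example.com"})
	if err != nil {
		panic(err)
	}
	fmt.Println(o.Format, o.Selector, o.Target)
}
```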
Scrape a webpage and output complete HTML (default):

```
goodies https://example.com
```

Scrape with a custom user agent:

```
goodies -a "MyBot/1.0" https://example.com
```

Extract only the article content:

```
goodies -s "article" https://news-site.com/article-123
```

Extract content from a specific div:

```
goodies -s "#main-content" https://blog.example.com/post
```

JSON output (full structured data):

```
goodies -f json https://example.com
```

Plain text output:

```
goodies -f text https://example.com
```

Raw content only:

```
goodies -f raw https://example.com
```

Original HTML:

```
goodies -f html https://example.com
```

Complete HTML (with inlined resources):

```
goodies -f complete https://example.com
```

Scrape and convert to Markdown:

```
goodies -f md https://example.com/blog/post
```

Scrape specific content and convert to Markdown:

```
goodies -s ".post-content" -f md https://example.com/blog/post
```

Save output to file:

```
goodies -f md https://example.com -o article.md
```

Convert a local HTML file to Markdown:

```
goodies file.html -f md
```

Convert with selector targeting:

```
goodies -s "#content" -f md local-file.html
```

Save to a specific output file:

```
goodies file.html -f md -o converted.md
```

Recursively convert all HTML files in a directory:

```
goodies -r ./docs
```

This will find all `.html` and `.htm` files in the `./docs` directory and create corresponding `.md` files.
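The directory walk described above can be sketched with Go's standard library; the `mdPath` helper and the file layout here are illustrative assumptions, not Goodies' internals:

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// mdPath maps an HTML file path to its Markdown counterpart
// (a hypothetical helper mirroring the behavior described above).
func mdPath(p string) string {
	return strings.TrimSuffix(p, filepath.Ext(p)) + ".md"
}

func main() {
	// Build a small throwaway directory tree to walk.
	root, err := os.MkdirTemp("", "docs")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)
	os.WriteFile(filepath.Join(root, "a.html"), []byte("<p>hi</p>"), 0o644)
	os.MkdirAll(filepath.Join(root, "sub"), 0o755)
	os.WriteFile(filepath.Join(root, "sub", "b.htm"), []byte("<p>hi</p>"), 0o644)

	// Collect every .html/.htm file and its target .md path.
	var pairs []string
	filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		ext := strings.ToLower(filepath.Ext(p))
		if !d.IsDir() && (ext == ".html" || ext == ".htm") {
			rel, _ := filepath.Rel(root, p)
			pairs = append(pairs, rel+" -> "+mdPath(rel))
		}
		return nil
	})
	sort.Strings(pairs)
	for _, pr := range pairs {
		fmt.Println(pr) // e.g. a.html -> a.md
	}
}
```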
Pipe HTML content from another command:

```
curl -s https://example.com | goodies -f md
```

Process an HTML file via stdin:

```
cat input.html | goodies -f md -o output.md
```

Scrape multiple pages (programmatically, by modifying the URLs list): the current implementation supports multiple URLs in its configuration, though the CLI accepts a single input.
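Since the CLI takes a single input, a shell loop is a simple workaround for multiple pages. This is a sketch: the URL list and the output-naming scheme are made up for illustration (the `echo` just prints the commands; drop it to actually run them):

```shell
for url in https://example.com/post-1 https://example.com/post-2; do
  out="$(basename "$url").md"          # post-1.md, post-2.md
  echo goodies -f md "$url" -o "$out"  # remove 'echo' to execute
done
```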
Extract all links from a page:

```
goodies -f json https://example.com | jq '.Links'
```

Extract images from a page:

```
goodies -f json https://example.com | jq '.Images'
```

The `json` format returns a structured JSON object containing:
- URL, title, status code
- Extracted content
- All links and images found
- HTML structure (head, body, full)
- CSS and JavaScript references
- Timestamp
- `text`: Human-readable formatted text with section headers and extracted content.
- `raw`: Only the text content extracted from the targeted selector.
- `html`: The original HTML of the targeted element or full page.
- `complete`: A complete HTML document with external CSS and JavaScript resources inlined, making it self-contained.
- `md`: Markdown converted from the HTML content using the html-to-markdown library with CommonMark extensions.
While the CLI provides basic options, you can customize the scraper further by modifying the `GollyArgs` structure in code:

```go
config := &GollyArgs{
	URLs:           []string{"https://example.com"},
	UserAgent:      "CustomBot/1.0",
	Headers:        map[string]string{"Authorization": "Bearer token"},
	Delay:          2 * time.Second,
	Parallelism:    5,
	TargetSelector: ".content",
	OutputFormat:   "md",
	EnableDebug:    true,
	AllowedDomains: []string{"example.com"},
}
```

- Failed scrapes are logged but don't stop processing of other URLs
- Missing selectors generate warnings but continue processing
- Invalid URLs or network errors are reported with details
- File system errors in batch mode are reported per file
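The "log and continue" behavior described above boils down to a loop like this. It's a sketch: `scrape` here is a hypothetical stand-in for a single-URL fetch, not a function from Goodies:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// scrape is a hypothetical stand-in for fetching one URL; it fails
// for inputs that don't look like HTTP URLs.
func scrape(url string) (string, error) {
	if !strings.HasPrefix(url, "http") {
		return "", errors.New("invalid URL")
	}
	return "<html>ok</html>", nil
}

func main() {
	urls := []string{"https://example.com", "not-a-url", "https://example.org"}
	ok := 0
	for _, u := range urls {
		body, err := scrape(u)
		if err != nil {
			// A failed scrape is logged but does not stop the others.
			fmt.Printf("warn: scrape %s failed: %v\n", u, err)
			continue
		}
		_ = body // a real run would convert/save the body here
		ok++
	}
	fmt.Println("succeeded:", ok)
}
```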
- Use appropriate delays when scraping multiple pages to avoid overloading servers
- Limit parallelism for sensitive targets
- Use selectors to extract only needed content, reducing memory usage
- For batch processing, ensure sufficient system resources for large directories
- JavaScript-rendered content cannot be scraped (static HTML only)
- Rate limiting is basic; respect websites' robots.txt manually
- Very large pages may cause memory issues
- Complex CSS selectors might not work as expected with the HTML-to-Markdown conversion
- `github.com/gocolly/colly/v2` - Web scraping framework
- `github.com/PuerkitoBio/goquery` - jQuery-like HTML parsing
- `github.com/JohannesKaufmann/html-to-markdown/v2` - HTML to Markdown conversion
```
# Build
go build -o goodies

# Run tests (if available)
go test ./...

# Cross-compile for different platforms
GOOS=linux GOARCH=amd64 go build -o goodies-linux
GOOS=windows GOARCH=amd64 go build -o goodies.exe
```

- Fork the repository
- Create a feature branch
- Make changes with appropriate tests
- Submit a pull request
[Specify license here]
For issues and feature requests, please use the issue tracker on the repository.
Goodies combines the power of Go's concurrency with robust scraping and conversion libraries, making it an ideal tool for content migration, archiving, and web data extraction tasks.