# Flash

Flash is a fast and lightweight web crawler written in Go. It recursively visits web pages, extracts links, and can be configured for SEO auditing, content discovery, or data collection.

## Features
- ⚡ Fast Concurrent Crawling: Efficiently processes multiple pages
- 🌐 Recursive URL Discovery: Automatically follows links to explore websites
- 🧩 Simple & Modular: Easy to extend with your own functionality
- 📊 URL Normalization: Handles various URL formats consistently
- 🐳 Docker Support: Ready for containerized deployments
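Concurrent crawling typically means goroutines that fetch pages in parallel while sharing a deduplicated visited set. The sketch below is illustrative only; the `crawler` type, its `fetch` hook, and the method names are hypothetical and not Flash's actual API:

```go
package main

// Hedged sketch of concurrent crawling: one goroutine per newly
// discovered link, with a mutex-protected map preventing revisits.
// All names here are illustrative, not taken from Flash's source.

import (
	"fmt"
	"sort"
	"sync"
)

type crawler struct {
	mu      sync.Mutex
	visited map[string]bool
	wg      sync.WaitGroup
	fetch   func(string) []string // returns links found on a page (injected for testability)
}

// visit marks a URL as seen exactly once; safe for concurrent use.
func (c *crawler) visit(u string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.visited[u] {
		return false
	}
	c.visited[u] = true
	return true
}

// crawlPage fetches one page and spawns a goroutine per unseen link.
func (c *crawler) crawlPage(u string) {
	defer c.wg.Done()
	for _, link := range c.fetch(u) {
		if c.visit(link) {
			c.wg.Add(1)
			go c.crawlPage(link)
		}
	}
}

func main() {
	// Fake three-page site with a cycle back to the seed.
	pages := map[string][]string{"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
	c := &crawler{visited: map[string]bool{}, fetch: func(u string) []string { return pages[u] }}
	c.visit("a")
	c.wg.Add(1)
	go c.crawlPage("a")
	c.wg.Wait()

	var got []string
	for u := range c.visited {
		got = append(got, u)
	}
	sort.Strings(got)
	fmt.Println(got) // prints [a b c]
}
```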
## Installation

```bash
# Clone the repository
git clone https://github.com/sudonitj/Flash.git
cd Flash

# Install dependencies
go mod tidy

# Build the application
go build -o flash

# Build the Docker image
docker build -t flash-crawler .
```
## Usage

```bash
# Run the compiled binary
./flash https://example.com

# Or use go run
go run main.go https://example.com

# Run with Docker
docker run --rm flash-crawler https://example.com
```
## Project Structure

```
flash/
├── main.go                   # Entry point
├── crawler/                  # Core crawler package
│   ├── crawler.go            # Crawling logic
│   ├── get_urls.go           # HTML parsing for URLs
│   ├── normalize_url.go      # URL normalization
│   ├── get_url_test.go       # Tests for URL extraction
│   └── normalize_url_test.go # Tests for normalization
├── go.mod                    # Module definition
├── go.sum                    # Dependency checksums
├── Dockerfile                # Docker configuration
└── README.md                 # This file
```
## How It Works

1. The crawler starts from a given seed URL.
2. It visits the page and extracts all links from the HTML.
3. Each link is normalized so equivalent URLs are not crawled twice.
4. Every discovered URL is added to the queue if it has not already been visited.
5. The process repeats until no URLs are left to visit.

Flash uses the standard Go library for HTTP requests and HTML parsing, making it lightweight with minimal dependencies.
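The loop described above can be sketched roughly as follows. `normalizeURL` and `crawl` here are simplified stand-ins for the real functions in the `crawler` package, and fetching is injected as a function so the sketch runs without touching the network:

```go
package main

// Simplified sketch of the crawl loop: a queue of URLs, a visited set
// keyed by normalized form, processed until the queue drains.
// Not Flash's actual implementation; names are illustrative.

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL lowercases the host and drops the scheme and any
// trailing slash so equivalent URLs compare equal.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	return strings.ToLower(u.Host) + strings.TrimSuffix(u.Path, "/"), nil
}

// crawl walks pages breadth-first, returning URLs in visit order.
// fetch returns the links found on a page.
func crawl(seed string, fetch func(string) []string) []string {
	visited := map[string]bool{}
	queue := []string{seed}
	var order []string
	for len(queue) > 0 {
		raw := queue[0]
		queue = queue[1:]
		key, err := normalizeURL(raw)
		if err != nil || visited[key] {
			continue // skip malformed or already-seen URLs
		}
		visited[key] = true
		order = append(order, raw)
		queue = append(queue, fetch(raw)...)
	}
	return order
}

func main() {
	// Fake site: the seed links to two pages; one links back to the seed.
	pages := map[string][]string{
		"https://example.com":   {"https://example.com/a", "https://example.com/b"},
		"https://example.com/a": {"https://example.com"},
	}
	for _, u := range crawl("https://example.com", func(u string) []string { return pages[u] }) {
		fmt.Println(u)
	}
}
```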
## Testing

```bash
# Run all tests
go test ./...

# Run tests for a specific package
go test ./crawler
```
## Extending Flash

Flash is designed to be modular. You can extend its functionality by:
- Adding more analysis during crawling
- Implementing depth limits
- Adding domain filtering
- Implementing rate limiting
- Adding data extraction capabilities
## Use Cases

- SEO Auditing: Find broken links and analyze site structure
- Content Discovery: Map out all accessible pages on a website
- Data Collection: Extract specific types of content from pages
- Site Monitoring: Track changes to pages over time
## Responsible Crawling

When using Flash, please:
- Respect website terms of service
- Consider adding rate limiting to avoid overloading servers
- Add support for robots.txt to respect crawling directives
- Only crawl sites you have permission to access
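Robots.txt support could start from something as small as the sketch below, which only handles `User-agent: *` and `Disallow:` prefix rules; a production crawler should use a complete robots.txt parser instead:

```go
package main

// Minimal, hedged robots.txt sketch: collect Disallow prefixes that
// apply to the wildcard user-agent, then reject matching paths.
// Real robots.txt has more directives (Allow, wildcards, Crawl-delay)
// that this deliberately ignores.

import (
	"bufio"
	"fmt"
	"strings"
)

// disallowedPaths extracts Disallow prefixes under "User-agent: *".
func disallowedPaths(robotsTxt string) []string {
	var paths []string
	applies := false
	sc := bufio.NewScanner(strings.NewReader(robotsTxt))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case strings.HasPrefix(line, "User-agent:"):
			applies = strings.TrimSpace(strings.TrimPrefix(line, "User-agent:")) == "*"
		case applies && strings.HasPrefix(line, "Disallow:"):
			if p := strings.TrimSpace(strings.TrimPrefix(line, "Disallow:")); p != "" {
				paths = append(paths, p)
			}
		}
	}
	return paths
}

// allowed reports whether path avoids every disallowed prefix.
func allowed(path string, disallowed []string) bool {
	for _, p := range disallowed {
		if strings.HasPrefix(path, p) {
			return false
		}
	}
	return true
}

func main() {
	d := disallowedPaths("User-agent: *\nDisallow: /private/\n")
	fmt.Println(allowed("/public/page", d))  // true
	fmt.Println(allowed("/private/page", d)) // false
}
```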
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Go community for excellent HTTP and HTML parsing libraries
- All contributors who help improve Flash