"Just a simple tool that finds links so you don't have to."
This web crawler scans websites and maps out all their links. It sticks to the domain you give it (because jumping between sites would be rude). I made this to learn about web scraping and to save myself some time when exploring site structures.
- Node.js - For handling asynchronous operations and running the crawler
- JSDOM - Parses HTML without needing a browser
- Jest - Makes sure my code actually works before I break the internet
- URL API - Way better than trying to write regex for URLs (trust me, I tried)
```bash
# Get the code
git clone https://github.com/JohnRaivenOlazo/web-crawler.git

# Install what it needs
npm install

# Run it on any website
npm start https://example.com
```
- Visits the website you specify
- Finds all links that stay on the same domain
- Keeps track of how many times each link appears
- Shows you a list sorted by popularity
- Saves you hours of manual clicking and tracking
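The "sorted by popularity" part boils down to something like this sketch (the function name is assumed, not necessarily what the project uses): turn the count map into pairs and sort descending by count.

```javascript
// Sketch of the report step: sort [url, count] pairs so the
// most-linked pages print first.
function sortPages(counts) {
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// sortPages(new Map([["raiven.com/about", 1], ["raiven.com", 5]]))
// → [["raiven.com", 5], ["raiven.com/about", 1]]
```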
```bash
npm test
```

The tests make sure, for example, that `https://raiven.com/path/` and `https://raiven.com/path` are treated as the same URL (because trailing slashes are annoying).
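A minimal version of that normalization could look like the sketch below (the project's actual implementation may differ):

```javascript
// Normalize a URL so equivalent links count as the same page.
function normalizeURL(urlString) {
  const url = new URL(urlString);
  // hostname + pathname drops the protocol, query, and fragment;
  // stripping the trailing slash makes /path/ and /path compare equal.
  return (url.hostname + url.pathname).replace(/\/$/, "");
}

// normalizeURL("https://raiven.com/path/") → "raiven.com/path"
// normalizeURL("https://raiven.com/path")  → "raiven.com/path"
```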
1. Clean up URLs so they're consistent
2. Visit pages and collect their links
3. Sort everything and show results
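Those three steps fit in one small loop. This is a hedged sketch with assumed names (`crawl`, `getLinks`, `normalize`), not the project's exact code; the link fetching is injected as a function so the loop itself has no network dependency, where the real crawler would `await fetch` and parse the HTML with JSDOM.

```javascript
// Sketch of the crawl pipeline: normalize → visit same-domain pages → sort.
function crawl(baseURL, getLinks) {
  const base = new URL(baseURL);
  const counts = new Map(); // normalized URL -> times seen
  const queue = [base.href];

  while (queue.length > 0) {
    const current = new URL(queue.shift());
    if (current.hostname !== base.hostname) continue; // stay on one domain
    const key = normalize(current); // step 1: consistent URLs
    counts.set(key, (counts.get(key) ?? 0) + 1);
    if (counts.get(key) > 1) continue; // already crawled this page
    for (const link of getLinks(current.href)) queue.push(link); // step 2
  }

  // step 3: most-linked pages first
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

function normalize(url) {
  return (url.hostname + url.pathname).replace(/\/$/, "");
}
```

Injecting `getLinks` keeps the loop testable with a hard-coded page graph, which is roughly what the Jest tests exploit.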