- About
- Features
- Supported Job Boards
- Prerequisites
- Installation
- Configuration
- Usage
- Project Structure
- Database Schema
- Troubleshooting & Common Issues
- Future Enhancements
- Contributing
- License
- Contact
## About

Remote Job Aggregator is a robust Node.js-based web scraping and job notification system that automates the discovery of remote job postings. It uses Puppeteer for headless browser automation and Mongoose for storing jobs in a MongoDB database.
The system:
- Prevents duplicate entries.
- Sends automated email notifications using Nodemailer.
- Exposes an Express.js API for fetching stored jobs and manually triggering scrapes.
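Duplicate prevention boils down to keeping only jobs whose URL has not been stored before. In the actual system this is enforced at the database level; the helper below is a simplified, hypothetical illustration of the same idea:

```javascript
// Simplified sketch of duplicate prevention: keep only jobs whose
// URL has not been stored yet. The real system enforces this with a
// unique index on `url` in MongoDB; this helper is illustrative.
function filterNewJobs(scrapedJobs, existingUrls) {
  const seen = new Set(existingUrls);
  const fresh = [];
  for (const job of scrapedJobs) {
    if (!seen.has(job.url)) {
      seen.add(job.url); // also dedupes within the scraped batch itself
      fresh.push(job);
    }
  }
  return fresh;
}
```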
## Features

- Multi-Source Scraping: Pulls jobs from top remote job boards.
- Persistent Storage: Saves jobs in MongoDB with schema validation.
- Duplicate Prevention: Ensures only unique jobs are saved.
- Automated Notifications: Sends grouped email alerts for new jobs.
- API Endpoints:
  - `GET /api/jobs` → Fetch all jobs.
  - `GET /api/jobs/scrape` → Trigger a manual scrape.
- Scheduled Automation: Cron-based periodic scraping and notifications.
- Anti-Bot Evasion: Supports `puppeteer-extra` with stealth plugins.
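In practice, the anti-bot evasion amounts to launching the browser through `puppeteer-extra` with the stealth plugin instead of plain Puppeteer. A minimal configuration sketch (assuming the `puppeteer-extra` and `puppeteer-extra-plugin-stealth` packages are installed; `PROXY_URL` is the optional variable from the Configuration section):

```javascript
// Configuration sketch: launch Puppeteer through puppeteer-extra
// with the stealth plugin, optionally routing through a proxy.
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

async function launchBrowser() {
  const args = ['--no-sandbox'];
  if (process.env.PROXY_URL) {
    // Note: proxies with username/password additionally require
    // page.authenticate() on each page.
    args.push(`--proxy-server=${process.env.PROXY_URL}`);
  }
  return puppeteer.launch({ headless: true, args });
}
```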
## Supported Job Boards

- Indeed.com
- WeWorkRemotely.com
- RemoteOK.com
- Remote.co
## Prerequisites

- Node.js v18.x or later
- npm v8.x or Yarn
- MongoDB (local or MongoDB Atlas)
- Gmail account (or another SMTP provider)
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/Umoru98/remote-job-aggregator.git
   cd remote-job-aggregator
   ```

2. Install dependencies:

   ```bash
   npm install
   # or
   yarn install
   ```

- Chromium Installation: Puppeteer automatically downloads Chromium. If issues occur, ensure a working Chromium/Chrome binary is available on your system.
## Configuration

Create a `.env` file in the root directory:

```env
# Server Port
PORT=5000

# MongoDB Connection
MONGO_URI=mongodb+srv://[YOUR_USERNAME]:[YOUR_PASSWORD]@cluster0.mongodb.net/?retryWrites=true&w=majority

# Email Configuration
EMAIL_USER=your-email@gmail.com
EMAIL_PASS=your-email-app-password
NOTIFY_EMAIL=recipient-email@example.com

# Cron Job Schedule (Recommended: Every 6 hours)
CRON_SCHEDULE=0 */6 * * *

# Proxy (Optional for sites with aggressive anti-bot detection)
# PROXY_URL=http://username:password@proxy.example.com:port
```

## Usage

Start the server:

```bash
npm start
# or
node server.js
```

The server will run on the PORT defined in your `.env` file (default: 5000).
To trigger a manual scrape, access this endpoint in your browser or via Postman:

```
GET http://localhost:5000/api/jobs/scrape
```

The scraper also runs automatically based on the `CRON_SCHEDULE` in your `.env` file. Example schedules:
- `*/3 * * * *` → Every 3 minutes (not recommended)
- `0 */6 * * *` → Every 6 hours
- `0 9 * * *` → Every day at 9:00 AM
Note: Use longer intervals to reduce the risk of IP bans, especially on sites like Indeed.
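A cron expression has five fields: minute, hour, day of month, month, and day of week. The project delegates scheduling to a library, but as an illustration of how the examples above are interpreted, here is a hypothetical checker for just the minute and hour fields:

```javascript
// Illustrative check of the first two cron fields (minute, hour).
// Real scheduling is handled by a library such as node-cron; this
// sketch only demonstrates how `*`, `*/n`, and plain numbers match.
function fieldMatches(field, value) {
  if (field === '*') return true;
  if (field.startsWith('*/')) return value % Number(field.slice(2)) === 0;
  return Number(field) === value;
}

function cronMatches(expr, date) {
  const [min, hour] = expr.split(' ');
  return fieldMatches(min, date.getMinutes()) && fieldMatches(hour, date.getHours());
}
```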
## Database Schema

Jobs are stored with the following fields:

- `title` → Job title
- `company` → Company name
- `url` → Job link (unique)
- `source` → Job board source
- `dateScraped` → Timestamp
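A Mongoose schema matching these fields might look like the following sketch (field names from the list above; the exact validation rules are an assumption). The unique index on `url` is what backs duplicate prevention at the database level:

```javascript
const mongoose = require('mongoose');

// Sketch of a schema for the fields listed above. The `unique: true`
// option on `url` creates the index that rejects duplicate job links.
const jobSchema = new mongoose.Schema({
  title: { type: String, required: true },
  company: { type: String, required: true },
  url: { type: String, required: true, unique: true },
  source: { type: String, required: true },
  dateScraped: { type: Date, default: Date.now },
});

module.exports = mongoose.model('Job', jobSchema);
```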
## Troubleshooting & Common Issues

### IP Bans & Anti-Bot Detection

Symptoms:

- Timeout errors
- "Checking your browser..." messages
- CAPTCHA challenges

Fixes:

- Reduce scraping frequency.
- Use paid residential proxies.
- Keep Puppeteer and user-agent strings updated.
### Outdated Selectors

Symptoms:

- Zero jobs scraped
- Incorrect job titles or URLs

Fixes:

- Set `headless: false` in `puppeteerHelper.js` to debug.
- Inspect the site's new DOM structure.
- Update the selectors in the respective scraper file.
### Email Notification Failures

Symptoms:

- Email errors in console
- No emails received

Fixes:

- Ensure correct email credentials in `.env`.
- Use an App Password for Gmail (not your account password).
- Verify the recipient email address.
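For a quick credentials check, the email flow with these variables typically amounts to a Nodemailer transport like the following sketch (a configuration illustration, not the project's exact code):

```javascript
const nodemailer = require('nodemailer');

// Sketch of a Gmail transport using the .env variables. For Gmail,
// EMAIL_PASS must be an App Password, not the account password.
const transporter = nodemailer.createTransport({
  service: 'gmail',
  auth: {
    user: process.env.EMAIL_USER,
    pass: process.env.EMAIL_PASS,
  },
});

// transporter.verify() checks the SMTP connection and credentials
// without sending mail, which is handy for debugging on startup.
transporter.verify()
  .then(() => console.log('SMTP credentials OK'))
  .catch((err) => console.error('SMTP check failed:', err.message));
```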
### Resource Leaks

Symptoms:

- Puppeteer processes hanging
- High CPU or memory usage

Fixes:

- Use proper `page.close()` and `browser.close()` calls (ideally in `finally` blocks).
- Limit concurrent scrapes.
- Consider running headless browsers in Docker for better control.
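Limiting concurrent scrapes needs no extra dependency. A hypothetical helper that runs at most `limit` tasks at a time (a plain-JS sketch of the same pattern libraries like `p-limit` provide):

```javascript
// Run an async worker over items with at most `limit` in flight at
// once; results come back in the original item order. Pure-JS sketch
// of the "limit concurrent scrapes" advice above.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    while (next < items.length) {
      const i = next++; // safe: no await between read and increment
      results[i] = await worker(items[i]);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, run);
  await Promise.all(runners);
  return results;
}
```

In the scraper, `items` would be the list of job boards and `worker` a function that opens a page, scrapes it, and closes the page in a `finally` block.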
## Future Enhancements

- Add more job boards.
- Keyword and company-based filtering.
- Slack/Discord webhook support.
- API authentication (API keys or JWT).
- Docker containerization.
- Advanced duplicate detection across multiple sources.
- Detailed job page scraping.
## Contributing

Contributions are welcome! Feel free to:

- Fork this repository
- Create a new branch
- Submit a pull request
## License

This project is licensed under the MIT License.
## Contact

For inquiries, contact: umoruvictor98@gmail.com