Skip to content

Umoru98/remote-job-aggregator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

2 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸš€ Remote Job Aggregator

Table of Contents


πŸ“„ About

Remote Job Aggregator is a robust Node.js-based web scraping and job notification system that automates the discovery of remote job postings. It uses Puppeteer for headless browser automation and Mongoose for storing jobs in a MongoDB database.

The system:

  • Prevents duplicate entries.
  • Sends automated email notifications using Nodemailer.
  • Exposes an Express.js API for fetching stored jobs and manually triggering scrapes.

✨ Features

  • Multi-Source Scraping: Pulls jobs from top remote job boards.
  • Persistent Storage: Saves jobs in MongoDB with schema validation.
  • Duplicate Prevention: Ensures only unique jobs are saved.
  • Automated Notifications: Sends grouped email alerts for new jobs.
  • API Endpoints:
    • GET /api/jobs – Fetch all jobs.
    • GET /api/jobs/scrape – Trigger a manual scrape.
  • Scheduled Automation: Cron-based periodic scraping and notifications.
  • Anti-Bot Evasion: Supports puppeteer-extra with stealth plugins.

🌐 Supported Job Boards

  • Indeed.com
  • WeWorkRemotely.com
  • RemoteOK.com
  • Remote.co

βš™οΈ Prerequisites

  • Node.js v18.x or later
  • npm v8.x or Yarn
  • MongoDB (local or MongoDB Atlas)
  • Gmail account (or another SMTP provider)

πŸ“¦ Installation

  1. Clone the repository:

    git clone https://github.com/Umoru98/remote-job-aggregator.git
    cd remote-job-aggregator
  2. Install dependencies:

npm install
# or
yarn install
  1. Chromium Installation: Puppeteer automatically downloads Chromium. If issues occur, ensure a working Chromium/Chrome is available.

βš™οΈ Configuration

Create a .env file in the root directory:

# Server Port
PORT=5000

# MongoDB Connection
MONGO_URI=mongodb+srv://[YOUR_USERNAME]:[YOUR_PASSWORD]@cluster0.mongodb.net/?retryWrites=true&w=majority

# Email Configuration
EMAIL_USER=your-email@gmail.com
EMAIL_PASS=your-email-app-password
NOTIFY_EMAIL=recipient-email@example.com

# Cron Job Schedule (Recommended: Every 6 hours)
CRON_SCHEDULE=0 */6 * * *

# Proxy (Optional for sites with aggressive anti-bot detection)
# PROXY_URL=http://username:password@proxy.example.com:port

πŸš€ Usage

Starting the API Server

npm start
# or
node server.js

The server will run on the PORT defined in your .env file (default: 5000).

Triggering Scrapes Manually

Access this endpoint in your browser or via Postman:

GET http://localhost:5000/api/jobs/scrape

Scheduled Scraping (Cron Jobs)

The scraper runs automatically based on the CRON_SCHEDULE in your .env file.

Cron Schedule Examples:

*/3 * * * * – Every 3 minutes (Not Recommended)

0 */6 * * * – Every 6 hours

0 9 * * * – Every day at 9:00 AM

Note: Use longer intervals to reduce the risk of IP bans, especially on sites like Indeed.


Database Schema

Jobs are stored with the following fields:

  • title – Job title
  • company – Company name
  • url – Job link (unique)
  • source – Job board source
  • dateScraped – Timestamp

Troubleshooting & Common Issues

Cloudflare/Anti-Bot Detections

Symptoms:

  • Timeout errors

  • "Checking your browser..." messages

  • CAPTCHA challenges

Solutions:

  • Reduce scraping frequency.

  • Use paid residential proxies.

  • Keep Puppeteer and user-agent strings updated.

Scraper Breakage (Selector Changes)

Symptoms:

  • Zero jobs scraped

  • Incorrect job titles or URLs

Solutions:

  • Set headless: false in puppeteerHelper.js to debug.

  • Inspect the site’s new DOM structure.

  • Update the selectors in the respective scraper file.

Email Sending Issues

Symptoms:

  • Email errors in console

  • No emails received

Solutions:

  • Ensure correct email credentials in .env.

  • Use an App Password for Gmail (not your account password).

  • Verify the recipient email address.

High Resource Usage

Symptoms:

  • Puppeteer processes hanging

  • High CPU or memory usage

Solutions:

  • Use proper page.close() and browser.close() calls.

  • Limit concurrent scrapes.

  • Consider headless browsers in Docker for better control.


Future Enhancements

  • Add more job boards.

  • Keyword and company-based filtering.

  • Slack/Discord webhook support.

  • API authentication (API keys or JWT).

  • Docker containerization.

  • Advanced duplicate detection across multiple sources.

  • Detailed job page scraping.


Contributing

Contributions are welcome! Feel free to:

  • Fork this repository

  • Create a new branch

  • Submit a pull request


License

This project is licensed under the MIT License.


Contact

For inquiries, contact: umoruvictor98@gmail.com

About

A Node.js-powered web scraping application that aggregates remote job postings from multiple online job platforms. Features automated data collection, MongoDB storage, Express.js API, email alerts, and scheduled updates via cron jobs.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors