
# NTHU-Data-Scraper

NTHU-Data-Scraper is a project for NTHU developers. It scrapes data from the official NTHU website with GitHub Actions and serves the results through our website.


## Features

### Available Spiders

- `nthu_announcements_list`: Crawls and maintains a list of announcement pages
- `nthu_announcements_item`: Updates content from the announcement list
- `nthu_buses`: Scrapes campus bus schedules (supports the new Nanda bus routes)
- `nthu_courses`: Fetches course information
- `nthu_dining`: Retrieves dining hall data
- `nthu_directory`: Downloads the department directory
- `nthu_maps`: Gets campus map data
- `nthu_newsletters`: Collects newsletter information

### Recent Improvements

- ✅ Refactored project structure with common utility modules
- ✅ Split the announcements spider into list and item crawlers
- ✅ Added support for the new Nanda bus route format
- ✅ Unified JSON file operations across all spiders
- ✅ Improved error handling and logging
- ✅ Self-hosted runners now require a manual trigger
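The unified JSON file operations could look roughly like the sketch below. The module path `nthu_scraper/utils/file_utils.py` comes from the project structure, but the function names `save_json` and `load_json` are assumptions for illustration, not the repository's actual API.

```python
# Hypothetical sketch of nthu_scraper/utils/file_utils.py -- function names
# are assumptions; only the module path appears in this README.
import json
from pathlib import Path


def save_json(data, path):
    """Write `data` to `path` as UTF-8 JSON, creating parent directories."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def load_json(path, default=None):
    """Read JSON from `path`, returning `default` if the file is missing."""
    path = Path(path)
    if not path.exists():
        return default
    return json.loads(path.read_text(encoding="utf-8"))
```

Centralizing reads and writes like this keeps encoding (`ensure_ascii=False` matters for Chinese text) and directory creation consistent across all spiders.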

## Usage

### Running Spiders Locally

```shell
# Install dependencies
pip install -r requirements.txt

# Run a single spider
python -m scrapy crawl nthu_buses

# For announcements, run the list spider first, then the item spider
python -m scrapy crawl nthu_announcements_list
python -m scrapy crawl nthu_announcements_item
```

### GitHub Actions

The workflow runs automatically on:

- Pushes to the `main` branch
- A schedule (every 2 hours)
- Manual triggers via `workflow_dispatch`

Self-hosted crawlers (directory, maps, newsletters) run only when triggered manually with `run_self_hosted` set to `true`.
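A minimal sketch of how the triggers above might be declared in `.github/workflows/update_data.yml` (illustrative only; the actual workflow file is not shown in this README):

```yaml
# Illustrative fragment -- the real update_data.yml may differ.
on:
  push:
    branches: [main]
  schedule:
    - cron: "0 */2 * * *"   # every 2 hours
  workflow_dispatch:
    inputs:
      run_self_hosted:
        description: "Run self-hosted crawlers (directory, maps, newsletters)"
        type: boolean
        default: false
```

A job can then gate on the input, e.g. `if: ${{ inputs.run_self_hosted }}`, so the self-hosted crawlers stay idle on pushes and scheduled runs.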

## Project Structure

```
NTHU-Data-Scraper/
├── nthu_scraper/
│   ├── spiders/          # Spider implementations
│   ├── utils/            # Common utilities
│   │   ├── constants.py  # Global constants
│   │   ├── file_utils.py # JSON file operations
│   │   └── url_utils.py  # URL processing utilities
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   └── settings.py
├── data/                 # Scraped data output
├── .github/
│   └── workflows/
│       └── update_data.yml
└── requirements.txt
```

## Credit

This project is maintained by NTHUSA 32nd.

## License

This project is licensed under the MIT License.

## Acknowledgements

Thanks to SonarCloud for providing code quality metrics.
