
# NTHU-Data-Scraper

NTHU-Data-Scraper is a project for NTHU developers. It scrapes data from the official NTHU website with GitHub Actions and serves the results through our website.


## Features

### Available Spiders

- `nthu_announcements_list`: Crawls and maintains a list of announcement pages
- `nthu_announcements_item`: Updates content from the announcement list
- `nthu_buses`: Scrapes campus bus schedules (supports the new Nanda bus routes)
- `nthu_courses`: Fetches course information
- `nthu_dining`: Retrieves dining hall data
- `nthu_directory`: Downloads the department directory
- `nthu_maps`: Gets campus map data
- `nthu_newsletters`: Collects newsletter information

### Recent Improvements

- ✅ Refactored project structure with common utility modules
- ✅ Split the announcements spider into list and item crawlers
- ✅ Added support for the new Nanda bus route format
- ✅ Unified JSON file operations across all spiders
- ✅ Improved error handling and logging
- ✅ Self-hosted runners now require a manual trigger
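The unified JSON file operations could look roughly like the sketch below. The module path `nthu_scraper/utils/file_utils.py` comes from the project structure, but the function names `save_json` and `load_json` are assumptions for illustration, not the repository's actual API.

```python
# Hypothetical sketch of nthu_scraper/utils/file_utils.py -- function names
# are assumptions; only the module path appears in this README.
import json
from pathlib import Path


def save_json(data, path):
    """Write `data` to `path` as UTF-8 JSON, creating parent directories."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def load_json(path, default=None):
    """Read JSON from `path`, returning `default` if the file is missing."""
    path = Path(path)
    if not path.exists():
        return default
    return json.loads(path.read_text(encoding="utf-8"))
```

Centralizing reads and writes like this keeps encoding (`ensure_ascii=False` matters for Chinese text) and directory creation consistent across all spiders.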

## Usage

### Running Spiders Locally

```shell
# Install dependencies
pip install -r requirements.txt

# Run a single spider
python -m scrapy crawl nthu_buses

# For announcements, run the list spider first, then the item spider
python -m scrapy crawl nthu_announcements_list
python -m scrapy crawl nthu_announcements_item
```

### GitHub Actions

The workflow runs automatically on:

- Pushes to the `main` branch
- A schedule (every 2 hours)
- Manual triggers via `workflow_dispatch`

Self-hosted crawlers (directory, maps, newsletters) run only when triggered manually with `run_self_hosted` set to `true`.
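A minimal sketch of how the triggers above might be declared in `.github/workflows/update_data.yml` (illustrative only; the actual workflow file is not shown in this README):

```yaml
# Illustrative fragment -- the real update_data.yml may differ.
on:
  push:
    branches: [main]
  schedule:
    - cron: "0 */2 * * *"   # every 2 hours
  workflow_dispatch:
    inputs:
      run_self_hosted:
        description: "Run self-hosted crawlers (directory, maps, newsletters)"
        type: boolean
        default: false
```

A job can then gate on the input, e.g. `if: ${{ inputs.run_self_hosted }}`, so the self-hosted crawlers stay idle on pushes and scheduled runs.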

## Project Structure

```
NTHU-Data-Scraper/
├── nthu_scraper/
│   ├── spiders/          # Spider implementations
│   ├── utils/            # Common utilities
│   │   ├── constants.py  # Global constants
│   │   ├── file_utils.py # JSON file operations
│   │   └── url_utils.py  # URL processing utilities
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   └── settings.py
├── data/                 # Scraped data output
├── .github/
│   └── workflows/
│       └── update_data.yml
└── requirements.txt
```

## Credit

This project is maintained by NTHUSA 32nd.

## License

This project is licensed under the MIT License.

## Acknowledgements

Thanks to SonarCloud for providing code quality metrics.
