NTHU-Data-Scraper is a project designed for NTHU developers.
We scrape data from the NTHU official website with GitHub Actions and deliver it through our website.
- nthu_announcements_list: Crawls and maintains a list of announcement pages
- nthu_announcements_item: Updates content from the announcement list
- nthu_buses: Scrapes campus bus schedules (supports new Nanda bus routes)
- nthu_courses: Fetches course information
- nthu_dining: Retrieves dining hall data
- nthu_directory: Downloads department directory
- nthu_maps: Gets campus map data
- nthu_newsletters: Collects newsletter information
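The announcement flow is two-stage: nthu_announcements_list discovers announcement pages and persists them, then nthu_announcements_item reads that list and updates each page's content. A minimal sketch of that handoff, with hypothetical file names and item fields (the real spiders define their own output paths and schemas):

```python
import json
from pathlib import Path

# Hypothetical paths for illustration only.
LIST_FILE = Path("data/announcements_list.json")

def save_announcement_list(entries, path=LIST_FILE):
    """Stage 1: the list spider persists the pages it discovered."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(entries, ensure_ascii=False, indent=2),
                    encoding="utf-8")

def load_pending_urls(path=LIST_FILE):
    """Stage 2: the item spider reads the list back and fetches each URL."""
    entries = json.loads(path.read_text(encoding="utf-8"))
    return [entry["url"] for entry in entries]
```

Running the spiders in that order (list first, then item) keeps the item crawler's input fresh.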
- ✅ Refactored project structure with common utility modules
- ✅ Split announcements spider into list and item crawlers
- ✅ Added support for new Nanda bus route format
- ✅ Unified JSON file operations across all spiders
- ✅ Improved error handling and logging
- ✅ Self-hosted runners now require manual trigger
# Install dependencies
pip install -r requirements.txt
# Run a single spider
python -m scrapy crawl nthu_buses
# For announcements, run list spider first, then item spider
python -m scrapy crawl nthu_announcements_list
python -m scrapy crawl nthu_announcements_item
The workflow runs automatically on:
- Push to main branch
- Scheduled every 2 hours
- Manual trigger via workflow_dispatch
Self-hosted crawlers (directory, maps, newsletters) only run when manually triggered with run_self_hosted set to true.
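Under these assumptions (the exact input and job names in update_data.yml may differ), the trigger section of the workflow might look like:

```yaml
on:
  push:
    branches: [main]
  schedule:
    - cron: "0 */2 * * *"   # every 2 hours
  workflow_dispatch:
    inputs:
      run_self_hosted:
        description: "Also run the self-hosted crawlers"
        type: boolean
        default: false
```

The self-hosted jobs can then be gated with a condition such as `if: ${{ github.event_name == 'workflow_dispatch' && inputs.run_self_hosted }}`, so they never run on pushes or on the schedule.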
NTHU-Data-Scraper/
├── nthu_scraper/
│ ├── spiders/ # Spider implementations
│ ├── utils/ # Common utilities
│ │ ├── constants.py # Global constants
│ │ ├── file_utils.py # JSON file operations
│ │ └── url_utils.py # URL processing utilities
│ ├── items.py
│ ├── middlewares.py
│ ├── pipelines.py
│ └── settings.py
├── data/ # Scraped data output
├── .github/
│ └── workflows/
│ └── update_data.yml
└── requirements.txt
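To illustrate the kind of helper that lives in utils/url_utils.py, here is a sketch of URL processing for scraped pages; the function names here are illustrative, not the module's actual API:

```python
from urllib.parse import urljoin, urlparse

def absolutize(base_url, href):
    """Resolve a possibly-relative href against the page it came from."""
    return urljoin(base_url, href)

def is_same_site(url, allowed_host="www.nthu.edu.tw"):
    """Keep only links that stay on the scraped host (hypothetical filter)."""
    return urlparse(url).netloc == allowed_host
```

Centralizing this in one module means every spider resolves and filters links the same way, which is the point of the common-utilities refactor listed above.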
This project is maintained by NTHUSA 32nd.
This project is licensed under the MIT License.
Thanks to SonarCloud for providing code quality metrics.