**apps/docs/src/content/docs/architecture/data-flow.md**
Understanding the flow of data is crucial to comprehending how AlbertPlus works.
The primary data pipeline is responsible for collecting, storing, and serving course and program information.
### Static Course & Program Data (Manual Trigger)
1. **Scraping (Cloudflare Worker)**
   - **Admin Trigger**: Admin users initiate scraping by calling Convex actions (`api.scraper.triggerMajorsScraping` or `api.scraper.triggerCoursesScraping`).
   - **Authenticated Request**: The Convex action makes a POST request to the scraper's HTTP endpoints (`/api/trigger-majors` or `/api/trigger-courses`) with the `CONVEX_API_KEY` in the `X-API-KEY` header.
   - **Data Extraction**: Each job in the queue is processed by the worker, which scrapes the detailed information for a specific course or program.
   - **Upsert to Backend**: The scraped data is sent back to the Convex backend via authenticated HTTP endpoints.
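The admin trigger described above can be sketched as a small helper. The endpoint paths and the `X-API-KEY` header come from this page; the helper's name and shape are hypothetical, not the real implementation:

```typescript
// Hypothetical sketch of the authenticated request a Convex action sends
// to the scraper worker. Only the paths and header name are from the docs.
type ScrapeTarget = "majors" | "courses";

interface TriggerRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
}

function buildTriggerRequest(
  scraperBaseUrl: string,
  target: ScrapeTarget,
  convexApiKey: string, // the CONVEX_API_KEY value
): TriggerRequest {
  return {
    // Resolves to /api/trigger-majors or /api/trigger-courses
    url: `${scraperBaseUrl}/api/trigger-${target}`,
    method: "POST",
    headers: { "X-API-KEY": convexApiKey },
  };
}
```

Inside a Convex action, this request would then be dispatched with `fetch(req.url, { method: req.method, headers: req.headers })`.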
### Course Offering Data (Scheduled Trigger)

- **Scheduled Trigger**: A cronjob runs at regular intervals (configured in `wrangler.jsonc`).
- **Config Check**: The worker reads app configuration from Convex to determine which terms to scrape (`is_scrape_current`, `is_scrape_next`, along with term/year information).
- **Albert Public Search**: For each enabled term, the worker scrapes Albert's public class search to discover all course offering URLs.
- **Job Queuing**: Each course offering URL is added to the queue as a `course-offering` job with metadata about the term and year.
- **Section Details**: Each job scrapes detailed information, including:
  - Class number, section, and status (open/closed/waitlist)
  - Instructor names and location
  - Meeting days, start time, and end time
  - Corequisite relationships
- **Batch Upsert**: Scraped course offerings are sent to Convex in batches via the `/api/courseOfferings/upsert` endpoint.
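The section details and batch upsert steps could be modeled roughly as follows. The field names are illustrative assumptions; the real schema lives in the Convex backend:

```typescript
// Illustrative shape of one scraped course offering (field names assumed,
// derived from the detail list above).
interface CourseOffering {
  classNumber: number;
  section: string;
  status: "open" | "closed" | "waitlist";
  instructors: string[];
  location: string;
  meetingDays: string[]; // e.g. ["Mon", "Wed"]
  startTime: string;     // e.g. "09:30"
  endTime: string;       // e.g. "10:45"
  corequisites: string[];
  term: string;
  year: number;
}

// Split scraped offerings into fixed-size batches; each batch would then be
// POSTed to the /api/courseOfferings/upsert endpoint.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```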
2. **Backend Processing (Convex)**
   - **Data Reception**: The Convex backend receives the scraped data from the Cloudflare Worker.
   - **Database Storage**: The data is upserted into the Convex database, ensuring that existing records are updated and new ones are created. This includes courses, programs, requirements, prerequisites, and course offerings.
   - **Real-time Updates**: Any clients connected to the Convex backend (such as the web app) will receive real-time updates as the new data is written to the database.
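The upsert behavior amounts to: match an incoming record against a natural key, then update the existing record or insert a new one. A minimal sketch of that decision logic, with Convex's actual `ctx.db` calls elided and all names hypothetical:

```typescript
// Decide whether an incoming record should update an existing row or insert
// a new one, keyed by a natural identifier (e.g. a course code).
// Hypothetical sketch of upsert semantics, not the real backend code.
type UpsertOp =
  | { op: "update"; id: string }
  | { op: "insert" };

function planUpsert(
  existingIdsByKey: Map<string, string>, // natural key -> stored record id
  incomingKey: string,
): UpsertOp {
  const id = existingIdsByKey.get(incomingKey);
  return id !== undefined ? { op: "update", id } : { op: "insert" };
}
```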
3. **Frontend Consumption**
   - **Data Fetching**: The Next.js web app and the browser extension query the Convex backend to fetch course, program, and course offering data.
   - **User Interface**: The data is then rendered in the user interface, allowing students to browse the course catalog, view program requirements, check real-time class availability, and build their schedules.
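The real-time class availability view could be driven by a simple filter over offering status. The status values come from the docs; the helper itself is illustrative:

```typescript
// Keep only the sections a student can still try to join: open or waitlisted.
// Illustrative UI helper, not taken from the actual web app.
type SectionStatus = "open" | "closed" | "waitlist";

interface SectionView {
  section: string;
  status: SectionStatus;
}

function joinableSections(sections: SectionView[]): string[] {
  return sections
    .filter((s) => s.status !== "closed")
    .map((s) => s.section);
}
```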
**apps/docs/src/content/docs/architecture/overview.md**
The AlbertPlus ecosystem is composed of several distinct yet interconnected applications:
- **Web Application**: A feature-rich Next.js application that serves as the primary user interface for course planning and schedule building.
- **Browser Extension**: A Chrome extension built with Plasmo that integrates with the native Albert website, providing a seamless user experience.
- **Web Scraper**: A Cloudflare Worker that scrapes course data from NYU's public-facing systems, including:
  - **Manual Scraping**: Admin-triggered scraping of static course catalog and program data from NYU bulletins
  - **Automated Scraping**: Scheduled cronjob that scrapes real-time course offerings (sections, availability, schedules) from Albert public search
- **Serverless Backend**: A Convex-powered backend that provides a real-time database, serverless functions, and authentication services.
- **Documentation Site**: An Astro and Starlight-based website that you are currently viewing, which serves as the central hub for all project documentation.
The scraping process is designed to be robust and resilient:

7. **Data Upsert**: The scraped data is then sent to the Convex backend via authenticated HTTP requests to be stored in the main database.
8. **Error Handling**: The system includes error logging and a retry mechanism for failed jobs.
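The retry mechanism in step 8 can be sketched as a small policy function. The attempt limit and field names below are assumptions for illustration, not taken from the source:

```typescript
// Decide what to do with a job that just failed: retry it until an assumed
// maximum attempt count is reached, then mark it permanently failed.
interface FailedJob {
  id: number;
  attempts: number; // attempts made so far, including the one that failed
}

const MAX_ATTEMPTS = 3; // assumed limit, chosen for illustration

function onJobFailure(job: FailedJob): "retry" | "fail" {
  return job.attempts < MAX_ATTEMPTS ? "retry" : "fail";
}
```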
### Automated Scraping (Course Offerings)
Course offerings (class sections with schedule details) are scraped automatically via a scheduled Cloudflare Worker cronjob:
1. **Scheduled Trigger**: The worker runs on a schedule defined in `wrangler.jsonc` to check for new course offerings.
2. **App Config Check**: The worker reads the following configuration from Convex:
   - `is_scrape_current` - Boolean flag to enable scraping the current term
   - `is_scrape_next` - Boolean flag to enable scraping the next term
   - `current_term` / `current_year` - Identifies the current academic term
   - `next_term` / `next_year` - Identifies the next academic term
3. **Discovery Jobs**: For each enabled term, the worker creates a `discover-course-offerings` job that scrapes Albert's public search to find all course offering URLs.
4. **Individual Jobs**: Each discovered course offering URL becomes a `course-offering` job in the queue.
5. **Data Processing**: The worker scrapes details such as class number, section, instructor, schedule, location, and enrollment status.
6. **Backend Sync**: Scraped course offerings are sent to Convex via the `/api/courseOfferings/upsert` endpoint in batches.
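Steps 1–2 above boil down to a pure config check. The flag and field names come from the docs; the helper itself is a hypothetical sketch:

```typescript
// App configuration read from Convex (field names from the docs above).
interface AppConfig {
  is_scrape_current: boolean;
  is_scrape_next: boolean;
  current_term: string;
  current_year: number;
  next_term: string;
  next_year: number;
}

// Decide which (term, year) pairs should get a discover-course-offerings job.
function termsToScrape(cfg: AppConfig): { term: string; year: number }[] {
  const terms: { term: string; year: number }[] = [];
  if (cfg.is_scrape_current) {
    terms.push({ term: cfg.current_term, year: cfg.current_year });
  }
  if (cfg.is_scrape_next) {
    terms.push({ term: cfg.next_term, year: cfg.next_year });
  }
  return terms;
}
```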
## Job Types
The scraper supports the following job types, tracked in the D1 database:
| Job Type | Description |
|----------|-------------|
| `discover-programs` | Discovers all program URLs from the bulletin |
| `discover-courses` | Discovers all course URLs from the bulletin |
| `discover-course-offerings` | Discovers course offering URLs from Albert public search for a specific term/year |
| `program` | Scrapes detailed data for a single program |
| `course` | Scrapes detailed data for a single course |
| `course-offering` | Scrapes detailed data for a single course offering (section) |
Jobs can include metadata (stored as JSON) to pass contextual information such as the academic term and year.
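The job types in the table map naturally onto a TypeScript union, and the JSON metadata column onto a small parser. The metadata shape below is an assumption based on the term/year example; the docs only say it carries contextual information:

```typescript
// The six job types tracked in the D1 database, from the table above.
type JobType =
  | "discover-programs"
  | "discover-courses"
  | "discover-course-offerings"
  | "program"
  | "course"
  | "course-offering";

// Assumed metadata shape (the docs mention only term/year context).
interface JobMetadata {
  term?: string;
  year?: number;
}

// Read the JSON metadata column; a null column yields empty metadata.
function parseJobMetadata(raw: string | null): JobMetadata {
  return raw ? (JSON.parse(raw) as JobMetadata) : {};
}
```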
## Project Structure
The scraper's code is organized as follows:
- `src/index.ts`: The main entry point for the Cloudflare Worker, including the scheduled and queue handlers.
- `src/drizzle/`: The Drizzle ORM schema and database connection setup.
- `src/lib/`: Core libraries for interacting with Convex and managing the job queue.
- `src/modules/`: The logic for discovering and scraping courses, programs, and course offerings.
  - `programs/`: Program discovery and scraping logic
  - `courses/`: Course discovery and scraping logic
  - `courseOfferings/`: Course offering discovery and scraping logic (in progress)