
Commit a79e3a9

feat(scraper): add scraper job logic for course offerings (#82)
1 parent 77b6597 commit a79e3a9

14 files changed, +293 -60 lines changed

apps/docs/src/content/docs/architecture/data-flow.md

Lines changed: 19 additions & 3 deletions
@@ -8,6 +8,8 @@ Understanding the flow of data is crucial to comprehending how AlbertPlus works.
 
 The primary data pipeline is responsible for collecting, storing, and serving course and program information.
 
+### Static Course & Program Data (Manual Trigger)
+
 1. **Scraping (Cloudflare Worker)**
    - **Admin Trigger**: Admin users initiate scraping by calling Convex actions (`api.scraper.triggerMajorsScraping` or `api.scraper.triggerCoursesScraping`).
    - **Authenticated Request**: The Convex action makes a POST request to the scraper's HTTP endpoints (`/api/trigger-majors` or `/api/trigger-courses`) with the `CONVEX_API_KEY` in the `X-API-KEY` header.
@@ -17,14 +19,28 @@ The primary data pipeline is responsible for collecting, storing, and serving co
    - **Data Extraction**: Each job in the queue is processed by the worker, which scrapes the detailed information for a specific course or program.
    - **Upsert to Backend**: The scraped data is sent back to the Convex backend via authenticated HTTP endpoints.
 
+### Dynamic Course Offerings Data (Scheduled)
+
+1. **Automated Scraping (Cloudflare Worker Cronjob)**
+   - **Scheduled Trigger**: A cronjob runs at regular intervals (configured in `wrangler.jsonc`).
+   - **Config Check**: The worker reads app configuration from Convex to determine which terms to scrape (`is_scrape_current`, `is_scrape_next`, along with term/year information).
+   - **Albert Public Search**: For each enabled term, the worker scrapes Albert's public class search to discover all course offering URLs.
+   - **Job Queuing**: Each course offering URL is added to the queue as a `course-offering` job with metadata about the term and year.
+   - **Section Details**: Each job scrapes detailed information including:
+     - Class number, section, and status (open/closed/waitlist)
+     - Instructor names and location
+     - Meeting days, start time, and end time
+     - Corequisite relationships
+   - **Batch Upsert**: Scraped course offerings are sent to Convex in batches via the `/api/courseOfferings/upsert` endpoint.
+
 2. **Backend Processing (Convex)**
    - **Data Reception**: The Convex backend receives the scraped data from the Cloudflare Worker.
-   - **Database Storage**: The data is upserted into the Convex database, ensuring that existing records are updated and new ones are created. This includes courses, programs, requirements, and prerequisites.
+   - **Database Storage**: The data is upserted into the Convex database, ensuring that existing records are updated and new ones are created. This includes courses, programs, requirements, prerequisites, and course offerings.
    - **Real-time Updates**: Any clients connected to the Convex backend (such as the web app) will receive real-time updates as the new data is written to the database.
 
 3. **Client-side Consumption (Web App & Browser Extension)**
-   - **Data Fetching**: The Next.js web app and the browser extension query the Convex backend to fetch course and program data.
-   - **User Interface**: The data is then rendered in the user interface, allowing students to browse the course catalog, view program requirements, and build their schedules.
+   - **Data Fetching**: The Next.js web app and the browser extension query the Convex backend to fetch course, program, and course offering data.
+   - **User Interface**: The data is then rendered in the user interface, allowing students to browse the course catalog, view program requirements, check real-time class availability, and build their schedules.
 
 ## Degree Progress Report Parsing
 
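The admin-trigger flow described in the diff above reduces to an authenticated POST from a Convex action to the worker. A minimal sketch of that request, with a hypothetical helper name (`triggerScrape`); the endpoint paths and the `X-API-KEY` header are documented above, everything else is illustrative:

```ts
// Hedged sketch of the admin-triggered scrape request. The endpoint paths and
// X-API-KEY header come from the docs above; the helper name and error
// handling are illustrative assumptions, not code from this commit.
async function triggerScrape(
  scraperUrl: string,
  apiKey: string,
  kind: "majors" | "courses",
): Promise<void> {
  const endpoint =
    kind === "majors" ? "/api/trigger-majors" : "/api/trigger-courses";
  const res = await fetch(new URL(endpoint, scraperUrl), {
    method: "POST",
    headers: { "X-API-KEY": apiKey },
  });
  if (!res.ok) {
    throw new Error(`Scraper trigger failed with status ${res.status}`);
  }
}
```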

apps/docs/src/content/docs/architecture/overview.md

Lines changed: 3 additions & 1 deletion
@@ -10,7 +10,9 @@ The AlbertPlus ecosystem is composed of several distinct yet interconnected appl
 
 - **Web Application**: A feature-rich Next.js application that serves as the primary user interface for course planning and schedule building.
 - **Browser Extension**: A Chrome extension built with Plasmo that integrates with the native Albert website, providing a seamless user experience.
-- **Web Scraper**: A Cloudflare Worker that periodically scrapes course data from NYU's public-facing systems to ensure the information in AlbertPlus is always up-to-date.
+- **Web Scraper**: A Cloudflare Worker that scrapes course data from NYU's public-facing systems, including:
+  - **Manual Scraping**: Admin-triggered scraping of static course catalog and program data from NYU bulletins
+  - **Automated Scraping**: Scheduled cronjob that scrapes real-time course offerings (sections, availability, schedules) from Albert public search
 - **Serverless Backend**: A Convex-powered backend that provides a real-time database, serverless functions, and authentication services.
 - **Documentation Site**: An Astro and Starlight-based website that you are currently viewing, which serves as the central hub for all project documentation.
 

apps/docs/src/content/docs/getting-started/environment-variables.md

Lines changed: 20 additions & 12 deletions
@@ -27,29 +27,37 @@ These variables are needed for the Chrome browser extension.
 
 ## Scraper (`apps/scraper`)
 
-These variables are required for the Cloudflare Worker scraper.
+These environment variables are required for the Cloudflare Worker scraper.
 
-| Variable            | Description                                                                     |
-| ------------------- | ------------------------------------------------------------------------------- |
-| `CONVEX_SITE_URL`   | The HTTP API URL for your Convex backend.                                       |
-| `CONVEX_API_KEY`    | An API key for authenticating with the Convex backend.                          |
-| `SCRAPING_BASE_URL` | The base URL for the NYU course bulletins (e.g., `https://bulletins.nyu.edu/`). |
+| Variable          | Description                                             |
+| ----------------- | ------------------------------------------------------- |
+| `CONVEX_SITE_URL` | The HTTP API URL for your Convex backend.               |
+| `CONVEX_API_KEY`  | An API key for authenticating with the Convex backend.  |
 
 ## Convex Backend (`packages/server`)
 
 These variables are configured in your Convex deployment environment.
 
-| Variable                  | Description                                                                          |
-| ------------------------- | ------------------------------------------------------------------------------------ |
-| `CLERK_JWT_ISSUER_DOMAIN` | The JWT issuer domain from your Clerk account for token validation.                  |
+| Variable                  | Description                                                                           |
+| ------------------------- | ------------------------------------------------------------------------------------- |
+| `CLERK_JWT_ISSUER_DOMAIN` | The JWT issuer domain from your Clerk account for token validation.                   |
 | `CONVEX_API_KEY`          | A shared API key for authenticating requests between Convex and the scraper worker.  |
-| `SCRAPER_URL`             | The URL of the deployed scraper worker (e.g., `https://scraper.albertplus.com`).     |
+| `SCRAPER_URL`             | The URL of the deployed scraper worker (e.g., `https://scraper.albertplus.com`).      |
 
-## Cloudflare Worker Bindings
+## Cloudflare Worker Configuration
 
-These are not environment variables in the traditional sense, but rather bindings configured in the `wrangler.jsonc` file.
+These are configured in `wrangler.jsonc`.
+
+### Bindings
 
 | Binding          | Type        | Description                              |
 | ---------------- | ----------- | ---------------------------------------- |
 | `SCRAPING_QUEUE` | Queue       | Binding for the Cloudflare Worker queue. |
 | `DB`             | D1 Database | Binding for the Cloudflare D1 database.  |
+
+### Variables
+
+| Variable                   | Description                                                                                     |
+| -------------------------- | ----------------------------------------------------------------------------------------------- |
+| `SCRAPING_BASE_URL`        | The base URL for NYU course bulletins (e.g., `https://bulletins.nyu.edu/`).                     |
+| `ALBERT_SCRAPING_BASE_URL` | The base URL for Albert public class search (e.g., `https://bulletins.nyu.edu/class-search/`).  |
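For orientation, a minimal `wrangler.jsonc` sketch wiring up the bindings and variables from the tables above. The cron expression, queue name, and database ID are assumed placeholders, not values from this commit:

```jsonc
// Hypothetical wrangler.jsonc sketch; the cron schedule, queue name, and
// database_id are illustrative placeholders, not values from this commit.
{
  "name": "scraper",
  "main": "src/index.ts",
  "compatibility_date": "2025-01-01", // placeholder
  "triggers": {
    "crons": ["0 */6 * * *"] // assumed interval; the real schedule is project-defined
  },
  "queues": {
    "producers": [{ "binding": "SCRAPING_QUEUE", "queue": "scraping-queue" }],
    "consumers": [{ "queue": "scraping-queue" }]
  },
  "d1_databases": [
    {
      "binding": "DB",
      "database_name": "scraper-db",
      "database_id": "<your-d1-database-id>"
    }
  ],
  "vars": {
    "SCRAPING_BASE_URL": "https://bulletins.nyu.edu/",
    "ALBERT_SCRAPING_BASE_URL": "https://bulletins.nyu.edu/class-search/"
  }
}
```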

apps/docs/src/content/docs/modules/convex.md

Lines changed: 13 additions & 0 deletions
@@ -35,3 +35,16 @@ bun run dashboard
 | `userCourseOfferings` | Links users to the specific course offerings they have added to their schedule. |
 | `students`            | Stores student-specific information, linked to a Clerk user ID.                 |
 | `schools`             | A list of the different schools within NYU.                                     |
+
+## App Configuration Keys
+
+The `appConfigs` table stores various configuration settings that control scraper behavior and term information:
+
+| Key                 | Type                                         | Description                                                       |
+| ------------------- | -------------------------------------------- | ----------------------------------------------------------------- |
+| `current_term`      | `"spring" \| "summer" \| "fall" \| "j-term"` | The current academic term                                         |
+| `current_year`      | `string`                                     | The current academic year (e.g., `"2025"`)                        |
+| `next_term`         | `"spring" \| "summer" \| "fall" \| "j-term"` | The next academic term                                            |
+| `next_year`         | `string`                                     | The next academic year                                            |
+| `is_scrape_current` | `"true" \| "false"`                          | Flag to enable/disable scraping of current term course offerings  |
+| `is_scrape_next`    | `"true" \| "false"`                          | Flag to enable/disable scraping of next term course offerings     |
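Every `appConfigs` value is stored as a string, including the `"true"`/`"false"` flags, so consumers coerce types themselves. A sketch of that coercion, mirroring what the scheduled handler in this commit does inline; the helper shape and the injected `getAppConfig` signature are assumptions:

```ts
// Illustrative helper mirroring the inline coercion in the scheduled handler;
// the function shape and null-handling are assumptions, not commit code.
type Term = "spring" | "summer" | "fall" | "j-term";

interface TermConfig {
  term: Term;
  year: number;
}

async function readTermConfig(
  getAppConfig: (args: { key: string }) => Promise<string | null>,
  which: "current" | "next",
): Promise<TermConfig | null> {
  // Flags are stored as the strings "true"/"false", not booleans.
  const enabled =
    (await getAppConfig({ key: `is_scrape_${which}` })) === "true";
  if (!enabled) return null;
  const term = (await getAppConfig({ key: `${which}_term` })) as Term | null;
  const yearStr = await getAppConfig({ key: `${which}_year` });
  if (!term || !yearStr) return null;
  return { term, year: Number.parseInt(yearStr, 10) };
}
```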

apps/docs/src/content/docs/modules/scraper.md

Lines changed: 36 additions & 1 deletion
@@ -16,6 +16,8 @@ The scraper, located in the `apps/scraper` directory, is a critical component of
 
 The scraping process is designed to be robust and resilient:
 
+### Manual Scraping (Programs & Courses)
+
 1. **Admin Trigger**: Admin users can trigger scraping through the Convex backend by calling dedicated actions:
    - `api.scraper.triggerMajorsScraping` - Initiates major (program) discovery
    - `api.scraper.triggerCoursesScraping` - Initiates course discovery
@@ -29,11 +31,44 @@ The scraping process is designed to be robust and resilient:
 7. **Data Upsert**: The scraped data is then sent to the Convex backend via authenticated HTTP requests to be stored in the main database.
 8. **Error Handling**: The system includes error logging and a retry mechanism for failed jobs.
 
+### Automated Scraping (Course Offerings)
+
+Course offerings (class sections with schedule details) are scraped automatically via a scheduled Cloudflare Worker cronjob:
+
+1. **Scheduled Trigger**: The worker runs on a schedule defined in `wrangler.jsonc` to check for new course offerings.
+2. **App Config Check**: The worker reads the following configuration from Convex:
+   - `is_scrape_current` - Boolean flag to enable scraping current term
+   - `is_scrape_next` - Boolean flag to enable scraping next term
+   - `current_term` / `current_year` - Identifies the current academic term
+   - `next_term` / `next_year` - Identifies the next academic term
+3. **Discovery Jobs**: For each enabled term, the worker creates a `discover-course-offerings` job that scrapes Albert's public search to find all course offering URLs.
+4. **Individual Jobs**: Each discovered course offering URL becomes a `course-offering` job in the queue.
+5. **Data Processing**: The worker scrapes details such as class number, section, instructor, schedule, location, and enrollment status.
+6. **Backend Sync**: Scraped course offerings are sent to Convex via the `/api/courseOfferings/upsert` endpoint in batches.
+
+## Job Types
+
+The scraper supports the following job types, tracked in the D1 database:
+
+| Job Type                    | Description                                                                        |
+| --------------------------- | ---------------------------------------------------------------------------------- |
+| `discover-programs`         | Discovers all program URLs from the bulletin                                       |
+| `discover-courses`          | Discovers all course URLs from the bulletin                                        |
+| `discover-course-offerings` | Discovers course offering URLs from Albert public search for a specific term/year  |
+| `program`                   | Scrapes detailed data for a single program                                         |
+| `course`                    | Scrapes detailed data for a single course                                          |
+| `course-offering`           | Scrapes detailed data for a single course offering (section)                       |
+
+Jobs can include metadata (stored as JSON) to pass contextual information such as the academic term and year.
+
 ## Project Structure
 
 The scraper's code is organized as follows:
 
 - `src/index.ts`: The main entry point for the Cloudflare Worker, including the scheduled and queue handlers.
 - `src/drizzle/`: The Drizzle ORM schema and database connection setup.
 - `src/lib/`: Core libraries for interacting with Convex and managing the job queue.
-- `src/modules/`: The logic for discovering and scraping courses and programs.
+- `src/modules/`: The logic for discovering and scraping courses, programs, and course offerings.
+  - `programs/`: Program discovery and scraping logic
+  - `courses/`: Course discovery and scraping logic
+  - `courseOfferings/`: Course offering discovery and scraping logic (in progress)
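Job metadata is persisted as untyped JSON, and the queue handler in this commit narrows it with a cast plus a null check. A stricter runtime guard is one possible alternative; this sketch is illustrative only, not part of the commit:

```ts
// Hypothetical runtime guard for the JSON metadata column; the commit itself
// uses a type cast plus a null check, so this is an illustrative alternative.
type Term = "spring" | "summer" | "fall" | "j-term";

interface TermMetadata {
  term: Term;
  year: number;
}

function isTermMetadata(value: unknown): value is TermMetadata {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.year === "number" &&
    typeof v.term === "string" &&
    ["spring", "summer", "fall", "j-term"].includes(v.term)
  );
}
```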

apps/scraper/src/drizzle/schema.ts

Lines changed: 9 additions & 1 deletion
@@ -12,8 +12,16 @@ export const jobs = sqliteTable("jobs", {
     .notNull()
     .default("pending"),
   jobType: text("job_type", {
-    enum: ["discover-programs", "discover-courses", "program", "course"],
+    enum: [
+      "discover-programs",
+      "discover-courses",
+      "discover-course-offerings",
+      "program",
+      "course",
+      "course-offering",
+    ],
   }).notNull(),
+  metadata: text("metadata", { mode: "json" }),
   createdAt: integer("created_at", { mode: "timestamp" })
     .notNull()
     .$defaultFn(() => new Date()),
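Drizzle's JSON-mode `text` column is untyped by default, which is why the queue handler below casts `job.metadata`. A possible refinement, not part of this commit, is to narrow the column with Drizzle's `.$type<...>()`:

```ts
// Possible refinement, not in this commit: narrow the JSON column so reads of
// job.metadata are typed instead of requiring casts. Table and column names
// here are illustrative.
import { integer, sqliteTable, text } from "drizzle-orm/sqlite-core";

type JobMetadata = {
  term: "spring" | "summer" | "fall" | "j-term";
  year: number;
};

export const jobsSketch = sqliteTable("jobs_sketch", {
  id: integer("id").primaryKey({ autoIncrement: true }),
  metadata: text("metadata", { mode: "json" }).$type<JobMetadata | null>(),
});
```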

apps/scraper/src/index.ts

Lines changed: 119 additions & 8 deletions
@@ -5,6 +5,10 @@ import getDB from "./drizzle";
 import { errorLogs, jobs } from "./drizzle/schema";
 import { ConvexApi } from "./lib/convex";
 import { JobError, type JobMessage } from "./lib/queue";
+import {
+  discoverCourseOfferings,
+  scrapeCourseOfferings,
+} from "./modules/courseOfferings";
 import { discoverCourses, scrapeCourse } from "./modules/courses";
 import { discoverPrograms, scrapeProgram } from "./modules/programs";
 
@@ -80,14 +84,77 @@ app.post("/api/courses", validateApiKey, async (c) => {
 export default {
   fetch: app.fetch,
 
-  async scheduled(_event: ScheduledEvent, _env: CloudflareBindings) {
-    // const db = getDB(env);
-    // const convex = new ConvexApi({
-    //   baseUrl: env.CONVEX_SITE_URL,
-    //   apiKey: env.CONVEX_API_KEY,
-    // });
-    // TODO: add albert public search
-    return;
+  async scheduled(_event: ScheduledEvent, env: CloudflareBindings) {
+    const db = getDB(env);
+    const convex = new ConvexApi({
+      baseUrl: env.CONVEX_SITE_URL,
+      apiKey: env.CONVEX_API_KEY,
+    });
+
+    // Get scraping flags from Convex app config
+    const isScrapeCurrentData = await convex.getAppConfig({
+      key: "is_scrape_current",
+    });
+    const isScrapeNextData = await convex.getAppConfig({
+      key: "is_scrape_next",
+    });
+
+    const isScrapeCurrent = isScrapeCurrentData === "true";
+    const isScrapeNext = isScrapeNextData === "true";
+
+    console.log(
+      `Cronjob: Scraping flags - current: ${isScrapeCurrent}, next: ${isScrapeNext}`,
+    );
+
+    // Collect terms to scrape
+    const termsToScrape: Array<{
+      term: "spring" | "summer" | "fall" | "j-term";
+      year: number;
+    }> = [];
+
+    if (isScrapeCurrent) {
+      const currentTerm = (await convex.getAppConfig({
+        key: "current_term",
+      })) as "spring" | "summer" | "fall" | "j-term";
+      const currentYearStr = await convex.getAppConfig({ key: "current_year" });
+      if (currentYearStr) {
+        const currentYear = Number.parseInt(currentYearStr, 10);
+        termsToScrape.push({ term: currentTerm, year: currentYear });
+      }
+    }
+
+    if (isScrapeNext) {
+      const nextTerm = (await convex.getAppConfig({ key: "next_term" })) as
+        | "spring"
+        | "summer"
+        | "fall"
+        | "j-term";
+      const nextYearStr = await convex.getAppConfig({ key: "next_year" });
+      if (nextYearStr) {
+        const nextYear = Number.parseInt(nextYearStr, 10);
+        termsToScrape.push({ term: nextTerm, year: nextYear });
+      }
+    }
+
+    // Trigger course offerings discovery for each enabled term
+    const courseOfferingsUrl = new URL(env.SCRAPING_BASE_URL).toString();
+
+    for (const { term, year } of termsToScrape) {
+      const [createdJob] = await db
+        .insert(jobs)
+        .values({
+          url: courseOfferingsUrl,
+          jobType: "discover-course-offerings",
+          metadata: { term, year },
+        })
+        .returning();
+
+      await env.SCRAPING_QUEUE.send({ jobId: createdJob.id });
+
+      console.log(
+        `Cronjob: Created course offerings discovery job [id: ${createdJob.id}, term: ${term}, year: ${year}]`,
+      );
+    }
   },
 
   async queue(
@@ -186,6 +253,50 @@ export default {
         }
         break;
       }
+      case "discover-course-offerings": {
+        const metadata = job.metadata as {
+          term: "spring" | "summer" | "fall" | "j-term";
+          year: number;
+        } | null;
+
+        if (!metadata?.term || !metadata?.year) {
+          throw new JobError(
+            "Missing term or year in job metadata",
+            "validation",
+          );
+        }
+
+        const courseOfferingUrls = await discoverCourseOfferings(
+          job.url,
+          metadata.term,
+          metadata.year,
+        );
+        const newJobs = await db
+          .insert(jobs)
+          .values(
+            courseOfferingUrls.map((url) => ({
+              url,
+              jobType: "course-offering" as const,
+              metadata: { term: metadata.term, year: metadata.year },
+            })),
+          )
+          .returning();
+
+        await env.SCRAPING_QUEUE.sendBatch(
+          newJobs.map((j) => ({ body: { jobId: j.id } })),
+        );
+        break;
+      }
+      case "course-offering": {
+        const courseOfferings = await scrapeCourseOfferings(
+          job.url,
+          db,
+          env,
+        );
+
+        await convex.upsertCourseOfferings(courseOfferings);
+        break;
+      }
     }
 
     await db
apps/scraper/src/lib/convex.ts

Lines changed: 3 additions & 3 deletions
@@ -2,7 +2,7 @@ import type { internal } from "@albert-plus/server/convex/_generated/api";
 import {
   ZGetAppConfig,
   type ZSetAppConfig,
-  ZUpsertCourseOffering,
+  ZUpsertCourseOfferings,
   ZUpsertCourseWithPrerequisites,
   ZUpsertProgramWithRequirements,
 } from "@albert-plus/server/convex/http";
@@ -73,12 +73,12 @@ export class ConvexApi {
     return res.data;
   }
 
-  async upsertCourseOffering(data: z.infer<typeof ZUpsertCourseOffering>) {
+  async upsertCourseOfferings(data: z.infer<typeof ZUpsertCourseOfferings>) {
     const res = await this.request<
       FunctionReturnType<
         typeof internal.courseOfferings.upsertCourseOfferingInternal
       >
-    >("/api/courseOfferings/upsert", ZUpsertCourseOffering, data);
+    >("/api/courseOfferings/upsert", ZUpsertCourseOfferings, data);
     return res.data;
   }
 
apps/scraper/src/modules/courseOfferings/index.ts

Lines changed: 26 additions & 0 deletions

@@ -0,0 +1,26 @@
+/** biome-ignore-all lint/correctness/noUnusedFunctionParameters: bypass for now */
+import type { ZUpsertCourseOfferings } from "@albert-plus/server/convex/http";
+import type { DrizzleD1Database } from "drizzle-orm/d1";
+import type * as z from "zod/mini";
+
+export type CourseOfferingData = z.infer<typeof ZUpsertCourseOfferings>;
+
+export async function discoverCourseOfferings(
+  url: string,
+  term: "spring" | "summer" | "fall" | "j-term",
+  year: number,
+): Promise<string[]> {
+  // TODO: implement this function to scrape the Albert public search listing
+  // This should extract all course URLs from the search page for the given term/year
+  // Example: returns ["https://albert.../CSCI-UA-101?term=spring&year=2025", ...]
+  return [];
+}
+
+export async function scrapeCourseOfferings(
+  url: string,
+  db: DrizzleD1Database,
+  env: CloudflareBindings,
+): Promise<CourseOfferingData> {
+  // TODO: implement this function to scrape a single course page
+  throw new Error("Not implemented");
+}
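Both exports are stubs for now (the docs mark the module "in progress"). Purely for illustration, one plausible shape for the discovery step using the Workers-native `HTMLRewriter`; the query parameter names and the `/class-search/` link heuristic are assumptions about Albert's markup, not details from this commit:

```ts
// Illustrative only: one way discoverCourseOfferings might eventually look.
// Query parameters and the anchor-href heuristic are assumptions about
// Albert's public search markup.
async function discoverCourseOfferingsSketch(
  baseUrl: string,
  term: "spring" | "summer" | "fall" | "j-term",
  year: number,
): Promise<string[]> {
  const searchUrl = new URL(baseUrl);
  searchUrl.searchParams.set("term", term); // assumed parameter name
  searchUrl.searchParams.set("year", String(year)); // assumed parameter name

  const res = await fetch(searchUrl);
  if (!res.ok) {
    throw new Error(`Albert search request failed: ${res.status}`);
  }

  const urls: string[] = [];
  // HTMLRewriter is the Workers-native streaming HTML parser.
  const rewritten = new HTMLRewriter()
    .on("a", {
      element(el) {
        const href = el.getAttribute("href");
        // Assumed heuristic: offering links live under a /class-search/ path.
        if (href?.includes("/class-search/")) {
          urls.push(new URL(href, searchUrl).toString());
        }
      },
    })
    .transform(res);
  await rewritten.arrayBuffer(); // drain the stream so handlers run

  return urls;
}
```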
