Commit 90b3efe

xleonx0x, RyanYensch, mt-fns, gabriellalianti, and Yalilix authored
rewrite: new scraper (#69)
* added the room collecting sessions scraper, failed to upload to database
* feat: added xlsx fetcher
* feat: added xlsx decoder
* feat: add scrapeTTRooms
* feat: add sessionIdentity JSON file
* feat: convert BookingsExcelRow interface to RawrRoomBookings
* feat: add session id scrape script
* update collectSessionIdentities to save to json:
* fix: session identities json to actual timetable site
* chore: fix fetcher and decoder for new site
* chore: uncomment file removal
* chore: added room id parser
* chore: fix up regex to handle any case for the room
* feat: add parseBookingRow and time parsers
* fix: fixed stalling when lots of bookings loaded
* feat: added env key for booking loading timeout
* style: ran formatter on fetchXlsx
* style: removed trailing whitespace from parsebookingrow
* fix: added options to url so it fetches entire year and covers all days of week and entire day
* chore: added type for room session identity
* chore: set headless to true so 20 browsers dont show up
* refactor: saved bookings to json and refactored to use promises
* chore: add return type for getBookings
* fix: fixed scraper crashing when no bookings found
* updated session scraper to do all of the different rooms
* fix: make path absolute
* chore: added session identities
* feat: adjusted nameParsers to match the new website, updated parseBookingRow to include booktingType and name from nameParse() function
* made the Publish bookings upload to the api
* feat: adjusted exam type pattern and parser
* chore: add function comments for parseBookingRow
* feat: adjusted block type bug for name parser
* refactor: refactored main logic to be cleaner and functional
* chore: moved parseRoomIds fn to parseBookingRow
* refactor: fetching excel rows is now done in-memory using buffers instead of writing to disk
* refactor: collect session identities now uses absolute path and caches itself
* chore: moved cache logic into collect session identities
* chore: moved cache logic into collect session identities
* chore: modified bookings loading timeout to 100 seconds
* chore: added a constant for the file path of session identities
* chore: cleaned up src
* chore: clean up and hook new bookings scraper up to run scrape
* chore: remove scrape-publish command
* chore: change function comments for parseBookingRow
* chore: added comments excel scraping functions and scrape bookings fn
* chore: cleaned up export and added some comments
* chore: add EXAM as a booking type to schema
* added comments to the session collecting files
* feat: added parsing for FS (Foundation Studies) names
* fix: added a negative lookahead for MISC_CLASS type to prevent BLOCK type to match as MISC_CLASS type
* refactor: rewrite scraping logic to use publish api
* chore: add a getter for the day manually
* chore: type collect session identities fn
* chore: modify booking excel row interface to match api
* refactor: fully fleshed out api now courtesy of franco
* chore: removed excel scraper
* chore: remove unused files
* chore: clean up and move types to types.ts
* chore: move constants to top of file
* chore: remove session identity json
* feat: added rate limiting to requests
* refactor: added a publish api client
* chore: change logging message and fix var name
* chore: fix import issue
* chore: added some comments
* chore: remove constants file
* chore: change time between requests to 1 second
* chore: removed unused packages xlsx and playwright
* docs: update readme for new publish scraper rewrite
* fix: reordered parsers, tighten MISC_CLASS, add ^ anchors to INTERNAL_SOCIETY, SOCIETY, AND INTERNAL patterns
* fix: removed the LLL suffix from the MISC type in MISC_CLASS
* fix: fix regex for parsing roomids in bookings
* fix: add filter for bookings for rooms that aren't fetched
* fix: replaced nss scrapes with hard coded data, since nss is now 403 rip
* chore: removed unused files
* chore: update package lock json
* chore: removed buildings and rooms from gitignore artifacts
* chore: add rooms and buildings hardcoded nss data
* chore: remove unused package command

---------

Co-authored-by: Ryan Yensch <ry2005@outlook.com>
Co-authored-by: Michael Tanto <michaeltanto0909@gmail.com>
Co-authored-by: Gabriella L <134288268+gabriellalianti@users.noreply.github.com>
Co-authored-by: Yanlin Li <yanauslin8@gmail.com>
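
The commit message mentions rate limiting with one second between requests, but the implementation is not part of this excerpt. The following is a minimal sketch of one way to space requests out, not the scraper's actual code; `sleep`, `runSequentially`, and `REQUEST_DELAY_MS` are hypothetical names introduced for illustration.

```typescript
// Assumed value, taken from "change time between requests to 1 second".
const REQUEST_DELAY_MS = 1000;

// Hypothetical helper: resolve after `ms` milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run async tasks one at a time, pausing `delayMs` between consecutive tasks.
// Results are returned in the same order as the tasks were given.
async function runSequentially<T>(
  tasks: Array<() => Promise<T>>,
  delayMs: number,
): Promise<T[]> {
  const results: T[] = [];
  let first = true;
  for (const task of tasks) {
    if (!first) {
      await sleep(delayMs); // no pause before the very first request
    }
    first = false;
    results.push(await task());
  }
  return results;
}
```

Calling `runSequentially(tasks, REQUEST_DELAY_MS)` with one task per room would issue the requests strictly in order, roughly one second apart. A sequential loop is the simplest option; a token bucket or a concurrency-limited queue would be alternatives if throughput mattered more.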
1 parent 4fd6aeb commit 90b3efe

24 files changed, +17147 -596 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ Mapping of school codes can be found [here](https://github.com/devsoc-unsw/freer
 | `start` | Start time of the booking. | "2024-01-27T04:00:00+00:00" |
 | `end` | End time of the booking. | "2024-01-27T08:00:00+00:00" |

-Full list of current booking types is: "CLASS", "SOCIETY", "INTERNAL", "LIB", "BLOCK", "MISC".
+Full list of current booking types is: "CLASS", "SOCIETY", "INTERNAL", "LIB", "BLOCK", "MISC", "EXAM".

 ### Relationships
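
The README hunk above adds "EXAM" to the list of booking types. As an illustration only, the list could be mirrored in TypeScript as a string-literal union with a runtime guard; `BOOKING_TYPES`, `BookingType`, and `isBookingType` are hypothetical names, and the repository's real schema is not shown in this excerpt.

```typescript
// Booking types exactly as listed in the README after this commit.
const BOOKING_TYPES = [
  "CLASS",
  "SOCIETY",
  "INTERNAL",
  "LIB",
  "BLOCK",
  "MISC",
  "EXAM",
] as const;

// Union of the literal strings above.
type BookingType = (typeof BOOKING_TYPES)[number];

// Runtime check, useful when the value comes from scraped or untyped data.
function isBookingType(value: string): value is BookingType {
  return BOOKING_TYPES.some((t) => t === value);
}

console.log(isBookingType("EXAM")); // true
console.log(isBookingType("exam")); // false (the comparison is case-sensitive)
```

Deriving the type from the array keeps the compile-time union and the runtime list in one place, so adding a type such as "EXAM" is a one-line change.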

nss/.gitignore

Lines changed: 0 additions & 2 deletions
@@ -124,5 +124,3 @@ database.json

 # artifacts
 bookings.json
-buildings.json
-rooms.json

nss/README.md

Lines changed: 3 additions & 3 deletions
@@ -1,5 +1,5 @@
 # nss-scraper
-Scraper for https://nss.cse.unsw.edu.au/tt/. This scrapes the data for all buildings and rooms managed by the UNSW central timetabling system, as well as all bookings on this system. This includes classes and society bookings as well as others.
+Scraper for https://nss.cse.unsw.edu.au/tt/ and https://publish.unsw.edu.au/. This scrapes the data for all buildings and rooms managed by the UNSW central timetabling system, as well as all bookings from Publish. This includes classes and society bookings as well as others.

 ## Data Sources

@@ -9,6 +9,6 @@ Room data is all scraped from doing a search on https://nss.cse.unsw.edu.au/tt/f

 For each room, the facilities are scraped from https://nss.cse.unsw.edu.au/tt/find_rooms.php?dbafile=2024-KENS-COFA.DBA&campus=KENS (same link as above). For the facilities to appear, you need to pass in `show: "show_facilities"` and `room: roomId` in the request body.

-Bookings are scraped separately for each room from the individual room pages (e.g. https://nss.cse.unsw.edu.au/tt/view_rooms.php?dbafile=2024-KENS-COFA.DBA&campus=KENS). By setting the date range to be the whole year, each booking will show a bit string on hover (HTML `title` element) describing which weeks of the year it runs in.
+Bookings are scraped separately from Publish through their public API (https://publish.unsw.edu.au/timetables?view=week).

-Some buildings and rooms are ignored, which can be seen and configured in `src/exclusions.ts`.
+Some buildings and rooms are ignored, which can be seen and configured in `src/exclusions.ts`.
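
The nss README above says facilities only appear when `show: "show_facilities"` and `room: roomId` are passed in the request body. A rough sketch of building such a body follows. It assumes the body is form-encoded, which the README does not state; `buildFacilitiesBody` and the room ID `K-J17-G01` are hypothetical and used only for illustration.

```typescript
// Assumption: the endpoint accepts an application/x-www-form-urlencoded body.
// The README only names the two fields, not the encoding.
function buildFacilitiesBody(roomId: string): URLSearchParams {
  return new URLSearchParams({
    show: "show_facilities",
    room: roomId,
  });
}

// "K-J17-G01" is a made-up room ID for illustration.
const body = buildFacilitiesBody("K-J17-G01");
console.log(body.toString()); // show=show_facilities&room=K-J17-G01
```

The resulting object could be passed as the `body` of a POST request. Note that the commit message reports nss now returning 403, which is why this commit hard-codes the room and building data instead.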

nss/nss_data/buildings.json

Lines changed: 320 additions & 0 deletions
@@ -0,0 +1,320 @@
+[
+  {
+    "name": "AGSM",
+    "id": "K-G27",
+    "lat": -33.91852,
+    "long": 151.235664,
+    "aliases": []
+  },
+  {
+    "name": "Ainsworth Building",
+    "id": "K-J17",
+    "lat": -33.918527,
+    "long": 151.23126,
+    "aliases": []
+  },
+  {
+    "name": "Biological Sciences (West)",
+    "id": "K-D26",
+    "lat": -33.917381,
+    "long": 151.235403,
+    "aliases": []
+  },
+  {
+    "name": "Biological Sciences",
+    "id": "K-E26",
+    "lat": -33.917682,
+    "long": 151.235736,
+    "aliases": []
+  },
+  {
+    "name": "Blockhouse",
+    "id": "K-G6",
+    "lat": -33.916937,
+    "long": 151.226826,
+    "aliases": []
+  },
+  {
+    "name": "Civil Engineering Building",
+    "id": "K-H20",
+    "lat": -33.918234,
+    "long": 151.232835,
+    "aliases": []
+  },
+  {
+    "name": "Colombo Building",
+    "id": "K-B16",
+    "lat": -33.915902,
+    "long": 151.231367,
+    "aliases": []
+  },
+  {
+    "name": "Computer Science & Eng (K17)",
+    "id": "K-K17",
+    "lat": -33.918929,
+    "long": 151.231055,
+    "aliases": []
+  },
+  {
+    "name": "Dalton Building",
+    "id": "K-F12",
+    "lat": -33.917502,
+    "long": 151.229353,
+    "aliases": []
+  },
+  {
+    "name": "Patricia OShane",
+    "id": "K-E19",
+    "lat": -33.917177,
+    "long": 151.232511,
+    "aliases": [
+      "Central Lecture Block"
+    ]
+  },
+  {
+    "name": "Electrical Engineering Bldg",
+    "id": "K-G17",
+    "lat": -33.91779,
+    "long": 151.2315,
+    "aliases": []
+  },
+  {
+    "name": "Esme Timbery Creative Practice",
+    "id": "K-D8",
+    "lat": 0,
+    "long": 0,
+    "aliases": []
+  },
+  {
+    "name": "June Griffith",
+    "id": "K-F10",
+    "lat": -33.917108,
+    "long": 151.228826,
+    "aliases": [
+      "Chemical Sciences"
+    ]
+  },
+  {
+    "name": "Gordon Jacqueline Samuels",
+    "id": "K-F25",
+    "lat": -33.917892,
+    "long": 151.235139,
+    "aliases": []
+  },
+  {
+    "name": "Anita B Lawrence Centre",
+    "id": "K-H13",
+    "lat": -33.917876,
+    "long": 151.230057,
+    "aliases": [
+      "Red Centre"
+    ]
+  },
+  {
+    "name": "Health Translation Hub",
+    "id": "K-C29",
+    "lat": 0,
+    "long": 0,
+    "aliases": []
+  },
+  {
+    "name": "Integrated Acute Services Bldg",
+    "id": "K-F31",
+    "lat": 0,
+    "long": 0,
+    "aliases": []
+  },
+  {
+    "name": "John Goodsell Building",
+    "id": "K-F20",
+    "lat": -33.917492,
+    "long": 151.232712,
+    "aliases": []
+  },
+  {
+    "name": "John Niland Scientia",
+    "id": "K-G19",
+    "lat": -33.918,
+    "long": 151.23239,
+    "aliases": []
+  },
+  {
+    "name": "Keith Burrows Theatre",
+    "id": "K-J14",
+    "lat": -33.918207,
+    "long": 151.230109,
+    "aliases": []
+  },
+  {
+    "name": "Law Building",
+    "id": "K-F8",
+    "lat": -33.917004,
+    "long": 151.227791,
+    "aliases": [
+      "Law Library"
+    ]
+  },
+  {
+    "name": "Main Library",
+    "id": "K-F21",
+    "lat": -33.917528,
+    "long": 151.233439,
+    "aliases": []
+  },
+  {
+    "name": "Material Science & Engineering",
+    "id": "K-E10",
+    "lat": -33.916458,
+    "long": 151.228459,
+    "aliases": [
+      "Hilmer Building"
+    ]
+  },
+  {
+    "name": "Mathews Building",
+    "id": "K-F23",
+    "lat": -33.917741,
+    "long": 151.234563,
+    "aliases": []
+  },
+  {
+    "name": "Mathews Theatres",
+    "id": "K-D23",
+    "lat": -33.917178,
+    "long": 151.234177,
+    "aliases": []
+  },
+  {
+    "name": "Morven Brown Building",
+    "id": "K-C20",
+    "lat": -33.916792,
+    "long": 151.232828,
+    "aliases": []
+  },
+  {
+    "name": "Newton Building",
+    "id": "K-J12",
+    "lat": -33.91812,
+    "long": 151.229211,
+    "aliases": []
+  },
+  {
+    "name": "Old Main Building",
+    "id": "K-K15",
+    "lat": -33.918507,
+    "long": 151.229457,
+    "aliases": []
+  },
+  {
+    "name": "Physics Theatre",
+    "id": "K-K14",
+    "lat": -33.918514,
+    "long": 151.230082,
+    "aliases": []
+  },
+  {
+    "name": "Quadrangle Building",
+    "id": "K-E15",
+    "lat": -33.917264,
+    "long": 151.230955,
+    "aliases": []
+  },
+  {
+    "name": "Rex Vowels Theatre",
+    "id": "K-F17",
+    "lat": -33.917536,
+    "long": 151.231493,
+    "aliases": []
+  },
+  {
+    "name": "Rupert Myers Building",
+    "id": "K-M15",
+    "lat": -33.919533,
+    "long": 151.230552,
+    "aliases": []
+  },
+  {
+    "name": "Science & Engineering Building",
+    "id": "K-E8",
+    "lat": -33.91654,
+    "long": 151.227689,
+    "aliases": []
+  },
+  {
+    "name": "Science Theatre",
+    "id": "K-F13",
+    "lat": -33.917204,
+    "long": 151.229831,
+    "aliases": []
+  },
+  {
+    "name": "Sir John Clancy Auditorium",
+    "id": "K-C24",
+    "lat": -33.91696,
+    "long": 151.234542,
+    "aliases": []
+  },
+  {
+    "name": "Squarehouse",
+    "id": "K-E4",
+    "lat": -33.916185,
+    "long": 151.226284,
+    "aliases": []
+  },
+  {
+    "name": "Tyree Energy Technology",
+    "id": "K-H6",
+    "lat": -33.917721,
+    "long": 151.226897,
+    "aliases": []
+  },
+  {
+    "name": "Business School",
+    "id": "K-E12",
+    "lat": -33.9168,
+    "long": 151.229611,
+    "aliases": []
+  },
+  {
+    "name": "Goldstein",
+    "id": "K-D16",
+    "lat": -33.916585,
+    "long": 151.231394,
+    "aliases": []
+  },
+  {
+    "name": "Vallentine Annexe",
+    "id": "K-H22",
+    "lat": -33.918306,
+    "long": 151.233301,
+    "aliases": []
+  },
+  {
+    "name": "Wallace Wurth Building",
+    "id": "K-C27",
+    "lat": -33.916744,
+    "long": 151.235681,
+    "aliases": []
+  },
+  {
+    "name": "Webster Building",
+    "id": "K-G14",
+    "lat": -33.91764,
+    "long": 151.230611,
+    "aliases": []
+  },
+  {
+    "name": "Webster Theatres",
+    "id": "K-G15",
+    "lat": -33.917435,
+    "long": 151.230668,
+    "aliases": []
+  },
+  {
+    "name": "Willis Annexe",
+    "id": "K-J18",
+    "lat": -33.918778,
+    "long": 151.231675,
+    "aliases": []
+  }
+]
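
Each entry in `buildings.json` above has the same five fields, and three entries (K-D8, K-C29, K-F31) carry `lat: 0, long: 0`. The sketch below types one entry and flags such records. The `Building` interface and `hasPlaceholderCoords` are hypothetical names, and reading (0, 0) as "coordinates not yet filled in" is an assumption, not something the commit states.

```typescript
// Shape of one entry in nss/nss_data/buildings.json, inferred from the data above.
interface Building {
  name: string;
  id: string;
  lat: number;
  long: number;
  aliases: string[];
}

// Assumption: (0, 0) marks a building whose coordinates are not filled in yet.
function hasPlaceholderCoords(building: Building): boolean {
  return building.lat === 0 && building.long === 0;
}

// Three records copied from the file above.
const sample: Building[] = [
  { name: "AGSM", id: "K-G27", lat: -33.91852, long: 151.235664, aliases: [] },
  { name: "Esme Timbery Creative Practice", id: "K-D8", lat: 0, long: 0, aliases: [] },
  {
    name: "Anita B Lawrence Centre",
    id: "K-H13",
    lat: -33.917876,
    long: 151.230057,
    aliases: ["Red Centre"],
  },
];

const missingCoords = sample.filter(hasPlaceholderCoords).map((b) => b.id);
console.log(missingCoords); // [ 'K-D8' ]
```

A check like this could run before the hard-coded data is published, so that map consumers do not plot a building at latitude 0, longitude 0.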
