refactor(scraper): fix, optimize, refactor by mini-bomba · Pull Request #207 · Solvro/web-planer

mini-bomba · 2025-02-14T21:38:46Z

this pr does the following to the scraper command:

- split the main `run()` function into multiple task functions
- added fancy progress tracking logs with task timings
- task functions may share data between them using properties of the command object
- tasks should batch updates and run them all in a few queries by using the `*Many()` method variants of lucid models
- implemented simple async semaphores for rate limiting
- the number of fetch and DB tasks running in parallel is limited and can be adjusted using command-line flags
- it actually works now (on my machine) ~~(except it kinda doesnt)~~ (now it does)
- reimplemented the archive task in raw SQL
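
For readers unfamiliar with the pattern, here is a rough sketch of what a "simple async semaphore" for limiting parallel work can look like. It is not the code from this PR: `Semaphore`, `withSlot`, `demo` and the limit of 2 are made-up names and values for illustration, and the timeout stands in for a real fetch.

```typescript
// Minimal async semaphore: at most `limit` holders at a time.
// Hypothetical sketch, not the implementation from this PR.
class Semaphore {
  private available: number;
  private readonly waiters: Array<() => void> = [];

  constructor(limit: number) {
    if (!Number.isInteger(limit) || limit < 1) {
      throw new RangeError("limit must be a positive integer");
    }
    this.available = limit;
  }

  // Resolves once a slot is free.
  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  // Hands the slot directly to the next waiter, or frees it.
  release(): void {
    const next = this.waiters.shift();
    if (next !== undefined) {
      next();
    } else {
      this.available++;
    }
  }

  // Runs `fn` while holding a slot; always releases, even if `fn` throws.
  async withSlot<T>(fn: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await fn();
    } finally {
      this.release();
    }
  }
}

// Example: cap parallel "fetches" at 2 and record the observed peak.
async function demo(): Promise<number> {
  const fetchSlots = new Semaphore(2); // in a real command this could come from a flag
  let running = 0;
  let peak = 0;
  await Promise.all(
    [1, 2, 3, 4, 5].map((id) =>
      fetchSlots.withSlot(async () => {
        running++;
        peak = Math.max(peak, running);
        await new Promise((r) => setTimeout(r, 10)); // stand-in for a real fetch
        running--;
        return id;
      }),
    ),
  );
  return peak;
}
```

Handing the slot straight to the next waiter in `release()` keeps the counter from briefly going above the limit when many tasks are queued.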

todo: (required before merge)
~~figure out why courses are always being duplicated on each scrape~~ done

further optimization possibilities:
~~the archive task could probably be rewritten in raw sql for better performance~~ done

qamarq

the lecturers are getting mismatched with the courses

- replaced `export {}` syntax with marking each function as `export` at declaration
- exported all interfaces
- refactored all `scrap*()` arrow functions to proper functions with proper return typing
- added some new interfaces for return types of `scrap*()` functions
- made all `scrap*()` functions throw improved errors instead of returning `undefined`
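
To illustrate the shape of that change, a minimal sketch follows. All names in it (`ScrapedCourse`, `ScrapeError`, `parseCourse`, the `code; name` input format) are hypothetical; the real `scrap*()` functions and their interfaces are in the PR diff and are not reproduced here.

```typescript
// Hypothetical names throughout; this only shows the pattern.
export interface ScrapedCourse {
  code: string;
  name: string;
}

export class ScrapeError extends Error {
  readonly url: string;

  constructor(message: string, url: string) {
    super(`${message} (while scraping ${url})`);
    this.name = "ScrapeError";
    this.url = url;
  }
}

// Before: an arrow function that returned `undefined` on failure,
// so every caller had to remember to check for it.
// After: a declared function, exported at declaration, with an explicit
// return type, that throws a descriptive error instead.
export function parseCourse(raw: string, url: string): ScrapedCourse {
  const [code, name] = raw.split(";").map((part) => part.trim());
  if (!code || !name) {
    throw new ScrapeError(`Malformed course entry "${raw}"`, url);
  }
  return { code, name };
}
```

Callers can then rely on the return type and handle failures in one `try`/`catch` instead of scattering `undefined` checks.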

- split the main `run()` function into multiple task functions
- task functions may share data between them using properties of the command object
- tasks should batch updates and run them all in a few queries by using the `*Many()` method variants of lucid models
- implemented simple async semaphores for ratelimiting
- the number of running parallel fetch and DB tasks is limited and can be adjusted using commandline flags
- it actually works now (on my machine)
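
A rough sketch of how a `run()` split into task functions might be structured, with tasks sharing data through properties on the command object and a small wrapper recording timings. This is an assumption-laden illustration, not the PR's code: `ScraperCommandSketch`, its tasks and the sample data are all invented.

```typescript
// Hypothetical sketch: each task is a method, shared state lives on the
// command object, and a wrapper logs progress with per-task timings.
type Task = { name: string; run: () => Promise<void> };

class ScraperCommandSketch {
  // Shared between tasks: filled by one, read by the next.
  departments: string[] = [];
  courseCount = 0;
  readonly log: string[] = [];

  private async fetchDepartments(): Promise<void> {
    this.departments = ["W1", "W2"]; // stand-in for a real fetch
  }

  private async countCourses(): Promise<void> {
    this.courseCount = this.departments.length * 10; // stand-in for real work
  }

  private async runTask(task: Task, index: number, total: number): Promise<void> {
    const started = Date.now();
    await task.run();
    const elapsed = Date.now() - started;
    this.log.push(`[${index + 1}/${total}] ${task.name} done in ${elapsed}ms`);
  }

  async run(): Promise<void> {
    const tasks: Task[] = [
      { name: "fetch departments", run: () => this.fetchDepartments() },
      { name: "count courses", run: () => this.countCourses() },
    ];
    for (let i = 0; i < tasks.length; i++) {
      await this.runTask(tasks[i], i, tasks.length);
    }
  }
}
```

Keeping the task list as data makes it easy to add, reorder, or time individual steps without touching the others.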

today's session of pointless debugging was brought to you by today's sponsor, adonis! do you want your code to absolutely explode every time you attempt to do a bulk SQL action? do you despise the common-sense assumptions, such as the bulk fetch function returning items in the same order as in the list you provided? do you like wasting hours sitting in the debugger, inventing new debugging techniques, such as setting a conditional breakpoint on `Math.random() < 0.001`? then adonis is perfect for you! rewrite your web project in adonis today! use promo code `mini_bomba` to get 50% more pointless debugging for your first rewrite and a free database implosion on your first tests in production!
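
Jokes aside, the ordering complaint points at a defensive pattern: index the returned rows by key instead of relying on their position. A minimal sketch, assuming rows carry the key they were looked up by; `Row`, `alignByKey` and the sample data are hypothetical, and no Lucid API is called here.

```typescript
interface Row {
  id: number;
  code: string;
}

// Re-aligns `rows` (in arbitrary order) to the order of `keys`.
// Throws if a key has no matching row, so mismatches surface early.
function alignByKey(keys: string[], rows: Row[]): Row[] {
  const byCode = new Map<string, Row>();
  for (const row of rows) {
    byCode.set(row.code, row);
  }
  return keys.map((key) => {
    const row = byCode.get(key);
    if (row === undefined) {
      throw new Error(`No row returned for key "${key}"`);
    }
    return row;
  });
}

const requested = ["A", "B", "C"];
// Pretend this came back from a bulk fetch in a different order.
const returned: Row[] = [
  { id: 3, code: "C" },
  { id: 1, code: "A" },
  { id: 2, code: "B" },
];
const aligned = alignByKey(requested, returned);
// aligned now follows the order of `requested`: ids 1, 2, 3
```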

D0dii · 2025-02-15T19:57:04Z

Looks good to me. Can't find any issues

qamarq

it runs nicely now <3

simon-the-shark

I didn't read very closely into the scraping logic itself, but I assume that if it works, it works.

Very nice code, with a touch of crafty elements :P

I left a few nitpicks, but you can just as well merge this if you want

backend/commands/scraper.ts

mini-bomba force-pushed the refactor/scraper branch from eec0ac9 to c88956b on February 15, 2025 15:56

mini-bomba marked this pull request as ready for review February 15, 2025 16:57

mini-bomba force-pushed the refactor/scraper branch from bec1ae2 to 35f6e6c on February 15, 2025 16:59

qamarq requested changes Feb 15, 2025

mini-bomba added 5 commits February 15, 2025 19:39

fix(scraper): eliminate duplicate rows using a constraint

40aa70e

refactor(scraper): rewrite the archive task in raw SQL

a31158c

results in a ~40% speedup in that task (~25s -> ~15s)

ci: bump node version to 22

4505792

i need that Set.difference in my scraper
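
For context, `Set.prototype.difference` is one of the newer built-in set methods available in Node 22. A minimal sketch of what it computes, written as a plain helper so it also runs on older runtimes; the local function `difference` and the sample sets are made up for illustration and are not taken from the scraper.

```typescript
// Returns the elements of `a` that are not in `b`, like a.difference(b).
function difference<T>(a: ReadonlySet<T>, b: ReadonlySet<T>): Set<T> {
  const result = new Set<T>();
  a.forEach((item) => {
    if (!b.has(item)) result.add(item);
  });
  return result;
}

// Hypothetical use: compare what is stored with what was just scraped.
const previouslyStored = new Set(["A", "B", "C"]);
const justScraped = new Set(["B", "C", "D"]);

const stale = difference(previouslyStored, justScraped); // only "A"
const fresh = difference(justScraped, previouslyStored); // only "D"
```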

mini-bomba force-pushed the refactor/scraper branch from 896bfa2 to 4505792 on February 15, 2025 18:39

mini-bomba added 2 commits February 15, 2025 19:40

fix(migrations): keep the lowest ID duplicate group instead of highest

0c7dba0
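
The migration itself is not shown on this page. As an in-memory illustration of the "keep the lowest ID among duplicates" rule only, here is a sketch; `KeyedRow`, `splitDuplicates` and the idea of a single joined `key` are assumptions, and the real fix is done in the database.

```typescript
interface KeyedRow {
  id: number;
  key: string; // whatever columns define "the same" row, joined into one key
}

// For each key, keeps the row with the lowest id and reports the rest.
function splitDuplicates(rows: KeyedRow[]): { keep: KeyedRow[]; drop: KeyedRow[] } {
  const lowest = new Map<string, KeyedRow>();
  for (const row of rows) {
    const current = lowest.get(row.key);
    if (current === undefined || row.id < current.id) {
      lowest.set(row.key, row);
    }
  }
  const keep: KeyedRow[] = [];
  const drop: KeyedRow[] = [];
  for (const row of rows) {
    (lowest.get(row.key) === row ? keep : drop).push(row);
  }
  return { keep, drop };
}
```

Keeping the lowest ID means the oldest row survives, so references created before the duplicates appeared are more likely to stay valid.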

unewMe requested review from D0dii and simon-the-shark February 15, 2025 19:48

mini-bomba requested a review from qamarq February 15, 2025 19:49

mini-bomba force-pushed the refactor/scraper branch from a21da1c to 552b03f on February 15, 2025 20:37

fix(frontend): fix connection to api on dev mode

5008a67

qamarq approved these changes Feb 15, 2025

feat(scraper): vacuum & analyze tables after scrape

8949123

mini-bomba force-pushed the refactor/scraper branch from c87cd09 to 8949123 on February 15, 2025 20:43

simon-the-shark reviewed Feb 15, 2025

refactor(scraper): minor code cleanup as suggested in code review

45d79e8

- replace `() => {return {...};}` with `() => ({...})`
- remove commented-out code
- move utils to their own files in `/app/utils`
- create `'#utils/'` subpath imports for `/app/utils`
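
The first nitpick is a purely syntactic cleanup. A small sketch of the two equivalent forms, using a made-up `Point` type and function names:

```typescript
interface Point {
  x: number;
  y: number;
}

// Before: block body with an explicit return.
const makePointVerbose = (x: number, y: number): Point => {
  return { x, y };
};

// After: expression body. The parentheses stop `{` from being
// parsed as the start of a function block.
const makePoint = (x: number, y: number): Point => ({ x, y });
```

Both functions return the same object; the second form just drops the block and the `return` keyword.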

mini-bomba force-pushed the refactor/scraper branch from aef8424 to 45d79e8 on February 15, 2025 22:14

mini-bomba merged commit 0465768 into main Feb 15, 2025
2 checks passed

mini-bomba deleted the refactor/scraper branch February 15, 2025 22:16

mini-bomba merged 10 commits into main from refactor/scraper
