refactor(scraper): fix, optimize, refactor #207

Merged
mini-bomba merged 10 commits into main from refactor/scraper
Feb 15, 2025

@mini-bomba (Member) commented Feb 14, 2025

this PR makes the following changes to the scraper command:

  • split the main run() function into multiple task functions
  • added fancy progress-tracking logs with per-task timings
  • task functions may share data between them using properties of the command object
  • tasks should batch updates and run them all in a few queries by using the *Many() method variants of Lucid models
  • implemented simple async semaphores for rate limiting
  • the number of fetch and DB tasks running in parallel is limited and can be adjusted using command-line flags
  • it actually works now (on my machine) (except it kinda doesn't) (now it does)
  • reimplemented the archive task in raw SQL
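
The semaphore-based limiting described above could look roughly like the sketch below. This is not the PR's actual code: the `Semaphore` class, its `run()` method, and the `demoSemaphore()` helper are hypothetical names chosen for illustration, and the limit of 3 is an arbitrary stand-in for whatever the command-line flags provide.

```typescript
// Minimal async semaphore (illustrative, not the PR's implementation):
// at most `limit` tasks hold a slot at any time.
class Semaphore {
  private available: number;
  private readonly waiters: Array<() => void> = [];

  constructor(limit: number) {
    if (!Number.isInteger(limit) || limit < 1) {
      throw new RangeError("limit must be a positive integer");
    }
    this.available = limit;
  }

  private async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available -= 1;
      return;
    }
    // No free slot: wait until release() hands one over.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next !== undefined) {
      next(); // pass the slot directly to the next waiter
    } else {
      this.available += 1;
    }
  }

  // Runs `task` once a slot is free and always frees the slot afterwards.
  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}

// Hypothetical usage: 10 fake "fetches", never more than 3 in flight.
async function demoSemaphore(): Promise<{ results: number[]; peak: number }> {
  const semaphore = new Semaphore(3);
  let running = 0;
  let peak = 0;
  const results = await Promise.all(
    Array.from({ length: 10 }, (_, i) =>
      semaphore.run(async () => {
        running += 1;
        peak = Math.max(peak, running);
        await Promise.resolve(); // stand-in for a network request
        running -= 1;
        return i * 2;
      })
    )
  );
  return { results, peak };
}
```

Handing the slot straight to the next waiter in `release()` keeps the counter from briefly exceeding the limit; separate instances could be used for fetch tasks and DB tasks so each has its own cap.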

todo (required before merge):
  • figure out why courses are always being duplicated on each scrape (done)

further optimization possibilities:
  • the archive task could probably be rewritten in raw SQL for better performance (done)

@mini-bomba marked this pull request as ready for review February 15, 2025 16:57

@qamarq (Member) left a comment

The lecturers and the courses are getting out of sync.

- replaced export {} syntax with marking each function as export at
  declaration
- exported all interfaces
- refactored all scrap*() arrow functions to proper functions with
  proper return typing
- added some new interfaces for return types of scrap*() functions
- made all scrap*() functions throw improved errors instead of returning
  undefined
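
As an illustration of the last two points in that commit message (typed return values and throwing instead of returning undefined), here is a minimal sketch. The names `ScrapedCourse`, `ScrapeError`, and `parseCourse` are hypothetical; the real `scrap*()` functions and their interfaces are not shown in this conversation.

```typescript
// Hypothetical return shape; the PR's real interfaces are not shown here.
interface ScrapedCourse {
  id: string;
  name: string;
}

// An error that carries context about what failed, instead of a bare `undefined`.
class ScrapeError extends Error {
  readonly source: string;

  constructor(message: string, source: string) {
    super(`${message} (source: ${source})`);
    this.name = "ScrapeError";
    this.source = source;
  }
}

// A named function with an explicit return type: callers get either a
// ScrapedCourse or an exception, never `undefined`.
function parseCourse(raw: unknown, source: string): ScrapedCourse {
  if (typeof raw !== "object" || raw === null) {
    throw new ScrapeError("expected an object", source);
  }
  const { id, name } = raw as { id?: unknown; name?: unknown };
  if (typeof id !== "string" || typeof name !== "string") {
    throw new ScrapeError("missing `id` or `name`", source);
  }
  return { id, name };
}
```

With this shape, callers no longer need an `if (result === undefined)` check after every call, and a failure reports which source produced it.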
- split the main `run()` function into multiple task functions
- task functions may share data between them using properties of the
  command object
- tasks should batch updates and run them all in a few queries by using
  the `*Many()` method variants of Lucid models
- implemented simple async semaphores for rate limiting
- the number of fetch and DB tasks running in parallel is limited and
  can be adjusted using command-line flags
- it actually works now (on my machine)
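
The task split described in that commit message might be structured along these lines. This is only a sketch: `ScraperCommandSketch`, `runTask()`, and the two task methods are hypothetical names, and the task bodies are stand-ins for real fetching and batched database writes.

```typescript
// Hypothetical stand-in for the scraper command; all names are illustrative.
class ScraperCommandSketch {
  // Shared state: written by one task, read by a later one.
  courseIds: string[] = [];
  savedCount = 0;
  readonly log: string[] = [];

  // Runs one task and records how long it took.
  private async runTask(name: string, task: () => Promise<void>): Promise<void> {
    const startedAt = Date.now();
    this.log.push(`[start] ${name}`);
    await task();
    this.log.push(`[done] ${name} (${Date.now() - startedAt} ms)`);
  }

  private async fetchCoursesTask(): Promise<void> {
    // Stand-in for the network fetches.
    this.courseIds = ["c1", "c2", "c3"];
  }

  private async saveCoursesTask(): Promise<void> {
    // Stand-in for one batched DB write covering every collected row.
    this.savedCount = this.courseIds.length;
  }

  async run(): Promise<void> {
    await this.runTask("fetch courses", () => this.fetchCoursesTask());
    await this.runTask("save courses", () => this.saveCoursesTask());
  }
}
```

Keeping the intermediate data on the command object lets each task stay a plain method with no parameters, at the cost of an implicit ordering between tasks.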
results in a ~40% speedup in that task (~25s -> ~15s)
i need that Set.difference in my scraper
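
The commit message does not say what the set difference is used for; one plausible use in a scraper is comparing stored ids against freshly scraped ones. The sketch below uses a hand-rolled `difference()` helper so it runs anywhere; on runtimes that ship the native `Set.prototype.difference`, `storedIds.difference(scrapedIds)` gives the same result. The ids are made up.

```typescript
// Hand-rolled equivalent of Set.prototype.difference:
// everything in `a` that is not in `b`.
function difference<T>(a: ReadonlySet<T>, b: ReadonlySet<T>): Set<T> {
  const result = new Set<T>();
  a.forEach((item) => {
    if (!b.has(item)) result.add(item);
  });
  return result;
}

// Hypothetical data: ids already stored vs. ids seen in the latest scrape.
const storedIds = new Set(["c1", "c2", "c3"]);
const scrapedIds = new Set(["c2", "c3", "c4"]);

// Ids that disappeared from the source (candidates for archiving).
const noLongerListed = difference(storedIds, scrapedIds);
// Ids that are new in this scrape.
const newlyListed = difference(scrapedIds, storedIds);
```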
today's session of pointless debugging was brought to you by today's
sponsor, adonis!
do you want your code to absolutely explode every time you attempt to do
a bulk SQL action?
do you despise the common-sense assumptions, such as the bulk fetch
function returning items in the same order as in the list you provided?
do you like wasting hours sitting in the debugger, inventing new
debugging techniques, such as setting a conditional breakpoint on
`Math.random() < 0.001`?
then adonis is perfect for you!
rewrite your web project in adonis today!
use promo code `mini_bomba` to get 50% more pointless debugging for your
first rewrite and a free database implosion on your first tests in
production!
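
The ordering complaint above suggests a defensive pattern: never rely on a bulk call returning rows in input order, and re-align them by key instead. The sketch below assumes nothing about which bulk function was involved; `alignByKey()`, `LecturerRow`, and the sample rows are hypothetical.

```typescript
// Re-orders `rows` to match `keys` using a Map lookup, instead of
// assuming the bulk call preserved the input order.
function alignByKey<K, R>(
  keys: readonly K[],
  rows: readonly R[],
  keyOf: (row: R) => K
): R[] {
  const byKey = new Map<K, R>();
  rows.forEach((row) => byKey.set(keyOf(row), row));
  return keys.map((key) => {
    const row = byKey.get(key);
    if (row === undefined) {
      throw new Error(`no row returned for key ${String(key)}`);
    }
    return row;
  });
}

// Hypothetical rows, shuffled the way a bulk call might return them.
interface LecturerRow {
  id: number;
  name: string;
}
const requestedNames = ["Ada", "Bob", "Cyd"];
const returnedRows: LecturerRow[] = [
  { id: 2, name: "Bob" },
  { id: 3, name: "Cyd" },
  { id: 1, name: "Ada" },
];
const alignedRows = alignByKey(requestedNames, returnedRows, (row) => row.name);
```

After alignment, `alignedRows[i]` corresponds to `requestedNames[i]`, so later steps that pair rows by index (for example lecturers with their courses) stay consistent.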
@D0dii (Member) commented Feb 15, 2025

Looks good to me. Can't find any issues.

@qamarq (Member) left a comment

It runs nicely now <3

@simon-the-shark (Member) left a comment

I didn't read very carefully into the scraping logic itself, but I assume that if it works, it works.

Very nice code, with a few artisanal touches :P

I left a few nitpicks, but you can just as well merge this if you want.

- replace () => {return {...};} with () => ({...})
- remove commented-out code
- move utils to their own files in /app/utils
  - create '#utils/' subpath imports for /app/utils
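
For the first item in that commit message, the two arrow-function forms compare as follows. The `Lecturer` type and both function names are made up for illustration.

```typescript
// Hypothetical type used only for this comparison.
interface Lecturer {
  name: string;
  rating: number;
}

// Block body: the braces open a function body, so an explicit `return` is needed.
const toLecturerVerbose = (name: string): Lecturer => {
  return { name, rating: 0 };
};

// Concise body: the parentheses make the braces parse as an object literal.
const toLecturer = (name: string): Lecturer => ({ name, rating: 0 });
```

The parentheses are required in the concise form: without them, `() => { ... }` is parsed as a block body rather than an object literal.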
@mini-bomba merged commit 0465768 into main Feb 15, 2025
2 checks passed
@mini-bomba deleted the refactor/scraper branch February 15, 2025 22:16

4 participants