janbuchar
Contributor

Apify reports CPU utilization as the sum of utilization of all CPUs, but Crawlee expects a number between 0 and 1. Because of this, it is impossible for AutoscaledPool to scale beyond one CPU.
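
The mismatch described above can be illustrated with a minimal sketch. It assumes the platform reports a single value that sums per-CPU utilization, with each CPU contributing between 0 and 1, and that the CPU count is known. The function name `normalize_cpu_utilization` and its parameters are hypothetical and are not taken from Crawlee or the Apify SDK.

```python
def normalize_cpu_utilization(summed_utilization: float, cpu_count: int) -> float:
    """Convert utilization summed over all CPUs into a ratio between 0 and 1.

    Hypothetical helper for illustration only; it assumes each CPU contributes
    a value between 0 and 1 to the sum.
    """
    if cpu_count <= 0:
        raise ValueError("cpu_count must be a positive integer")
    ratio = summed_utilization / cpu_count
    # Clamp to guard against measurement noise pushing the value out of range.
    return min(max(ratio, 0.0), 1.0)


# Two fully busy CPUs out of four: the raw sum is 2.0, the normalized ratio is 0.5.
print(normalize_cpu_utilization(2.0, 4))
```

Without the division, a raw sum such as 2.0 already exceeds the expected 0 to 1 range, so a consumer expecting a ratio would presumably read it as an overloaded CPU even though half of the capacity is idle.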

@janbuchar added the adhoc (ad-hoc unplanned task added during the sprint) and t-tooling (issues owned by the tooling team) labels Mar 26, 2025
@janbuchar requested review from vdusek and Pijukatel March 26, 2025 17:07
@github-actions bot added this to the 111th sprint - Tooling team milestone Mar 26, 2025
@vdusek force-pushed the fix-cpu-utilization-calculation branch from f388d1f to b63a243 March 27, 2025 07:41
Contributor

@vdusek left a comment

LGTM

Contributor

@Pijukatel left a comment

Just something to think about: Does this scaling based on the load of multiple CPUs work well in a Python context?

I think that there we mostly run on one core all the time; I'm not sure whether there are any load-relevant subprocesses. So scaling based on other cores not being utilized does not seem very compatible with the current, mostly single-process Crawlee architecture.

@janbuchar
Contributor Author

> Just something to think about: Does this scaling based on the load of multiple CPUs work well in a Python context?

Well, about the same as in a JavaScript context, really 🤷

> I think that there we mostly run on one core all the time; I'm not sure whether there are any load-relevant subprocesses. So scaling based on other cores not being utilized does not seem very compatible with the current, mostly single-process Crawlee architecture.

Playwright is the most relevant subprocess, I suppose. For HTTP-based crawlers, I agree that we probably won't be able to utilize more than one core. But also, most people probably won't try to run those on units larger than your standard 4 GB one.

@janbuchar merged commit eb4c8e4 into master Mar 27, 2025
27 checks passed
@janbuchar deleted the fix-cpu-utilization-calculation branch March 27, 2025 09:52