
Conversation

@kdichev (Contributor) commented on Jul 27, 2025

This PR adds two simple in-memory caches to speed up the HTML stitching process:

- `sliceCache`: stores the raw HTML of each slice after it's read from disk
- `stitchedSliceCache`: stores the fully-stitched version of each slice to avoid re-processing

With 5,000+ pages being stitched, we were re-reading and re-stitching the same slices many times. Most of our slices are reused heavily, so caching makes a big difference.

On local benchmarks with 5,000 pages, each embedding 8 slices, this reduced stitching time from ~300s to ~80s.
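
Roughly, the caching looks like this (a minimal sketch, not the exact PR diff; `stitchSlice` is a hypothetical stand-in for the real stitching routine):

```ts
import { readFile } from "node:fs/promises"

const sliceCache = new Map<string, string>()
const stitchedSliceCache = new Map<string, string>()

async function getSliceHtml(sliceHtmlPath: string): Promise<string> {
  const cached = sliceCache.get(sliceHtmlPath)
  if (cached !== undefined) return cached

  // Hit the disk only the first time a given slice is requested
  const html = await readFile(sliceHtmlPath, "utf-8")
  sliceCache.set(sliceHtmlPath, html)
  return html
}

async function getStitchedSliceHtml(
  sliceHtmlPath: string,
  stitchSlice: (html: string) => string
): Promise<string> {
  const cached = stitchedSliceCache.get(sliceHtmlPath)
  if (cached !== undefined) return cached

  // Stitch once, then reuse the result for every page that embeds this slice
  const stitched = stitchSlice(await getSliceHtml(sliceHtmlPath))
  stitchedSliceCache.set(sliceHtmlPath, stitched)
  return stitched
}
```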

Next steps:
Introducing workers improves cold build speeds from ~300s to ~35s.

@gatsbot (bot) added the "status: triage needed" label on Jul 27, 2025
@kdichev (Contributor, Author) commented on Jul 27, 2025

From my perspective, I’m working with a fairly large Gatsby site, around 5,000 active pages (originally 25,000, but I’ve removed outdated content). Each page includes 8 unique slices.

Before making changes, the slice processing step alone took about 300 seconds, which was surprisingly close to the time required for optimizing 6,000 images. I reviewed the code and introduced some performance improvements that brought the slice processing time down from ~300s to ~80s.

I also tested whether the regex logic was a bottleneck, but replacing it with a custom parser didn’t yield any speed gains. However, moving the slice queue from fastq to worker threads brought cold build time down further to 35s, and hot builds (with slice changes) to around 50s.

I noticed that my machine’s build resources weren’t being fully utilized before, but after introducing workers, usage hit full capacity.
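
For reference, a rough sketch of the worker-based version (not part of this PR; the worker file name and chunking are illustrative only):

```ts
import { Worker } from "node:worker_threads"
import * as os from "node:os"

function stitchPagesInWorkers(pagePaths: Array<string>): Promise<void[]> {
  const workerCount = Math.max(1, os.cpus().length - 1)
  const chunkSize = Math.ceil(pagePaths.length / workerCount)

  const tasks = Array.from({ length: workerCount }, (_, i) => {
    const chunk = pagePaths.slice(i * chunkSize, (i + 1) * chunkSize)
    return new Promise<void>((resolve, reject) => {
      // "./stitch-worker.js" would read, stitch and write its chunk of pages
      const worker = new Worker("./stitch-worker.js", { workerData: { chunk } })
      worker.once("error", reject)
      worker.once("exit", code =>
        code === 0 ? resolve() : reject(new Error(`worker exited with ${code}`))
      )
    })
  })

  return Promise.all(tasks)
}
```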

This could be a solid further general improvement. Let me know if this is something you'd consider viable, and I'll be happy to add it.

@serhalp added the "topic: performance" label on Aug 4, 2025
@serhalp added the "status: needs core review" and "topic: core" labels and removed the "status: triage needed" label on Aug 4, 2025
@pieh (Contributor) commented on Aug 7, 2025

This is overall a reasonable change. The main thing I worry about here is that the cached content is strongly held in memory, and with the setup as-is that might result in out-of-memory errors that didn't happen before, given a sufficiently large number of slice variants and/or sufficiently large slice content.

As this is an optimization attempt and the source of the data is still on disk, I think some kind of lru-cache OR wrapping the content in a WeakRef would be advised, to protect against unbounded growth of strongly referenced content that would prevent the allocated memory from ever being reclaimed under memory pressure.
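
For example, a rough sketch of the WeakRef variant (illustrative only, assuming the caller falls back to re-reading the file when the entry has been collected; an lru-cache with a max size would work similarly):

```ts
// Strong keys, weakly held values: V8 may reclaim cached slice HTML under
// memory pressure, and we simply re-read it from disk on the next miss.
const sliceCache = new Map<string, WeakRef<{ html: string }>>()

function getCachedSliceHtml(sliceHtmlPath: string): string | undefined {
  const entry = sliceCache.get(sliceHtmlPath)?.deref()
  if (!entry) {
    // Value was garbage-collected (or never cached); caller re-reads the file
    sliceCache.delete(sliceHtmlPath)
    return undefined
  }
  return entry.html
}

function cacheSliceHtml(sliceHtmlPath: string, html: string): void {
  // Strings can't be WeakRef targets directly, so wrap them in an object
  sliceCache.set(sliceHtmlPath, new WeakRef({ html }))
}
```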

@kdichev (Contributor, Author) commented on Aug 13, 2025

@pieh This is reasonable feedback, thanks for noting the memory issue! I honestly didn’t think about that at all. I’ve got a fairly powerful machine, so I guess it hasn’t been high on my list of concerns, heh. I’ll work on the suggestions and see how it goes. If there’s already something similar in the repo, and it’s not a burden, I’d appreciate a link so I can take inspiration and stay aligned with accepted practices here.
