
Optimize Dictionary Import Memory Use#2319

Open
Casheeew wants to merge 11 commits into master from perf-optimize-import-memory

Conversation

Casheeew (Member) commented Feb 25, 2026

Loading all term files into memory at once causes massive memory usage, which can slow Chrome down and sometimes cause issues such as #1420.

This PR instead achieves constant memory usage via a streaming pattern: it pipes zip.js' getData() stream into a custom JSON parser, which emits onBatch events whenever maxTransactionLength entries have been parsed. This effectively halves memory usage even when the term files are small (1000 terms per term bank), and caps peak memory at about 40 MB.
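A minimal sketch of the batching half of this pattern, assuming entries arrive one at a time from a streaming parser. The name `BatchEmitter` is hypothetical, not the actual PR code; in the real implementation the entries come from a custom JSON parser fed by zip.js' `getData()` stream.

```javascript
// Sketch: buffer streamed entries and flush them in fixed-size batches
// so that at most maxTransactionLength entries are held in memory at once.
class BatchEmitter {
    constructor(maxTransactionLength, onBatch) {
        this.maxTransactionLength = maxTransactionLength;
        this.onBatch = onBatch;
        this.buffer = [];
    }

    // Called by the streaming parser for each parsed entry.
    push(entry) {
        this.buffer.push(entry);
        if (this.buffer.length >= this.maxTransactionLength) {
            this.flush();
        }
    }

    // Emit whatever is buffered; must also be called at end of stream.
    flush() {
        if (this.buffer.length > 0) {
            this.onBatch(this.buffer);
            this.buffer = [];
        }
    }
}

// Usage: 2500 streamed entries with a batch size of 1000
// produce batches of 1000, 1000, and 500 entries.
const batches = [];
const emitter = new BatchEmitter(1000, (batch) => batches.push(batch.length));
for (let i = 0; i < 2500; i++) {
    emitter.push(['term', 'reading']);
}
emitter.flush();
// batches is now [1000, 1000, 500]
```

Note the trailing `flush()`: the last partial batch is only emitted when the stream signals completion, which is why end-of-stream handling matters in this pattern.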

Works flawlessly even when combining the full Pixiv dictionary into one enormous term bank.

Based on my testing, there is a negligible difference in import time; this is slightly faster due to less GC.

Maybe a partial fix to the many "Unknown error" issues... #536

Casheeew requested a review from a team as a code owner February 25, 2026 13:26
Casheeew (Member, Author) commented:

@codex review

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1694dafcd

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Casheeew (Member, Author) commented:

@codex review

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00ca6997b5


chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1981762c53


Casheeew and others added 4 commits February 25, 2026 23:50
The test directories were gitignored by the `dictionaries/` rule
and needed to be force-added, matching how existing fixtures are tracked.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Casheeew (Member, Author) commented:

Should be good to go. We don't have to cover every edge case: realistically, dictionaries are for the most part generated by code, so there is little reason to check every form of malformed JSON. The main check is the AJV layer; if a dictionary passes that, malformed data should not break anything for the user.
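To illustrate the point about the validation layer being the gate: the sketch below uses a hand-rolled predicate in place of the real check (Yomitan validates term banks with AJV against a JSON schema; `isValidTermEntry` and the entry shape shown are illustrative assumptions, not the project's actual code).

```javascript
// Stand-in for the schema gate: reject entries before import so the
// streaming importer never has to defend against malformed data itself.
function isValidTermEntry(entry) {
    // Assume an entry is an array beginning with [expression, reading, ...].
    return Array.isArray(entry) &&
        entry.length >= 2 &&
        typeof entry[0] === 'string' &&
        typeof entry[1] === 'string';
}

// One well-formed entry, one truncated entry, one wrong type entirely.
const entries = [
    ['猫', 'ねこ', 'n', '', 1, ['cat'], 0, ''],
    ['truncated'],
    42,
];
const validEntries = entries.filter(isValidTermEntry);
// Only the first, well-formed entry survives the gate.
```

The design point is that validation happens once, up front, so the batch/import path can assume well-formed input.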

Casheeew (Member, Author) commented:

@codex review

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fdf7851b2b


Kuuuube (Member) commented Feb 26, 2026

> Based on my testing, there is a negligible difference in import time; this is slightly faster due to less GC.

Can you give a few timed tests of imports and the peak memory usage you're seeing before and after?

Casheeew (Member, Author) commented:

Okay, I tested again. For dictionaries whose term files have 1000 entries each, performance is comparable (both versions load 1000 entries into memory at a time; I probably got the impression it was halved from looking at the wrong tab).
Import time has quite high variance (a few seconds for EN-KRDICT, >10 s for Pixiv).

PixivFull before (210.54s):
[screenshot: validation step]
[screenshot: import step]

PixivFull after (203.00s):
[screenshot: validation step]
[screenshot: import step]

For EN-KRDICT, where each term file has a different number of entries, memory use is effectively halved:

Before (12.41s): [screenshot]
After (13.20s): [screenshot]

Bonus: Pixiv with a single term bank (700k+ entries)

Before: ∞ (import error)
After: 205.44s
[screenshot: validation step]
[screenshot: import step]

Basically equivalent to when it is split into 1000-entry term banks.
