
Conversation

@neiljohari (Contributor) commented Sep 16, 2025

Summary

  • Fix a mismatch between DiB_fileStats() and DiB_loadFiles() that could drive totalSizeToLoad negative when invalid inputs are present (e.g., broken symlinks, disappearing files, IO errors).
  • When totalSizeToLoad goes negative, it later gets cast to size_t and wraps to a huge value, leading to large allocation sizes and “not enough memory for DiB_trainFiles”.
  • In this PR, we align DiB_fileStats() with DiB_loadFiles(): both now skip non-positive file sizes.
  • See the temporary debug logging and minimal repro in 96fdb9b, which demonstrate the issue clearly; the fix is in 85f4a7e.

To be 100% transparent, I think you need a pretty pathological case to see crashes -- to the point where there's something severely wrong with your sample dir. I probably wouldn't put up a PR for this normally, but since it's possible to not crash and still get slightly incorrect totals, I thought it was worth it.

Motivation and background

While investigating dictionary training hanging on empty files, I found that my environment (conda-forge zstd labeled v1.5.2) didn't include the upstream fix for the empty-file infinite loop (#3081). It took me a while to figure that out, and while investigating I looked more broadly at dibio.c and noticed this possible issue.

The bug fixed in this PR did not impact me (it's unrelated to my empty-file hang); I just happened to notice it via code inspection.

Root cause

  • DiB_getFileSize() returns:
    • file size for regular files
    • 0 for empty files
    • -1 for non-regular/unreadable files (e.g., broken symlinks, ENOENT, EACCES, transient IO errors)
  • DiB_loadFiles() correctly skips files with fileSize <= 0.
  • DiB_fileStats(), however, only skipped fileSize == 0. It counted negative sizes as samples and added MIN(fileSize, SAMPLESIZE_MAX) to totalSizeToLoad. For fileSize == -1, this subtracts 1 from totalSizeToLoad per invalid entry and inflates the expected sample count (see the sketch below).
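
A minimal sketch of the tallying logic with the fix applied (simplified, not the exact dibio.c code; the SAMPLESIZE_MAX value and the helper's shape here are illustrative assumptions):

```c
#include <stdint.h>

#define SAMPLESIZE_MAX ((int64_t)128 * 1024)  /* illustrative cap, not necessarily the real value */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* fileSizes[] stands in for per-file DiB_getFileSize() results:
 * >0 regular file, 0 empty file, -1 broken symlink / ENOENT / IO error. */
static int64_t totalSizeToLoad(const int64_t* fileSizes, unsigned nbFiles)
{
    int64_t total = 0;
    unsigned n;
    for (n = 0; n < nbFiles; n++) {
        /* Before the fix, DiB_fileStats() only skipped == 0, so each -1
         * entry subtracted 1 from the total and still counted as a sample.
         * Skipping <= 0 matches DiB_loadFiles(). */
        if (fileSizes[n] <= 0) continue;
        total += MIN(fileSizes[n], SAMPLESIZE_MAX);
    }
    return total;
}
```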

Impact

With enough invalid entries in the training set (or just a few if the valid samples are small), totalSizeToLoad can go negative.
The memory-sizing logic casts the negative total to size_t, wrapping it to a massive value. This can result in:

  • malloc(huge) → failure → EXM_THROW(12) “not enough memory for DiB_trainFiles”, or
  • minor-to-large drift between the intended and actual allocation sizes, depending on the exact inputs.

Even when it doesn't crash, this mismatch can under- or over-allocate sample memory. I'm not an expert in the trainer tool though, so I'm not sure whether that really matters or whether it could make the trained dictionaries suboptimal.
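
As a self-contained illustration of the wrap (assuming a 64-bit size_t; the value matches the -970 case in the debug output below):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t totalSizeToLoad = -970;               /* e.g. 5 tiny valid samples plus ~1000 missing files */
    size_t loadedSize = (size_t)totalSizeToLoad;  /* conversion is modulo 2^64 on 64-bit targets */
    printf("loadedSize = %zu\n", loadedSize);     /* prints 18446744073709550646 */
    return 0;
}
```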

Reproduction (added in 96fdb9b)

Minimal setup: a handful of small “good” files plus many nonexistent paths passed on the command line (nonexistent paths bypass symlink filtering).

The temporary debug logs show negative file sizes being counted by DiB_fileStats(), final totals going negative, and the resulting huge loadedSize after casting.

Before (current behavior):

With 5 valid samples plus many nonexistent files:

[DEBUG FINAL] fileStats: nbSamples=1005, totalSizeToLoad=-970 (NEGATIVE!)
[DEBUG] Memory calc: totalSizeToLoad=-970, maxMem=0, loadedSize=18446744073709550646 (0xfffffffffffffc36)
[BUG] totalSizeToLoad is NEGATIVE! This will cause allocation issues!
[DEBUG] About to malloc: srcBuffer size=18446744073709550678, sampleSizes array size=8040
Error 12 : not enough memory for DiB_trainFiles

After (this change, 85f4a7e):

Same scenario, after skipping fileSize <= 0 in DiB_fileStats():
[DEBUG FINAL] fileStats: nbSamples=5, totalSizeToLoad=30 (ok)
[DEBUG] Memory calc: totalSizeToLoad=30, maxMem=8388608, loadedSize=30 (0x1e)
[DEBUG] About to malloc: srcBuffer size=62, sampleSizes array size=40

The “before” output produces a negative total and a huge loadedSize, ending in EXM_THROW(12).
The “after” output shows consistent nbSamples, sane totals, and modest allocations, proceeding to training as expected.

@meta-cla (bot) commented Sep 16, 2025

Hi @neiljohari!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@Cyan4973 self-assigned this Sep 16, 2025
@Cyan4973 (Contributor) left a comment

Indeed, DiB_getFileSize() returns -1 when the file doesn't exist.
Great work @neiljohari !

@Cyan4973 (Contributor) commented

Note @neiljohari that signing the CLA is required before getting the code merged.
It's only needed once, and is then also valid for any other Meta open-source project.

@neiljohari (Contributor, Author) commented

> Note @neiljohari that signing the CLA is required before getting the code merged. It's only needed once, and is then also valid for any other Meta open-source project.

Thanks Yann! Going through our internal process for approval, will be done shortly 😄

meta-cla bot added the CLA Signed label Sep 20, 2025
@Cyan4973 merged commit e6e5a95 into facebook:dev Sep 20, 2025
104 checks passed