
Conversation

@neiljohari (Contributor) commented Sep 16, 2025

Summary

  • Fix a mismatch between DiB_fileStats() and DiB_loadFiles() that could drive totalSizeToLoad negative when invalid inputs are present (e.g., broken symlinks, disappearing files, IO errors).
  • When totalSizeToLoad goes negative, it later gets cast to size_t and wraps to a huge value, leading to large allocation sizes and “not enough memory for DiB_trainFiles”.
  • In this PR, we align DiB_fileStats() with DiB_loadFiles(): both now skip non-positive file sizes.
  • See the temporary debug logging and minimal repro in 96fdb9b, which demonstrate the issue clearly; the fix is in 85f4a7e.

To be 100% transparent, I think you need a pretty pathological case to see crashes -- to the point where there's something severely wrong with your sample dir. I probably wouldn't put up a PR for this normally, but since it's possible to not crash and still get slightly incorrect totals, I thought it was worth it.

Motivation and background

While investigating dictionary training hanging on empty files, I found that my environment (conda-forge zstd labeled v1.5.2) didn't include the upstream fix for the empty-file infinite loop (#3081). It took me a while to figure that out, and while investigating I looked more broadly at dibio.c and noticed this possible issue.

The bug fixed in this PR did not impact me (it's unrelated to my empty-file hang); I just happened to notice it via code inspection.

Root cause

  • DiB_getFileSize() returns:
    • file size for regular files
    • 0 for empty files
    • -1 for non-regular/unreadable files (e.g., broken symlinks, ENOENT, EACCES, transient IO errors)
  • DiB_loadFiles() correctly skips files with fileSize <= 0.
  • DiB_fileStats(), however, only skipped fileSize == 0. It counted negative sizes as samples and added MIN(fileSize, SAMPLESIZE_MAX) to totalSizeToLoad. For fileSize == -1, this subtracts 1 from totalSizeToLoad per invalid entry and inflates the expected sample count (see the sketch below).
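
A minimal sketch of the tallying logic with the fix applied (simplified, not the exact dibio.c code; the SAMPLESIZE_MAX value and the helper's shape here are illustrative assumptions):

```c
#include <stdint.h>

#define SAMPLESIZE_MAX ((int64_t)128 * 1024)  /* illustrative cap, not necessarily the real value */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* fileSizes[] stands in for per-file DiB_getFileSize() results:
 * >0 regular file, 0 empty file, -1 broken symlink / ENOENT / IO error. */
static int64_t totalSizeToLoad(const int64_t* fileSizes, unsigned nbFiles)
{
    int64_t total = 0;
    unsigned n;
    for (n = 0; n < nbFiles; n++) {
        /* Before the fix, DiB_fileStats() only skipped == 0, so each -1
         * entry subtracted 1 from the total and still counted as a sample.
         * Skipping <= 0 matches DiB_loadFiles(). */
        if (fileSizes[n] <= 0) continue;
        total += MIN(fileSizes[n], SAMPLESIZE_MAX);
    }
    return total;
}
```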

Impact

With enough invalid entries in the training set (or just a few if the valid samples are small), totalSizeToLoad can go negative.
The memory-sizing logic casts the negative total to size_t, wrapping it to a massive value. This can result in:

  • malloc(huge) → failure → EXM_THROW(12) “not enough memory for DiB_trainFiles”, or
  • minor-to-large drift between the intended and actual allocation sizes, depending on the exact inputs.

Even when it doesn't crash, this mismatch can under- or over-allocate sample memory. I'm not an expert in the trainer tool though, so I'm not sure whether that really matters or whether it could make the trained dictionaries suboptimal.
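
As a self-contained illustration of the wrap (assuming a 64-bit size_t; the value matches the -970 case in the debug output below):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t totalSizeToLoad = -970;               /* e.g. 5 tiny valid samples plus ~1000 missing files */
    size_t loadedSize = (size_t)totalSizeToLoad;  /* conversion is modulo 2^64 on 64-bit targets */
    printf("loadedSize = %zu\n", loadedSize);     /* prints 18446744073709550646 */
    return 0;
}
```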

Reproduction (added in 96fdb9b)

Minimal setup: a handful of small “good” files plus many nonexistent paths passed on the command line (nonexistent paths bypass symlink filtering).

The temporary debug logs show negative file sizes being counted by DiB_fileStats(), final totals going negative, and the resulting huge loadedSize after casting.

Before (current behavior):

With 5 valid samples plus many nonexistent files:

[DEBUG FINAL] fileStats: nbSamples=1005, totalSizeToLoad=-970 (NEGATIVE!)
[DEBUG] Memory calc: totalSizeToLoad=-970, maxMem=0, loadedSize=18446744073709550646 (0xfffffffffffffc36)
[BUG] totalSizeToLoad is NEGATIVE! This will cause allocation issues!
[DEBUG] About to malloc: srcBuffer size=18446744073709550678, sampleSizes array size=8040
Error 12 : not enough memory for DiB_trainFiles

After (this change, 85f4a7e):

Same scenario, after skipping fileSize <= 0 in DiB_fileStats():
[DEBUG FINAL] fileStats: nbSamples=5, totalSizeToLoad=30 (ok)
[DEBUG] Memory calc: totalSizeToLoad=30, maxMem=8388608, loadedSize=30 (0x1e)
[DEBUG] About to malloc: srcBuffer size=62, sampleSizes array size=40

The “before” output produces a negative total and a huge loadedSize, ending in EXM_THROW(12).
The “after” output shows consistent nbSamples, sane totals, and modest allocations, proceeding to training as expected.

@meta-cla (bot) commented Sep 16, 2025

Hi @neiljohari!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@Cyan4973 self-assigned this Sep 16, 2025
@Cyan4973 (Contributor) left a comment

Indeed, DiB_getFileSize() returns -1 when the file doesn't exist.
Great work @neiljohari !

@Cyan4973 (Contributor) commented

Note @neiljohari that signing the CLA is required before getting the code merged.
It's only needed once, and is then also valid for any other Meta open-source project.

@neiljohari (Contributor, Author) commented

> Note @neiljohari that signing the CLA is required before getting the code merged. It's only needed once, and is then also valid for any other Meta open-source project.

Thanks Yann! Going through our internal process for approval, will be done shortly 😄

meta-cla bot added the CLA Signed label Sep 20, 2025
@Cyan4973 merged commit e6e5a95 into facebook:dev Sep 20, 2025
104 checks passed